1 Introduction

Stochastic gradient descent (SGD) is a popular iterative optimization algorithm that has been widely used in machine learning systems. With growing data volumes, SGD algorithms have to access data stored on secondary storage instead of in main memory. There are two prominent scenarios: (1) in-database machine learning (in-DB ML) systems have to access the data tables stored on secondary storage via the buffer manager [1]; (2) deep learning (DL) systems such as TensorFlow [2] and PyTorch [3] need specialized data loaders/scanners to access datasets stored in parallel/distributed file systems.

In-database ML and deep learning systems In-DB ML systems and deep learning systems have been extensively studied for many years [4,5,6,7,8,9,10,11,12]. The major benefit of in-DB ML is that users do not need to move the data out of the database to another specialized ML platform, which is often time-consuming or even infeasible (due to privacy or security concerns). With the help of in-DB ML systems such as MADlib [4, 13] and Bismarck [5], users can train an ML model (e.g., SVM) using a simple SQL query as follows:

figure a

Deep learning systems such as PyTorch and TensorFlow usually provide users with simple Dataset/DataLoader APIs to load data from secondary storage into memory and further into GPUs, as shown in the following lines of code. The deep learning system can then automatically perform model training in train() with SGD, using a number of GPUs.

figure b
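For concreteness, the following is a minimal sketch of this pattern (our own, not the paper’s actual listing shown above); the MyDataset class, the file train.pt, and the train() helper are illustrative assumptions.

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):                          # hypothetical map-style dataset
    def __init__(self, path):
        self.samples = torch.load(path)            # e.g., a list of (features, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def train(model, loader, epochs=10, lr=0.01):      # illustrative SGD training loop
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in loader:            # mini-batch SGD over the loaded tuples
            optimizer.zero_grad()
            loss_fn(model(features), labels).backward()
            optimizer.step()

loader = DataLoader(MyDataset("train.pt"), batch_size=64, shuffle=True, num_workers=2)
# train(my_model, loader) would then run SGD, possibly on one or more GPUs.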

A Fundamental Discrepancy A fundamental problem is that SGD requires a random data order to converge, but the data are usually not guaranteed to be stored in a random order, for both in-DB ML and deep learning systems. As identified by previous work [5, 14, 15], the worst case is that the data are stored in a clustered order. For example, if the data are clustered by labels, data with negative labels might always come before data with positive labels [5]. Another example is that the data are ordered/clustered by one of the features. Such cases are common in practice: the data may be naturally ordered by features such as timestamps, usernames/types, or item prices, or there may be a clustered B-tree index on a subset of the feature columns or on the label column (if the data is stored in a relational database). In these cases, directly running sequential scans over the clustered data can slow down the convergence of SGD.

A common solution is to perform a full data shuffle on the original data. However, when data are stored on block-addressable secondary storage such as HDD and SSD, it can be extremely time-consuming to either randomly access the data during SGD, or shuffle the data once into a copy and run SGD over the shuffled copy, due to massive random I/O’s. For example, shuffling a 50GB dataset in PostgreSQL using ‘ORDER BY RANDOM()’ took about 50 mins in our experiments, and shuffling a large dataset used for scalability tests in a database did not finish even in one day, as reported by previous work [5]. Moreover, it is sometimes infeasible to shuffle the data in the database — in-place shuffling might affect other indices, whereas shuffling over a data copy leads to \(2\times \) storage overhead. Likewise, parallel/distributed file systems such as HDFS and Lustre [16] do not support (or recommend) random access to small data tuples, which significantly degrades I/O performance. How can efficient SGD algorithms be designed without requiring even a single pass of full data shuffle? Understanding this question can have a profound impact on the system design of both in-DB ML and deep learning systems.

Fig. 1 The convergence rate and performance of SVM on the higgs dataset clustered by labels, with different data shuffling strategies. a Today’s ML systems, including in-DB ML systems (e.g., MADlib and Bismarck) and TensorFlow, are sensitive to the data order. b Forcing a full data shuffle before training accommodates this clustered data issue, but introduces large overhead that is often more expensive than training itself

Existing Landscape and Challenges To solve the data shuffling problem of SGD, previous work has proposed several data shuffling strategies in the context of in-DB ML or deep learning systems. TensorFlow adopts a sliding-window-based shuffling strategy, which constantly loads data into a buffer and randomly fetches data from the buffer for SGD [17]. Bismarck [5] proposes a “multiplexed reservoir sampling” (MRS) shuffling strategy, which leverages two threads to update the model concurrently. One thread reads the data sequentially with reservoir sampling, while the other thread reads data from a small in-memory buffer filled with the sampled data. Although these strategies improve the I/O performance, they suffer from convergence problems. As demonstrated in Fig. 1a, the strategies proposed by Bismarck and TensorFlow both suffer from lower accuracy on clustered data. In contrast, shuffling the data once before training, i.e., the curve corresponding to “MADlib/Bismarck (Shuffle Once)”, avoids this convergence problem but introduces a significant overhead, as shown in Fig. 1b.

Our Contributions Inspired by these previous efforts, we ask the following questions in this paper:

Can we design an SGD-style algorithm with efficient data shuffling strategy that can converge without requiring a full data shuffle? Can we provide a rigorous theoretical analysis on the convergence behavior of such an algorithm? Further, can we integrate such an algorithm into database systems as well as deep learning systems?

In this paper, we systematically study these questions and make the following contributions.

C1. An Anatomy and Empirical Study of Existing Algorithms. We first conduct a systematic empirical study of existing data shuffling strategies for SGD, including (1) Epoch Shuffle, which performs a full shuffle before each epoch, (2) Shuffle Once, (3) No Shuffle, (4) Sliding-Window Shuffle, and (5) MRS Shuffle. We compare them by using SGD to train generalized linear models and deep learning models, over both label-clustered and feature-ordered datasets. Our study reveals that existing strategies cannot simultaneously achieve good hardware efficiency (i.e., I/O performance) and statistical efficiency (i.e., convergence rate and converged accuracy). Specifically, Epoch Shuffle and Shuffle Once achieve the best statistical efficiency in terms of the convergence rate of SGD, since the data have been fully shuffled; however, their hardware efficiency is suboptimal due to the additional shuffle and storage overhead. In contrast, No Shuffle achieves the best hardware efficiency as no data shuffle is required; however, its statistical efficiency suffers as it yields the lowest accuracy. The other two strategies, Sliding-Window Shuffle and MRS Shuffle, act as a compromise between Shuffle Once and No Shuffle, but still suffer in terms of statistical efficiency (Sect. 3).

C2. A Simple but Novel Algorithm with Rigorous Theoretical Analysis. To address the limitations of existing strategies, we propose CorgiPile, a novel SGD-style algorithm with a two-level hierarchical data shuffling strategy. The main idea is to first sample and shuffle the data at the block level, and then shuffle the data at the tuple level within the sampled data blocks. That is, we first randomly sample data blocks (e.g., one block refers to a batch of table pages in a database) into a buffer, and then shuffle the tuples from all the blocks in the buffer for SGD. While this two-level strategy seems simple, it can achieve both good hardware efficiency and statistical efficiency. The hardware efficiency is intuitive—randomly accessing data blocks is much faster than randomly accessing small tuples, especially when the block size is large. However, the statistical efficiency requires some non-trivial analysis. To this end, we further present a rigorous theoretical study of the convergence behavior of CorgiPile.

C3. Implementation, Optimization, and Deep Integration with PostgreSQL. For in-DB ML, we aim to integrate CorgiPile with PostgreSQL, which requires careful design, implementation, and optimization. Unlike previous in-DB ML systems such as MADlib and Bismarck that integrate ML algorithms using user-defined aggregates (UDAs), our technique requires a deeper system integration since it needs to directly interact with the buffer manager and pages. Therefore, we operate at the “physical level” and enable in-DB ML inside PostgreSQL [18] via three new physical operators: a BlockShuffle operator, a TupleShuffle operator, and an SGD operator for our customized SGD implementation. We can then construct an execution plan for the SGD computation by chaining these operators together to form a pipeline, which naturally follows the built-in Volcano query execution paradigm [19] of PostgreSQL. We also design a double-buffering mechanism to optimize the TupleShuffle operator, reducing the data copy and shuffle overhead.

C4. Multi-Process CorgiPile and Integration with PyTorch. Today’s deep learning systems usually work in a parallel/distributed environment with multiple processes and GPUs. To adapt to this environment, we extend single-process CorgiPile to multi-process CorgiPile by enhancing the tuple-level shuffle. The multi-process CorgiPile also contains three operators: BlockShuffle, TupleShuffle, and SGD. For block-level shuffling, we randomly distribute data blocks to different processes. For tuple-level shuffling, we use multi-buffer-based shuffling instead of single-buffer-based shuffling—in each process we allocate a local buffer to read blocks and shuffle their tuples. The SGD operator performs mini-batch SGD and synchronizes the computation of gradients/parameters among the processes for each batch. We demonstrate that multi-process CorgiPile generates a random data order similar to that of single-process CorgiPile. We further integrate multi-process CorgiPile into PyTorch and wrap it as a new CorgiPileDataset API for ease of use.

C5. Comprehensive Empirical Evaluations. We perform comprehensive evaluations to demonstrate the hardware efficiency and statistical efficiency of CorgiPile. For in-DB ML, we compare our PostgreSQL implementation with two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, in terms of both convergence rate and end-to-end performance. The results show that CorgiPile achieves comparable model accuracy to the best Shuffle Once baseline on both label-clustered and feature-ordered datasets. Meanwhile, CorgiPile achieves a 1.6\(\times \)–12.8\(\times \) speedup over MADlib and Bismarck, since it does not require a full data shuffle. In contrast, other strategies suffer from a lower convergence rate or lower accuracy. For deep learning systems, we compare CorgiPile with other shuffling strategies in PyTorch using deep learning models for image classification. The results are similar to those in PostgreSQL—CorgiPile in PyTorch again achieves model accuracy similar to the (best) Shuffle Once baseline, while other data shuffling strategies result in lower accuracy. Specifically, on ImageNet, CorgiPile converges 1.5\(\times \) faster than Shuffle Once using 8 GPUs.

This paper is an extension of our previous work [20]. Our new contributions beyond [20] are the following:

  • To enable CorgiPile to work in a parallel/distributed environment, we extend the previous single-process CorgiPile to multi-process CorgiPile (Sect. 6). We further implement multi-process CorgiPile in PyTorch, by enhancing two operators BlockShuffle and TupleShuffle and wrapping them as a new CorgiPileDataset API.

  • We demonstrate that multi-process CorgiPile can obtain random data order similar to that of single-process CorgiPile for mini-batch SGD (Sect. 6.3).

  • We evaluate the multi-process CorgiPile with deep learning models for image classification (Sect. 7.3). On ImageNet [21], CorgiPile achieves comparable model accuracy to the (best) Shuffle Once baseline but is 1.5\(\times \) faster to converge.

  • We expand the evaluation in [20] with datasets ordered by subsets of features. The new results reinforce the finding that CorgiPile is comparable with Shuffle Once, whereas the other approaches suffer from lower accuracy and/or a lower convergence rate.

  • We extend the evaluation in [20] by comparing the convergence rates of different shuffling strategies for linear models such as LR and SVM trained with mini-batch SGD.

Paper Organization We first review the SGD algorithm and its implementation in Sect. 2. We next perform an empirical study on the existing data shuffling strategies for SGD in Sect. 3. We then present our CorgiPile strategy and provide a theoretical analysis on its convergence in Sect. 4. We detail our implementation of CorgiPile inside PostgreSQL in Sect. 5. We present the multi-process CorgiPile and its implementation within PyTorch in Sect. 6. We compare the end-to-end performance and convergence rate of CorgiPile with other baseline approaches in Sect. 7. We summarize related work in Sect. 8 and conclude in Sect. 9.

2 Preliminaries

In this section, we briefly review the standard SGD algorithm and its implementation in state-of-the-art in-DB ML and deep learning systems.

2.1 Stochastic gradient descent (SGD)

Given a dataset with m training examples \(\{{\textbf{t}}_i\}_{i\in [m]}\), i.e., m tuples if these examples are stored as a table in a database, typical ML tasks essentially solve an optimization problem of minimizing a finite sum over m training examples with respect to model \({\textbf{x}}\).

$$\begin{aligned} F({\textbf{x}}) = \frac{1}{m}\sum \limits _{i=1}^m f_i({\textbf{x}}) \end{aligned}$$

Here each \(f_i\) corresponds to the loss over each training tuple \({\textbf{t}}_i\). SGD is an iterative algorithm that takes as input hyperparameters such as the learning rate \(\eta \) and the maximum number of epochs S. It works as follows:

  1. Initialization—Initialize the parameters of model \({\textbf{x}}\), often randomly or as zero.

  2. Iterative computation—In each iteration, it draws a (batch of) tuple \({\textbf{t}}_i\), randomly with replacement, computes the stochastic gradient \(\nabla f_i({\textbf{x}})\) and updates the parameters of model \({\textbf{x}}\). In practice, most systems implement a more efficient variant, where the random tuples are drawn without replacement [5, 22,23,24]. To achieve this, SGD shuffles all tuples before each epoch and sequentially scans these shuffled tuples. For each tuple, SGD computes the stochastic gradient and updates the model parameters (see the sketch after this list).

  3. Termination—The iterative computation ends when it converges (i.e., the parameters of model \({\textbf{x}}\) no longer change) or has attained the pre-defined maximum number of epochs.
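As a concrete illustration of steps 1–3, the following minimal sketch (our own, not taken from any particular system) runs this without-replacement variant; grad_fn(x, t) is an assumed function returning the gradient of the loss on tuple t.

import numpy as np

def sgd(grad_fn, x0, tuples, lr=0.1, max_epochs=20, tol=1e-6):
    x = np.array(x0, dtype=float)                      # 1. initialization
    for _ in range(max_epochs):                        # 2. iterative computation
        x_prev = x.copy()
        for i in np.random.permutation(len(tuples)):   # shuffle all tuples before the epoch
            x -= lr * grad_fn(x, tuples[i])            # per-tuple stochastic gradient step
        if np.linalg.norm(x - x_prev) < tol:           # 3. termination
            break
    return x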

2.2 In-database machine learning systems

There has been a plethora of work in the past decade focusing on in-DB ML [4,5,6,7,8,9,10,11,12, 14, 15, 25,26,27]. Most existing in-DB ML systems implement the SGD algorithm using “user-defined aggregates” (UDAs) [4, 5]. In detail, each epoch of SGD is done via an invocation of the corresponding UDA function, where the parameters of model \({\textbf{x}}\) are treated as the aggregate state and updated for each tuple.

To implement the data shuffling step required by SGD, different in-DB ML systems adopt distinct strategies. For example, some systems such as MADlib [4] and DB4ML [28] assume that the training data has already been shuffled, so they do not perform any data shuffling. Other systems, such as Bismarck [5], do not make this assumption. Instead, they either perform a pre-shuffle of the data in an offline manner and then store the shuffled data as a replica in the database, or perform partial data shuffling based on sampling techniques such as reservoir sampling and sliding-window sampling. As we will see in the next section, such partial data shuffling strategies, despite alleviating the computation and storage overhead of the pre-shuffle strategy, raise new issues regarding the convergence of SGD, since the data is insufficiently shuffled and does not follow the purely random order required.

2.3 Deep learning systems

Deep learning systems such as PyTorch and TensorFlow are now widely used in industry and academia for AI tasks, including image classification, natural language processing, speech recognition, etc. These systems usually leverage the SGD optimizer or its variants [29,30,31] for training deep learning models. To facilitate data loading, these systems classify datasets into two types, namely, map-style datasets and iterable-style datasets. Map-style datasets refer to datasets whose tuples can be randomly accessed by indices. For example, if an image dataset is stored in an in-memory array as \(\langle \texttt {image}, \texttt {label} \rangle \) tuples, it is a map-style dataset that can be randomly accessed by the array index. Iterable-style datasets refer to datasets that can only be accessed in sequence, which is usually used for datasets that cannot fit in memory. It is easy to shuffle map-style datasets since we only need to shuffle the indices and access the tuples based on the shuffled indices. However, if map-style datasets are stored on secondary storage such as HDD/SSD, this random access usually leads to low I/O performance (see Fig. 4). To alleviate this problem, TensorFlow provides Sliding-Window Shuffle using sliding-window-based sampling. As we will see in Sect. 3, Sliding-Window Shuffle often results in low accuracy of the trained model.
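The distinction can be sketched with PyTorch’s two base classes; read_records below is a hypothetical decoder for an on-disk record format, not a real PyTorch API.

from torch.utils.data import Dataset, IterableDataset

class MapStyleImages(Dataset):
    """Map-style: tuples are addressed by index, so shuffling only permutes indices."""
    def __init__(self, tuples):
        self.tuples = tuples                 # e.g., in-memory <image, label> pairs

    def __len__(self):
        return len(self.tuples)

    def __getitem__(self, idx):
        return self.tuples[idx]              # random access by index

class IterableImages(IterableDataset):
    """Iterable-style: tuples can only be consumed in storage order."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "rb") as f:
            yield from read_records(f)       # read_records: assumed sequential decoder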

3 Study of data shuffling strategies for SGD

In this section, we present a systematic analysis of data shuffling strategies used by existing ML systems. We consider five common data shuffling strategies: (1) Epoch Shuffle, (2) Shuffle Once, (3) No Shuffle, (4) Sliding-Window Shuffle [17], and (5) MRS Shuffle [5]. We use diverse SGD workloads, including generalized linear models such as logistic regression (LR) and support vector machine (SVM), as well as deep learning models such as VGG [32] and ResNet [33].

Experimental Setups. We use the criteo dataset [34] for generalized linear models, and the cifar-10 image dataset [35] for deep learning. Each dataset has two versions: a label-clustered (or clustered for short) version and a feature-ordered version. In the clustered version, all tuples are clustered by their labels, whereas in the feature-ordered version all tuples are ordered by the first feature without loss of generality—we observed similar results when ordering by other features, as shown in [20] (Section 6.3.3). The usage of clustered datasets is inspired by similar settings leveraged in [5], with the goal of testing the worst-case scenarios of data shuffling strategies for SGD. For example, the clustered version of the criteo dataset has the negative tuples (with “\(-1\)” labels) ordered before the positive tuples (with “\(+1\)” labels).

3.1 “Shuffle once” and “Epoch shuffle”

The Shuffle Once strategy performs an offline shuffle of all data tuples, either in place or by storing the shuffled tuples as a copy; SGD is then executed over the shuffled data. Albeit simple (but costly), it is arguably a strong baseline, and many state-of-the-art in-DB ML systems implicitly assume it when they take as input an already shuffled dataset. Epoch Shuffle goes further and shuffles the training dataset before each training epoch; its data shuffling cost therefore grows linearly with the number of epochs.

Convergence. As illustrated in Fig. 2, Shuffle Once can achieve a convergence rate comparable to Epoch Shuffle on both clustered and feature-ordered datasets, confirming previous observations [5].

Performance. Although Shuffle Once reduces the number of data shuffles to only once, the shuffle itself can be very expensive on large datasets due to the random access of tuples, as we will show in our experiments. Previous work has also reported that shuffling a huge dataset could not be finished in one day [5]. Another problem of Shuffle Once is that, when in-place shuffle is not feasible, it needs to duplicate the data, which can double the space overhead.

Fig. 2 The convergence rates of SGD with different data shuffling strategies for both label-clustered and feature-ordered datasets, using the same buffer size (10% of the dataset size) for MRS and Sliding-Window Shuffles. LR and SVM use the standard SGD, while VGG19 and ResNet18 use mini-batch SGD with batch size of 64

3.2 “No Shuffle”

The No Shuffle strategy does not perform any data shuffle at all, i.e., the SGD algorithm runs over the given data order in each epoch. Running MADlib over a dataset, or running PyTorch over an iterable-style dataset (IterableDataset), effectively uses the No Shuffle strategy.

Convergence. On both clustered and feature-ordered data, No Shuffle suffers from the lowest model accuracy. This is not surprising, as SGD relies on random data order to converge.

Performance. No Shuffle is the fastest among the five data shuffling strategies, as it can always sequentially, instead of randomly, access the data tuples [36].

3.3 “Sliding-Window Shuffle”

The Sliding-Window Shuffle strategy leverages a sliding window to perform partial data shuffling, which is used by TensorFlow [17]. It includes the following steps:

  1. Allocate a sliding window and fill tuples into the window as they are scanned.

  2. Randomly select a tuple from the window and use it for the SGD computation. The slot of the selected tuple in the window is then filled in by the next incoming tuple.

  3. Repeat (2) until all tuples are scanned (a code sketch follows this list).
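The following single-threaded sketch is our own approximation of these steps, with update(t) standing in for one SGD step on tuple t; draining the remaining window at the end of the scan is one plausible way to finish the epoch.

import random

def sliding_window_shuffle(tuples, window_size, update):
    window = []
    for t in tuples:                              # sequential scan over the stored order
        if len(window) < window_size:
            window.append(t)                      # step 1: fill the window
            continue
        slot = random.randrange(window_size)      # step 2: pick a random slot
        update(window[slot])                      #   use the selected tuple for SGD
        window[slot] = t                          #   refill the slot with the incoming tuple
    random.shuffle(window)                        # step 3 done: drain what is left
    for t in window:
        update(t)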

Convergence. As illustrated in Fig. 2, for clustered datasets, Sliding-Window Shuffle can achieve higher model accuracy than No Shuffle but lower accuracy than Shuffle Once when SGD converges. The reason is that this strategy shuffles the data only partially. For two data examples \({\textbf{t}}_i\) and \({\textbf{t}}_j\) where \({\textbf{t}}_i\) is stored much earlier than \({\textbf{t}}_j\) (\(i \ll j\)), it is likely that \({\textbf{t}}_i\) is still selected before \({\textbf{t}}_j\). As a result, on the clustered datasets used in our study, negative tuples are more likely to be selected (for SGD) before positive ones, which distorts the training data seen by SGD and leads to low model accuracy. This accuracy degradation also appears on feature-ordered datasets when training LR and SVM. For VGG and ResNet on the feature-ordered cifar-10 dataset, Sliding-Window Shuffle achieves only 0.3–0.4% lower accuracy than Epoch Shuffle and Shuffle Once, because ordering cifar-10 images by a pixel feature still yields a nearly random data order.

Performance. Sliding-Window Shuffle can achieve I/O performance comparable to No Shuffle, as it also only needs to sequentially access the data tuples with limited additional CPU overhead to maintain and sample from the sliding window.

3.4 “Multiplexed reservoir sampling shuffle”

Multiplexed Reservoir Sampling (MRS) Shuffle uses two concurrent threads to read tuples and update a shared model [5]. The first thread sequentially scans the dataset and performs reservoir sampling. The sampled (i.e., selected) tuples are stored in a buffer \(B_1\), and the dropped (i.e., not selected) ones are used for SGD. The second thread loops over the tuples from another buffer \(B_2\) for SGD, where tuples in \(B_2\) are simply copied from \(B_1\) (by swapping \(B_1\) and \(B_2\) once reservoir sampling ends). PyTorch’s shuffling strategy uses pure reservoir sampling [37], which is a weaker version of (i.e., has a lower convergence rate than) MRS Shuffle, as detailed in Section 3.4 of [5]. Therefore, we use MRS Shuffle instead of reservoir sampling as one baseline.
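A simplified, single-threaded sketch of the reservoir-sampling side of MRS is shown below (the actual Bismarck implementation runs the buffer-looping second thread concurrently and differs in details); update(t) denotes one SGD step.

import random

def mrs_scan(tuples, buffer_size, update):
    reservoir = []                                   # buffer B1, filled by reservoir sampling
    for i, t in enumerate(tuples):                   # first thread: sequential scan
        if len(reservoir) < buffer_size:
            reservoir.append(t)                      # initial fill of the reservoir
            continue
        j = random.randint(0, i)
        if j < buffer_size:
            dropped, reservoir[j] = reservoir[j], t  # evicted tuple is "dropped"
        else:
            dropped = t                              # incoming tuple is "dropped"
        update(dropped)                              # dropped tuples drive SGD
    return reservoir                                 # later swapped into the looping buffer B2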

Convergence. As illustrated in Fig. 2, MRS Shuffle achieves higher accuracy than Sliding-Window Shuffle but lower accuracy than Shuffle Once on clustered datasets. The reason is similar to that given for Sliding-Window Shuffle: the shuffle based on reservoir sampling is again partial. Specifically, the dropped tuples still arrive in (nearly) their original order, i.e., if \(i \ll j\), \({\textbf{t}}_i\) is likely to be processed by SGD before \({\textbf{t}}_j\). Moreover, looping over the sampled tuples may lead to a suboptimal data distribution—the sampled tuples in the looping buffer \(B_2\) may be used multiple times, which can cause data skew and lower model accuracy (e.g., the accuracy of VGG/ResNet on the feature-ordered cifar-10 dataset).

Performance. MRS Shuffle is fast, as the first thread only needs to sequentially scan the tuples for reservoir sampling. However, it is slightly slower than Sliding-Window Shuffle and No Shuffle, as there is a second thread that loops over the buffered tuples.

Table 1 A summary of different data shuffling strategies, where bold fonts represent the “ideal” scenario. We assume all strategies that require an in-memory buffer have reasonably large buffer size, e.g., 10% of dataset size
Fig. 3 The tuple id distribution (a–e) and corresponding label distribution (f–j). Tuple id denotes the tuple position after shuffling. #tuple refers to the number of negative/positive tuples in every 20 shuffled tuples

3.5 Analysis and summary

Table 1 summarizes the characteristics of different data shuffling strategies. As discussed, the effectiveness of data shuffling strategies for SGD largely depends on two somewhat conflicting factors, namely, (1) the degree of data randomness of the shuffled tuples and (2) the I/O efficiency when scanning data from disk. There is an apparent trade-off between these two factors:

  • The more random the tuples are, the better the convergence rate of SGD is. Epoch Shuffle introduces data randomness at the highest level, but is too expensive to implement in in-DB ML and deep learning systems. Shuffle Once also introduces good data randomness, which is usually the best practice in terms of SGD convergence for in-DB ML systems.

  • A higher degree of randomness implies more random disk accesses and thus lower I/O efficiency. As a result, the No Shuffle strategy is the best in terms of I/O efficiency.

The other strategies (Sliding-Window Shuffle and MRS Shuffle) trade data randomness for better I/O efficiency, leaving room for improvement.

Example 1

To better understand these issues, consider a clustered dataset with 1,000 tuples, each of which has a tuple-id and a label, where tuple-id of the i-th tuple is i. The first 500 tuples are negative and the next 500 tuples are positive. Figure 3 plots the tuple-id distributions and corresponding label distributions after Sliding-Window Shuffle and MRS Shuffle, with a comparison to the ideal distributions from a full shuffle. The tuple-id distribution illustrates the positions of the tuples after shuffling, whereas the label distribution illustrates the number of negative/positive tuples in every 20 tuples shuffled.

We can observe that Sliding-Window Shuffle results in a “linear”-shape distribution of the tuple-id after shuffling, as shown in Fig. 3b, which suggests that the tuples are almost not shuffled. The corresponding label distribution in Fig. 3g further confirms this, where almost all negative labels still appear before positive ones after shuffling. Similar patterns can be observed for MRS Shuffle in Fig. 3c, h, though MRS Shuffle has improved over Sliding-Window Shuffle. In summary, the data randomness achieved by Sliding-Window Shuffle or MRS Shuffle is far from the ideal case, as shown in Fig. 3d, i. In contrast, we will see in the next section that the data randomness of our CorgiPile is closer to the ideal full shuffle (Fig. 3e, j).

4 CorgiPile

As illustrated in the previous section, data shuffling strategies used by existing ML systems can be suboptimal when dealing with data that are not fully shuffled. Although recent efforts have significantly improved over baseline methods, there is still a large room for improvement. Inspired by these previous efforts, we present a simple but novel data shuffling strategy named CorgiPile. The key idea of CorgiPile lies in the following two-level hierarchical shuffling mechanism:

We first randomly select a set of blocks (each block refers to a set of contiguous tuples) and put them into an in-memory buffer; we then randomly shuffle all tuples in the buffer and use them for the SGD computation.

Despite its simplicity, CorgiPile is highly effective. In terms of hardware efficiency, when the block size is large enough (e.g., 10MB+), a random access on the block level can be as efficient as a sequential scan, as shown in the I/O performance test on HDD and SSD in Fig. 4. In terms of statistical efficiency, as we will show, given the same buffer size, CorgiPile converges much better than Sliding-Window Shuffle and MRS Shuffle. Nevertheless, both the convergence analysis and its integration into PostgreSQL and PyTorch are non-trivial. In the following, we first describe the CorgiPile algorithm precisely and then present a theoretical analysis on its convergence behavior.

Fig. 4 Random access performance vs. block size. Randomly accessing a data block is faster than randomly accessing a data tuple. Larger block size results in more sequential accesses on data tuples and fewer cache misses. When the data block is large (e.g., 10 MB), random block access can be as efficient as sequential scan

Notations and definitions. The following is a list of notations and definitions that we will use:

  • \(\Vert \cdot \Vert \): the \(\ell _2\)-norm for vectors and the spectral norm for matrices;

  • \(\lesssim \): For two arbitrary vectors \(a\) and \(g\), we use \(a_s\lesssim g_s\) to denote that there exists a certain constant C that satisfies \(a_s\le C g_s\) for all s;

  • N, the total number of blocks (\(N\ge 2\));

  • n, the buffer size (i.e., the number of blocks kept in the buffer);

  • b, the size (number of tuples) of each data block;

  • \(B_l\), the set of tuple indices in the l-th block (\(l \in [N]\) and \(|B_l| = b\));

  • m, the number of tuples for the finite-sum objective (\(m = N b\));

  • \(f_i(\cdot )\), the function associated with the i-th tuple;

  • \(\nabla F(\cdot )\) and \(\nabla f_i(\cdot )\), the gradients of the functions \(F(\cdot )\) and \(f_i(\cdot )\);

  • \(H_i(\cdot ):= \nabla ^2 f_i(\cdot )\), the Hessian matrix of the function \(f_i(\cdot )\);

  • \({\textbf{x}}^*\), the global minimizer of the function \(F(\cdot )\);

  • \({\textbf{x}}^s_k\), the model \({\textbf{x}}\) in the k-th iteration at the s-th epoch;

  • \(\mu \)-strong convexity: function \(F({\textbf{x}})\) is \(\mu \)-strongly convex if \(\forall {\textbf{x}}, {\textbf{y}}\),

    $$\begin{aligned} F({\textbf{x}}) \ge F({\textbf{y}}) + \left\langle {\textbf{x}}-{\textbf{y}}, \nabla F({\textbf{y}}) \right\rangle + \frac{\mu }{2}\Vert {\textbf{x}}-{\textbf{y}}\Vert ^2. \end{aligned}$$
    (1)

4.1 The CorgiPile algorithm

Algorithm 1 CorgiPile Algorithm (figure c)

Algorithm 1 illustrates the details of CorgiPile. At each epoch (say, the s-th epoch), CorgiPile runs the following steps:

  1. (Sample) Randomly sample n blocks out of N data blocks without replacement and load the n blocks into the buffer. Note that we sample without replacement to avoid visiting the same tuple multiple times in each epoch, which converges faster and is a standard practice in most ML systems [5, 22, 38,39,40].

  2. (Shuffle) Shuffle all tuples in the buffer. We use \(\varvec{\psi }_s\) to denote an ordered set, whose elements are the indices of the shuffled tuples at the s-th epoch. The size of \(\varvec{\psi }_s\) is bn, where b is the number of tuples per block. \(\varvec{\psi }_s(k)\) is the k-th element in \(\varvec{\psi }_s\).

  3. (Update) Perform gradient descent by scanning each tuple with the shuffle indices in \(\varvec{\psi }_s\), yielding the updating rule

    $$\begin{aligned} {\textbf{x}}^s_k = {\textbf{x}}^s_{k-1} - \eta _s \nabla f_{\varvec{\psi }_s(k)} \left( {\textbf{x}}^s_{k-1} \right) , \end{aligned}$$

    where \(\nabla f_{\varvec{\psi }_s(k)}(\cdot )\) is the gradient of the function associated with the data sample with index \(\varvec{\psi }_s(k)\), and \(\eta _s\) is the learning rate for gradient descent at epoch s. We initialize \({\textbf{x}}_0^0\), and the parameter update is performed for all \(k=1,\ldots ,bn\) in one epoch. A code sketch of these three steps follows.
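The following is a minimal sketch of these three steps (our own rendering of Algorithm 1, whose full pseudocode is in figure c); grad_fn and lr_schedule are assumed helpers, and x is a NumPy-like model vector.

import random

def corgipile_sgd(blocks, n, x, grad_fn, lr_schedule, num_epochs):
    N = len(blocks)
    for s in range(num_epochs):
        sampled = random.sample(range(N), n)       # step 1: sample n of N blocks, w/o replacement
        buffer = [t for l in sampled for t in blocks[l]]
        random.shuffle(buffer)                     # step 2: tuple-level shuffle (the order psi_s)
        eta = lr_schedule(s)
        for t in buffer:                           # step 3: per-tuple gradient updates
            x = x - eta * grad_fn(x, t)
    return x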

Intuition behind CorgiPile. Before we present the formal theoretical analysis, we first illustrate the intuition behind CorgiPile, following the same example used in Sect. 3.5.

Example 2

Consider the same settings as those in Example 1. Recall that CorgiPile contains both block-level and tuple-level shuffles. Suppose that the block-level shuffle generates a random order of blocks as {b20, b8, b45, b0,...} and the buffer can hold 10 blocks. The tuple-level shuffle will put the first 10 blocks into the buffer, whose tuple_ids are {b20[400, 419], b8[160, 179], b45[900, 919], b0[0, 19],...}. After shuffling, the buffered tuples will have random tuple_ids in a large non-contiguous interval that is the union of {[0, 19], [160, 179],..., [900, 919]}, as shown in the first 200 tuples in Fig. 3e. The buffered tuples therefore follow a random order closer to what is given by a full shuffle. As a result, the corresponding label distribution, as shown in Fig. 3j, is closer to a uniform distribution.

Performance. While No Shuffle only requires sequential I/O’s, CorgiPile needs to (1) randomly access blocks, (2) copy all tuples in these blocks into a buffer, and (3) shuffle the tuples inside the buffer. Here, randomly accessing a block means randomly picking a block and reading the tuples of this block from secondary storage (e.g., the disk) into memory. If the block size is large enough, the I/O performance of random and sequential accesses is close. CorgiPile incurs additional overheads for buffer copy and in-memory shuffle; however, these overheads can be hidden via standard techniques such as double buffering. As we will show in our experiments on PostgreSQL, the optimized version of CorgiPile incurs only 11.7% additional overhead compared to the most efficient No Shuffle baseline.

4.2 Convergence analysis

Despite its simplicity, the convergence analysis of CorgiPile is not trivial—even reasoning about the convergence of SGD with sampling without replacement has been an open question for decades [40,41,42,43], let alone a hierarchical sampling scheme like ours. Fortunately, a recent theoretical advance [40] provides us with the technical language to reason about CorgiPile’s convergence. In the following, we present a novel theoretical analysis for CorgiPile.

Note that in our following analysis, one epoch denotes going through all tuples in the sampled n blocks.

Assumption 1

We make the following standard assumptions, as in other previous work on SGD convergence analysis [44, 45]:

  1. \( F(\cdot )\) and \( f_i(\cdot )\) are twice continuously differentiable.

  2. L-Lipschitz gradient: \(\exists L \in {\mathbb {R}}_+\), \(\Vert \nabla f_i ({\textbf{x}}) - \nabla f_i ({\textbf{y}})\Vert \le L \Vert {\textbf{x}}-{\textbf{y}}\Vert \) for all \(i \in [m]\).

  3. \(L_H\)-Lipschitz Hessian matrix: \(\Vert H_i({\textbf{x}}) - H_i({\textbf{y}})\Vert \le L_H\Vert {\textbf{x}}-{\textbf{y}}\Vert \) for all \(i\in [m]\).

  4. Bounded gradient: \(\exists G \in {\mathbb {R}}_+\), \(\Vert \nabla f_i ({\textbf{x}}_k^s) \Vert \le G\) for all \(i \in [m]\), \(k\in [K-1]\), and \(s\in \{0,1\ldots ,S\}\).

  5. Bounded variance: \({\mathbb {E}}_{\xi } [\Vert \nabla f_{\xi } ({\textbf{x}}) - \nabla F({\textbf{x}}) \Vert ^2] = \frac{1}{m}\sum _{i=1}^m\Vert \nabla f_i({\textbf{x}})-\nabla F({\textbf{x}})\Vert ^2\le \sigma ^2\), where \(\xi \) is the random variable that takes values in [m] with equal probability 1/m. Here \(\sigma ^2\) denotes the upper bound of the variance for sampling the gradient \(\nabla f_{\xi } ({\textbf{x}})\).

Factor \(h_D\). In our analysis, we use the factor \(h_D\) to characterize the upper bound of a block-wise data variance:

$$\begin{aligned} \frac{1}{N}\sum _{l=1}^N \left\| \nabla f_{B_l}({\textbf{x}}) - \nabla F({\textbf{x}}) \right\| ^2 \le h_D \frac{\sigma ^2}{b}, \end{aligned}$$

where \(b=|B_l|\) is the size of each data block (recall the definition of b). Here, \(h_D\) is an essential parameter to measure the “cluster” effect within the original data blocks. Let’s consider two extreme cases: (1) (\(h_D=1\)) all samples in the data set are fully shuffled, such that the data in each block follows the same distribution; (2) (\(h_D=b\)) samples are well clustered in each block, for example, all samples in the same block are identical. Therefore, the larger \(h_D\), the more “clustered” the data.
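To make the two extreme cases concrete, assume (as is standard, though implicit above) that \(\nabla f_{B_l}({\textbf{x}})\) denotes the average gradient over the tuples in block \(B_l\). If all tuples in a block are identical, the block average equals each tuple gradient, so

$$\begin{aligned} \frac{1}{N}\sum _{l=1}^N \left\| \nabla f_{B_l}({\textbf{x}}) - \nabla F({\textbf{x}}) \right\| ^2 = \frac{1}{m}\sum _{i=1}^m \left\| \nabla f_i({\textbf{x}}) - \nabla F({\textbf{x}}) \right\| ^2 \le \sigma ^2 = b \cdot \frac{\sigma ^2}{b}, \end{aligned}$$

and the smallest valid factor is \(h_D = b\). Conversely, if tuples are assigned to blocks uniformly at random, each block gradient averages b (roughly independent) tuple gradients, so its expected squared deviation from \(\nabla F({\textbf{x}})\) shrinks to about \(\sigma ^2/b\), i.e., \(h_D \approx 1\).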

We now present the results for both strongly convex objectives (corresponding to generalized linear models) and non-convex objectives (corresponding to deep learning models), in order to show the correctness and efficiency of CorgiPile. Due to space limitations, we detail the proofs of the following theorems in the Appendix of our technical report [46].

Strongly convex objective We first show the result for a strongly convex objective that satisfies the strong convexity condition (1).

Theorem 1

Suppose that \(F({\textbf{x}})\) is a smooth and \(\mu \)-strongly convex function. Let \(T = S n b\) be the total number of tuples iterated during training, where \(S\ge 1\) is the number of epochs. Choosing \(\eta _s = \frac{6}{bn\mu (s+a)}\) where \(a \ge \max \left\{ \frac{8LG + 24\,L^2 + 28L_H G}{\mu ^2}, \frac{24\,L}{\mu } \right\} \), under Assumption 1, CorgiPile has the following convergence rate

$$\begin{aligned} {\mathbb {E}}[F\left( {\bar{{\textbf{x}}}}_S \right) - F({\textbf{x}}^*)] \lesssim ( 1-\alpha )h_D\sigma ^2 \frac{1}{T} + \beta \frac{1}{T^2} + \gamma \frac{ m^3}{T^3}, \end{aligned}$$
(2)

where \({\bar{{\textbf{x}}}}_S = \frac{\sum _s (s+a)^3 {\textbf{x}}_s}{\sum _s (s+a)^3}\), and

$$\begin{aligned} \alpha := \frac{n-1}{N-1}, \quad \beta := \alpha ^2 + ( 1-\alpha )^2(b-1)^2, \quad \gamma := \frac{n^3}{N^3}. \end{aligned}$$

Tightness. The convergence rate of CorgiPile is tight in the following sense:

  • \(\alpha = 1\): It means that \(n=N\), i.e., all tuples are fetched to the buffer. Then CorgiPile reduces to full shuffle SGD algorithm [40]. In this case, the upper bound in Theorem 1 is \(O(1/T^2 + m^3/T^3)\), which matches the result of the full shuffle SGD algorithm [40].

  • \(\alpha = 0\): It means that \(n=1\), i.e., only sampling one block each time. Then CorgiPile is very close to mini-batch SGD (by viewing a block as a mini-batch), except that the model is updated once per data tuple. Ignoring the higher-order terms in (2), our upper bound \(O(h_D\sigma ^2/ T)\) is consistent with that of mini-batch SGD.

Comparison to vanilla SGD. In vanilla SGD, we randomly select one tuple at a time from the dataset to update the model, which admits a convergence rate of \(O(\sigma ^2/T)\). For our algorithm, when T is sufficiently large, the term \((1-\alpha )h_D(\sigma ^2/T)\) in (2) is dominating. If \(n \gg (h_D -1) (N-1)/h_D + 1\) for \(h_D > 0\) (i.e., sampling sufficiently many blocks), the factor \((1-\alpha ) h_D\) in the dominating term is much smaller than 1. Therefore, ignoring the higher-order terms in (2) for large T, our algorithm admits a faster convergence rate than the \(O(\sigma ^2/T)\) of vanilla SGD. It is also worth noting that, even if n is small, CorgiPile may still significantly outperform vanilla SGD in practice. Assume that reading a single random tuple incurs an overhead of \(t_{\text {lat}}+t_{\text {t}}\) and reading a block of b tuples incurs an overhead of \(t_{\text {lat}}+bt_{\text {t}}\), where \(t_{\text {lat}}\) is the “latency” of one read/write operation that does not grow linearly with the amount of data read/written (e.g., SSD read/write latency or HDD “seek and rotate” time), and \(t_{\text {t}}\) is the time needed to transfer a single tuple. To reach an error of \(\epsilon \), vanilla SGD requires time

$$\begin{aligned} O\left( \frac{\sigma ^2}{\epsilon } t_{\text {lat}} + \frac{\sigma ^2}{\epsilon } t_{\text {t}} \right) , \end{aligned}$$

whereas CorgiPile requires time

$$\begin{aligned} O\left( (1-\alpha ) \frac{h_D}{b} \cdot \frac{\sigma ^2}{\epsilon } t_{\text {lat}} + (1-\alpha ) h_D \cdot \frac{\sigma ^2}{\epsilon } t_{\text {t}} \right) . \end{aligned}$$

Because \((1-\alpha ) \frac{h_D}{b} < 1\), CorgiPile always provides benefit over vanilla SGD in terms of the read/write latency \(t_{\text {lat}}\). When \(t_{\text {lat}}\) dominates \(t_{\text {t}}\), CorgiPile can outperform vanilla SGD even for small buffers.

Non-convex objective We further conduct an analysis on objectives that are non-convex or satisfy the Polyak–Łojasiewicz condition, which leads to similar insights on the behavior of CorgiPile.

Theorem 2

Suppose that \(F({\textbf{x}})\) is a smooth function. Letting \(T = S n b\) be the number of tuples iterated, under Assumption 1, CorgiPile has the following convergence rate:

  1. When \(\alpha \le \frac{N-2}{N-1}\), choosing \(\eta _s = \frac{1}{\sqrt{bn(1-\alpha )h_D\sigma ^2 S} }\) and assuming \(S \ge \frac{bn(\frac{104}{3}L + \frac{4}{3}L_H)^2}{\sigma ^2 (1-\alpha ) h_D}\), we have

    $$\begin{aligned} \frac{1}{S} \sum _{s=1}^S {\mathbb {E}}\Vert \nabla F({\textbf{x}}_0^s)\Vert ^2 \lesssim ( 1-\alpha )^{1/2} \frac{\sqrt{h_D } \sigma }{\sqrt{T}} + \beta \frac{1}{T} +\gamma \frac{ m^3}{T^{\frac{3}{2}}}, \end{aligned}$$

    where the factors are defined as

    $$\begin{aligned} \alpha := \frac{n-1}{N-1}, \quad \beta := \frac{\alpha ^2}{1-\alpha } \frac{1}{h_D \sigma ^2} + ( 1-\alpha )\frac{(b-1)^2}{h_D \sigma ^2}, \quad \gamma := \frac{n^3}{(1-\alpha )N^3}; \end{aligned}$$

  2. When \(\alpha =1\), choosing \(\eta _s = \frac{1}{(m S)^{\frac{1}{3}} }\) and assuming \(S \ge (\frac{416}{3}L + \frac{16}{3}L_H)^3 b^2 n^3 / N\), we have

    $$\begin{aligned} \frac{1}{S} \sum _{s=1}^S {\mathbb {E}}\Vert \nabla F({\textbf{x}}_0^s)\Vert ^2\lesssim \frac{1}{T^{\frac{2}{3}}} +\gamma ' \frac{ m^3}{T}, \end{aligned}$$

    where we define \(\gamma ':= \frac{n^3}{N^3}\).

We can apply an analysis similar to that of Theorem 1 to compare CorgiPile with vanilla SGD in terms of convergence rate, and reach similar insights.

5 Implementation in the database

We have integrated CorgiPile into PostgreSQL. Our implementation provides a simple SQL-based interface for users to invoke CorgiPile, with the following query template:

figure d

This interface is similar to that offered by existing in-DB ML systems like MADlib [4, 13] and Bismarck [5]. Examples of the params include learning_rate = 0.1, max_epoch_num = 20, and block_size = 10MB. CorgiPile outputs various metrics after each epoch, including the training loss, accuracy, and execution time.

The Need of a Deeper Integration Unlike existing in-DB ML systems, we choose not to implement our CorgiPile strategy using user-defined aggregates (UDAs). Instead, we integrate CorgiPile into PostgreSQL by introducing new physical operators. Is such a deep integration with database system internals necessary, compared to a potential UDA-based implementation that does not modify the internals?

While a UDA-based implementation is conceptually possible, it is not natural for CorgiPile, which requires access to low-level data layout information such as table pages, tuples, and buffers. A deeper integration with database internals makes it much easier to reuse functionalities that are built into the core APIs of the database system but have not been externally exposed to UDAs. Moreover, such a physical-level integration opens the door to more advanced optimizations, such as the double buffering illustrated in Sect. 5.3.

5.1 Design considerations

As discussed in Sect. 4.1, CorgiPile consists of three steps: (1) block-level shuffling, (2) tuple-level shuffling, and (3) SGD computation. Accordingly, we design three physical operators:

  • BlockShuffle, an operator for randomly accessing blocks;

  • TupleShuffle, an operator for buffering a batch of blocks and shuffling their tuples;

  • SGD, an operator for the SGD computation.

We then chain these three operators together to form a pipeline and implement the getNext() method for each operator, following the classic Volcano-style execution model [19] that is also the query execution paradigm of PostgreSQL.

One challenge is the implementation of the SGD operator, which requires an iterative procedure that is not typically supported by database systems. We choose to implement it by leveraging the built-in re-scan mechanism of PostgreSQL to reshuffle and reread the data after each epoch.

We store datasets as tables in PostgreSQL using the schema \(\langle \textit{id}, \textit{features\_k}[], \textit{features\_v}[], \textit{label} \rangle \), which is similar to the one used by Bismarck [5]. For sparse datasets, \(\textit{features\_k}[]\) indicates which dimensions have nonzero values, and \(\textit{features\_v}[]\) stores the corresponding nonzero feature values. For dense datasets, only \(\textit{features\_v}[]\) is used.

Currently, we store the (learned) machine learning model as an in-memory object (a C-style struct) with an ID in PostgreSQL’s kernel instead of using a UDA. Users can initialize the model hyperparameters via the query. For inference, users can execute a query such as “SELECT table PREDICT BY model ID”, which invokes the learned model for prediction.

Fig. 5 CorgiPile in PostgreSQL, with three new operators and the “double-buffering” optimization

5.2 Physical operators

The control flow of the three operators is illustrated in Fig. 5, which leverages PostgreSQL’s pull-style dataflow to read tuples and perform the SGD computation. In the following, we assume that readers are familiar with the structure of PostgreSQL’s operators, e.g., functions such as ExecInit() and getNext().

After parsing the input query, CorgiPile invokes ExecInit() of each operator to initialize their states such as ML models and I/O buffers. At each epoch, the SGD operator pulls tuples from the TupleShuffle operator for SGD computation, which further pulls tuples from the BlockShuffle operator. The BlockShuffle operator is responsible for shuffling blocks and reading their tuples. We now present the implementation of these operators.

  (1) BlockShuffle: This operator first obtains the total number of pages via PostgreSQL’s internal function RelationGetNumberOfBlocks(). It then computes the number of blocks \(\textit{BN}\) as \(\textit{BN} = \textit{page\_num} * \textit{page\_size} / \textit{block\_size}\). After that, it shuffles the block indices \([0, \dots , \textit{BN}-1]\) to obtain the shuffled block ids, where each block corresponds to a batch of contiguous table pages. For each shuffled block id, it reads the corresponding pages using heapgetpage() and returns each fetched tuple to the TupleShuffle operator. The BlockShuffle operator is similar to PostgreSQL’s Scan operator, except that the Scan operator reads pages sequentially instead of randomly.

  (2) TupleShuffle: It first allocates a buffer, and then pulls tuples one by one from the BlockShuffle operator by invoking its ExecTupleShuffle(), namely getNext(). Each pulled tuple is transformed into an SGDTuple object, which is then copied into the buffer. Once the buffer is full, it shuffles the buffered tuples, similarly to how the Sort operator works in PostgreSQL. After that, the shuffled tuples are returned one by one to the SGD operator.

  (3) SGD: It first initializes an ML model in ExecInitSGD() and then executes the SGD computation in ExecSGD(). At each epoch, ExecSGD() pulls tuples from TupleShuffle one by one and runs the SGD computation. Once all tuples are processed, an epoch ends. The operator then reshuffles and rereads the tuples for the next epoch, using the re-scan mechanism of PostgreSQL. Specifically, after each epoch, SGD invokes ExecReScan() of TupleShuffle to reset the I/O states of the buffer. It further invokes ExecReScan() of BlockShuffle to reshuffle the block ids. After that, the SGD operator can reread the shuffled tuples via ExecSGD() for the next epoch. This is similar to the behavior of multiple table/index scans in PostgreSQL’s NestedLoopJoin. A simplified sketch of this pull-based pipeline follows.
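The actual operators are C routines inside PostgreSQL’s executor; the following Python sketch only mimics the pull-based (Volcano-style) structure of the pipeline, with generators standing in for getNext() and grad_fn standing in for the per-tuple gradient computation.

import random

class BlockShuffle:
    """Leaf: visits blocks in a random order and emits their tuples."""
    def __init__(self, blocks):
        self.blocks = blocks

    def tuples(self):                                   # stands in for getNext()
        order = random.sample(range(len(self.blocks)), len(self.blocks))
        for block_id in order:
            yield from self.blocks[block_id]

class TupleShuffle:
    """Buffers a batch of blocks' tuples pulled from its child, shuffles, emits them."""
    def __init__(self, child, buffer_tuples):
        self.child, self.buffer_tuples = child, buffer_tuples

    def tuples(self):
        buf = []
        for t in self.child.tuples():
            buf.append(t)
            if len(buf) == self.buffer_tuples:
                random.shuffle(buf)
                yield from buf
                buf = []
        random.shuffle(buf)                             # last, partially filled buffer
        yield from buf

class SGDOperator:
    """Root: consumes shuffled tuples and applies per-tuple SGD updates."""
    def __init__(self, child, x, grad_fn, lr):
        self.child, self.x, self.grad_fn, self.lr = child, x, grad_fn, lr

    def run_epoch(self):                                # one pull-based pass = one epoch
        for t in self.child.tuples():
            self.x = self.x - self.lr * self.grad_fn(self.x, t)

Chaining the operators as SGDOperator(TupleShuffle(BlockShuffle(blocks), buffer_tuples), x0, grad_fn, lr) and calling run_epoch() repeatedly corresponds to the re-scan behavior described above.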

5.3 Optimizations

As discussed in Sect. 4.1, CorgiPile introduces additional overheads for buffer copy and shuffle. To reduce them, we use a double-buffering strategy as shown in Fig. 5. Specifically, we launch two concurrent threads for TupleShuffle with two buffers. One write thread is responsible for pulling tuples from BlockShuffle into one buffer and shuffling the buffered tuples; the other read thread is responsible for reading tuples from another buffer and returning them to SGD. The two buffers are swapped once one is full and the other has been consumed by SGD. As a result, the data loading (i.e., block-level and tuple-level shuffling) and SGD computation can be executed concurrently, reducing the overhead.
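The idea can be sketched with a producer thread and a one-slot queue (a simplification of the actual C implementation, which operates on PostgreSQL buffers):

import queue
import random
import threading

def double_buffered_tuples(block_stream, buffer_tuples):
    """Write thread fills and shuffles one buffer while the read side consumes the other."""
    filled = queue.Queue(maxsize=1)            # at most one spare shuffled buffer in flight

    def writer():
        buf = []
        for t in block_stream:                 # pulls tuples coming from BlockShuffle
            buf.append(t)
            if len(buf) == buffer_tuples:
                random.shuffle(buf)
                filled.put(buf)                # hand the full, shuffled buffer to the reader
                buf = []
        random.shuffle(buf)
        filled.put(buf)
        filled.put(None)                       # end-of-stream marker

    threading.Thread(target=writer, daemon=True).start()
    while (buf := filled.get()) is not None:   # read side: tuples are returned to SGD
        yield from buf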

6 Multi-process CorgiPile in PyTorch

We also integrated CorgiPile into PyTorch, a state-of-the-art deep learning system. The main challenge is how to extend single-process CorgiPile to work in the parallel/distributed environment of deep learning systems, which usually use multiple processes and multiple GPUs to train models. For example, PyTorch offers a DistributedDataParallel (DDP) mode [47] for multi-process training, where PyTorch runs multiple processes on a single machine with multiple GPUs, or across a number of machines, to train models.

Fig. 6 a The implementation of CorgiPile in a parallel/distributed environment (e.g., PyTorch) with multiple processes and GPUs. b The shuffled data generated by multi-process CorgiPile is similar to c the shuffled data generated by single-process CorgiPile. d further confirms this statement using more (four) processes

6.1 A multi-process mode of CorgiPile

CorgiPile can be naturally extended to work in a multi-process mode, by enhancing the tuple-level shuffle under the data-parallel computation paradigm. As mentioned in Sect. 4.1, CorgiPile contains both block-level shuffle and tuple-level shuffle. As shown in Fig. 6a, we can naturally implement block-level shuffle by randomly distributing data blocks to different processes. For tuple-level shuffle, we can use multi-buffer-based shuffling instead of single-buffer-based shuffling — in each process we allocate a local buffer to read blocks and shuffle their tuples. The deep learning system can then read the shuffled tuples when running SGD to perform the forward/backward/update computation as well as gradient/parameter communication/synchronization among different processes.

We implement this enhanced multi-process CorgiPile as a new CorgiPileDataset API in PyTorch:

figure e

Similar to the usage of the original Dataset API, users only need to initialize the CorgiPileDataset with the necessary parameters and then use it as usual with the DataLoader API offered by PyTorch. The train() method constantly extracts a batch of tuples from DataLoader and then performs mini-batch SGD. Multi-process CorgiPile can achieve a random data order similar to that of single-process CorgiPile (Sect. 6.3).
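The exact constructor arguments of CorgiPileDataset are those of the listing above (figure e); the usage sketch below therefore treats them (the data path, block index, and buffer size) as illustrative, and assumes a DDP setup in which torch.distributed has been initialized and build_model() is a user-provided model constructor.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

dist.init_process_group("nccl")                        # one process per GPU
world_size = dist.get_world_size()

dataset = CorgiPileDataset("train.tfrecords",          # arguments are illustrative
                           index_path="train.index",
                           buffer_size=256)
loader = DataLoader(dataset, batch_size=256 // world_size, num_workers=2)

model = torch.nn.parallel.DistributedDataParallel(build_model().cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(90):
    for images, labels in loader:                      # mini-batch SGD in each process
        optimizer.zero_grad()
        loss_fn(model(images.cuda()), labels.cuda()).backward()  # DDP all-reduces gradients
        optimizer.step()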

6.2 Implementation details

We next detail the implementation of multi-process CorgiPile in PyTorch:

  (1) Block partitioning: We first partition the dataset into blocks. In a parallel/distributed environment, the dataset is typically stored on block-based parallel/distributed file systems such as HDFS [48], Amazon EBS [49], and Lustre [50]. For example, the ETH Euler cluster [51] uses Lustre, which reads/writes data in blocks (4 MB by default) [52] and does not allow users to store/read massive numbers of small files, such as raw images, in a directory. Therefore, for training on the \(\sim \)150GB ImageNet dataset with 1.3 million raw images [21], which cannot fit into memory, we need to convert these images into binary data files such as the widely used iterable TFRecord format [53, 54] and store them in Lustre before training. In addition, we build a block index to identify the start/end of each block, using the block information provided by the file system or indexing tools such as PyTorch-TFRecord [54]. If the dataset itself contains a tuple index (e.g., a map-style dataset in PyTorch), we can also partition the dataset into blocks based on the tuple index.

  (2) Block shuffle: Each process randomly picks \(\textit{BN}/\textit{PN}\) blocks, where \(\textit{BN}\) is the number of blocks and \(\textit{PN}\) is the number of processes. We implement block shuffle in our CorgiPileDataset API. At the beginning of each epoch, it first shuffles the block indices and then splits the indices into \(\textit{PN}\) parts. The i-th process only reads the blocks with indices in the i-th part.

  (3) Tuple shuffle: Each process first allocates a small buffer in memory and then constantly reads blocks into the buffer. Once the buffer is full, the process shuffles the buffered tuples. This is implemented in CorgiPileDataset as its __iter__() method, which reads blocks into a buffer, shuffles their tuples, and returns the shuffled tuples one by one. The buffer size here is much smaller than that used in single-process CorgiPile—if we set \(\textit{buffer}\_\textit{size} = \textit{BS}\) in single-process CorgiPile, we can choose \(\textit{buffer}\_\textit{size} = \textit{BS}/\textit{PN}\) for each local buffer in multi-process CorgiPile.

  (4) SGD computation: After block shuffle and tuple shuffle, each process performs mini-batch SGD on the shuffled tuples. Unlike single-process CorgiPile, which performs mini-batch SGD on the whole dataset with \(\textit{batch}\_\textit{size}=\textit{bs}\), each process in multi-process CorgiPile performs mini-batch SGD on its partition of the dataset with a smaller batch size (\(\textit{bs}/\textit{PN}\)) and updates the model with gradient synchronization for every batch. As shown in Fig. 6a, after each batch the processes synchronize/aggregate the gradients using a communication protocol (e.g., AllReduce); each process then updates its local copy of the ML model. This procedure is encapsulated inside the \(\texttt {train()}\) method, which automatically performs gradient computation/communication/synchronization and model update every time after reading a batch of tuples from CorgiPileDataset. A sketch of steps (2) and (3) follows this list.
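The following is an illustrative re-implementation of steps (2) and (3) as a PyTorch IterableDataset (not the paper’s actual CorgiPileDataset code); it assumes blocks is an in-memory list of per-block tuple lists and that torch.distributed has been initialized.

import random
import torch.distributed as dist
from torch.utils.data import IterableDataset

class BlockShuffledDataset(IterableDataset):
    def __init__(self, blocks, buffer_blocks, seed=0):
        self.blocks = blocks                       # one list of tuples per block
        self.buffer_blocks = buffer_blocks         # local buffer size, in blocks
        self.seed = seed                           # shared seed: identical block order everywhere

    def set_epoch(self, epoch):                    # vary the shuffle across epochs
        self.seed = epoch

    def __iter__(self):
        rank, world = dist.get_rank(), dist.get_world_size()
        order = list(range(len(self.blocks)))
        random.Random(self.seed).shuffle(order)    # step (2): global block-level shuffle
        my_blocks = order[rank::world]             #   this process reads BN/PN of the blocks
        buf, loaded = [], 0
        for block_id in my_blocks:                 # step (3): buffered tuple-level shuffle
            buf.extend(self.blocks[block_id])
            loaded += 1
            if loaded == self.buffer_blocks:
                random.shuffle(buf)
                yield from buf
                buf, loaded = [], 0
        random.shuffle(buf)                        # last, partially filled buffer
        yield from buf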

6.3 Single-process vs. multi-process CorgiPile

The shuffled data order of multi-process CorgiPile is comparable to that of single-process CorgiPile. Indeed, any data order generated by multi-process CorgiPile can also be generated by single-process CorgiPile (see Theorem 3). Here, we use a simple example (shown in Fig. 6) to demonstrate this. As shown in Fig. 6a, there are two processes and each randomly picks four blocks from the dataset. Each process can read two blocks into the buffer at once and shuffle their tuples in the buffer. As shown in Fig. 6b, the shuffled tuples of process 0 are in sequence from block 1/7 (denoted as \(b_{1|7}\)) and then from block 5/3. Likewise, the shuffled tuples of process 1 are in sequence from block 0/6 and then from block 2/4. Since PyTorch sequentially performs mini-batch SGD on the first \(\textit{batch}\_\textit{size}/\textit{PN}\) tuples of each process (denoted as \(g_1\) on block 1/7 and block 0/6) and aggregates their gradients (sums and averages \(g_1\)) every batch, this parallel mini-batch SGD is equivalent to the mini-batch SGD on the first \(\textit{batch}\_\textit{size}\) tuples from block 1/7/0/6 (i.e., \(g_1\) on \(b_{1|7|0|6}\)) in a single process. Therefore, from the view of the whole dataset, PyTorch with multi-process CorgiPile performs mini-batch SGD on the tuples first from block 1/7/0/6 and then from block 5/3/2/4. This is similar to the data order generated by single-process CorgiPile in Fig. 6c, where the buffer size is \(\textit{PN}\) times larger. Here, \(\textit{PN}=2\) and the buffer can keep 4 blocks at once.

To demonstrate more general cases with \(\textit{PN} > 2\), we increase the number of processes from two to four in Fig. 6d. In this case, each process first loads two blocks into the local buffer, shuffles their tuples, and then performs mini-batch SGD on a number of (\(\textit{batch}\_\textit{size}/\textit{PN}\)) shuffled tuples in each batch. For the first batch, PyTorch computes the gradient \(g_1\) in each process and then aggregates them. The aggregated \(g_1\) can be viewed as the result of performing mini-batch SGD on the first \(\textit{batch}\_\textit{size}\) shuffled tuples from block 1/7/0/11/10/6/9/5 (denoted as \(b_{1|7|0|11|10|6|9|5}\)). This \(b_{1|7|0|11|10|6|9|5}\) can also be generated in a single process, by shuffling blocks as shown in Fig. 6d and then shuffling their tuples in a large single buffer. Thus, given a data order generated by multi-process CorgiPile, we can also find an equivalent data order generated by single-process CorgiPile. The same observation holds for the next batches (e.g., \(g_2\)).

Theorem 3

Any order of data tuples generated by the multi-process CorgiPile can also be generated by the single-process CorgiPile.

Proof

Suppose that the dataset contains n blocks. After the block-level shuffle, the shuffled blocks are denoted as \(b_1, b_2, \ldots , b_n\). Suppose that the buffer can keep m blocks. The next step is tuple-level shuffle for both single-process CorgiPile and multi-process CorgiPile.

Single-process CorgiPile puts m blocks into the buffer at a time. Without loss of generality, suppose that the buffer holds \([b_1, b_2, \ldots , b_m]\). Then, after tuple-level shuffle, the tuples in the buffer come from a mixture of \([b_1, b_2, \ldots , b_m]\), denoted as \(b_{1|2|\ldots |m}\).

For multi-process CorgiPile, suppose that there are p processes. Each process has a smaller buffer that can hold \(\frac{m}{p}\) blocks. We assign blocks to processes in a round-robin manner, i.e., block \(b_i\) goes to process \(((i-1) \bmod p) + 1\). Suppose that p divides m, i.e., \(m = k\cdot p\) for some integer k. Then process j buffers \([b_j, b_{p+j}, \ldots , b_{(k-1)p + j}]\) for \(1\le j\le p\). After tuple-level shuffle, the tuples in process j come from the mixture \(b_{j|p+j|\ldots |(k-1)p + j}\). Note that mini-batch SGD sequentially computes the gradients of the shuffled tuples in each process and then sums them together. As a result, mini-batch SGD computes the gradients of tuples from the union \(\{b_{j|p+j|\ldots |(k-1)p + j}\}_{j=1}^{p}\), which is equivalent to \(b_{1|2|\ldots |m}\), i.e., the shuffled tuples generated by the single-process CorgiPile. \(\square \)
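The block bookkeeping in the proof can be sanity-checked with a few lines of code; this toy illustration (with hypothetical values m = 8 and p = 4, assuming p divides m) is not part of the formal argument:

# Toy check: with m buffered blocks and p processes, the union of the
# per-process buffers under round-robin assignment equals the set of m
# blocks that the single-process buffer would hold.
m, p = 8, 4
blocks = [f"b{i}" for i in range(1, m + 1)]            # b_1, ..., b_m after block shuffle

single_buffer = set(blocks)                            # single-process buffer contents
multi_buffers = [set(blocks[j::p]) for j in range(p)]  # process j+1 buffers b_{j+1}, b_{p+j+1}, ...

assert set().union(*multi_buffers) == single_buffer
print(sorted(sorted(b) for b in multi_buffers))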

7 Evaluation

We evaluate CorgiPile in both in-DB ML and deep learning systems, to study the statistical and hardware efficiency of CorgiPile, i.e., whether it can achieve both high accuracy and high performance. For in-DB ML systems, we compare our PostgreSQL-based implementation with two state-of-the-art systems, Apache MADlib and Bismarck, using diverse linear models and datasetsFootnote 4. We first evaluate linear models with standard SGD in PostgreSQL (Sect. 7.2). We further evaluate linear models with mini-batch SGD, as well as other types of (continuous, multi-class) datasets, in PostgreSQL. For deep learning systems, we compare CorgiPile with other shuffling strategies in PyTorch, using image classification workloads (Sect. 7.3).

7.1 Experimental setup

7.1.1 Runtime

For in-DB ML workloads, we perform the experiments on a single ecs.i2.xlarge node in Alibaba Cloud, which has 2 physical cores (4 vCPU), 32 GB RAM, 1000 GB HDD, and 894 GB SSD. The HDD has a maximum 140 MB/s bandwidth, and the SSD has a maximum 1 GB/s bandwidth. We run all experiments in PostgreSQL under CentOS 7.6, and we clear the OS cache before running each experiment.

For deep learning workloads, we run them on the ETH Euler cluster [51] as batch jobs. Each job can use at most 16 CPU cores, 160 GB RAM, and 8 NVIDIA GeForce RTX 2080 Ti GPUs. The datasets are stored in the cluster’s block-based Lustre parallel file system.

7.1.2 Datasets

For in-DB ML, we use a variety of datasets in our evaluation, including dense/sparse and small/large ones with two classes as shown in Table 2. The datasets in Table 2 are stored in PostgreSQL for in-DB ML experiments and we use both label-clustered and feature-ordered datasets. For deep learning, we use both the cifar-10 dataset with 10 classes [35] and the ImageNet dataset with 1,000 classes [21] for image classification.

Fig. 7
figure 7

The end-to-end execution time of SGD with different data shuffling strategies in PostgreSQL, for clustered datasets on HDD and SSD. Block-Only Shuffle refers to CorgiPile without tuple-level shuffle. We only show the first 5 epochs for Shuffle Once and CorgiPile, as they converge in 1–3 epochs due to their better-shuffled data order. We show all 20 epochs for the other shuffling strategies to observe whether they can converge to high accuracy

Table 2 Datasets: The first four are from LIBSVM [34]. For criteo, we extract 98M tuples from the criteo terabyte dataset. For yfcc, we extract 3.6M tuples from the yfcc100m dataset [56]; the outdoor and indoor tuples are marked as negative (-1) and positive (+1), respectively. #Tuples such as 4.5/0.5M refers to 4.5M tuples for training and 0.5M tuples for testing

7.1.3 Models and parameters

Models for in-DB ML systems. For the evaluation on in-DB ML systems, we mainly train two popular generalized linear models, logistic regression (LR) and support vector machine (SVM), which are also supported by Bismarck and MADlib. We also briefly report the evaluation results for other linear models such as linear regression and softmax regression, which are currently only supported by MADlib. Currently, Bismarck and MADlib only support two of the baseline data shuffling strategies, namely No Shuffle and Shuffle Once, against which we compare our PostgreSQL-based implementation. Note that the code of MRS Shuffle has not been released by Bismarck yet; therefore, we leave it out of our end-to-end comparisons. Instead, we implemented MRS Shuffle ourselves in PyTorch and compare against it when we discuss the convergence behavior of different data shuffling strategies (e.g., Fig. 8).

Models for DL systems. For the evaluation on deep learning systems, we train the classical VGG19 and ResNet18 models on the cifar-10 dataset, and the more complex ResNet50 model on the ImageNet dataset.

Model hyperparameters. The model hyperparameters include the learning rate, the decay factor, and the maximum number of epochs. By default, we use exponential learning rate decay with a factor of 0.95. We set the number of epochs to 20 for in-DB ML models and 50 for deep learning models. For ResNet50 on ImageNet only, we set the number of epochs to 100 and decay the learning rate every 30 epochs, following the official PyTorch-ImageNet code [57]. We use grid search to tune the learning rate over {0.1, 0.01, 0.001}. For in-DB ML, we use the same initial parameters and hyperparameters across the compared systems, including MADlib, Bismarck, and CorgiPile.
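For reference, the default exponential decay and the 30-epoch step decay map onto standard PyTorch learning-rate schedulers roughly as follows (a sketch with a placeholder model, assuming the decay is applied once per epoch):

import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # lr tuned over {0.1, 0.01, 0.001}

# Default setting: lr <- lr * 0.95 after every epoch.
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
# For ResNet50 on ImageNet, the learning rate is instead decayed by 0.1 every 30 epochs:
# sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

for epoch in range(20):        # 20 epochs for the in-DB ML workloads
    # ... one epoch of SGD over the (CorgiPile-shuffled) data ...
    sched.step()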

7.1.4 Settings of CorgiPile

CorgiPile has two additional parameters, namely the buffer size and the block size. We experiment with buffer sizes in {1%, 2%, 5%, 10%} of the dataset size and block sizes in {2 MB, 10 MB, 50 MB}. We always use the same buffer size (by default 10% of the whole dataset) for Sliding-Window Shuffle, MRS Shuffle, and our CorgiPile.

7.1.5 Settings of PostgreSQL

For PostgreSQL, we set work_mem to the maximum RAM size and tune shared_buffers. Note that PostgreSQL can further compress high-dimensional datasets using the so-called TOAST [58] technology, which tries to compress large field values or break them into multiple physical rows. For our dense epsilon and yfcc datasets with 2000+ dimensions, PostgreSQL uses TOAST to compress their features_v columns.

Fig. 8
figure 8

The convergence rates of LR and SVM with different shuffling strategies, for clustered datasets

7.2 Evaluation on SGD with in-DB ML systems

For in-DB ML, we first evaluate CorgiPile in terms of the end-to-end execution time. The compared systems include No Shuffle and Shuffle Once strategies in MADlib and Bismarck, as well as a simpler version of our CorgiPile named Block-Only Shuffle, to see how CorgiPile behaves without tuple-level shuffle. We then analyze the convergence rates, in comparison with other strategies, including MRS Shuffle and Sliding-Window Shuffle. We finally study the overhead of CorgiPile by comparing the per-epoch execution time of CorgiPile with the fastest No Shuffle baseline.

In the following, we set the buffer size to \(10\%\) of the whole dataset and block size to 10 MB for all methods. We choose these settings according to our sensitivity analysis in Sect. 7.2.4.

7.2.1 End-to-end execution time

Figure 7 presents the end-to-end execution time of SGD for in-DB ML systems, for clustered datasets on both HDD and SSD. The end-to-end execution time includes: (1) the time for shuffling the data, i.e., Shuffle Once needs to perform a full data shuffle before SGD starts runningFootnote 5; (2) the data caching time, i.e., the time spent on loading data from disk to the OS cache during the first epochFootnote 6; and (3) the execution time of all epochs.

From Fig. 7, we can observe that CorgiPile converges the fastest among all systems, usually within 1–3 epochs due to the large number of data tuples, and simultaneously achieves converged accuracy comparable to the best Shuffle Once baseline. In particular, CorgiPile converges to the same accuracy 2.9–12.8\(\times \) faster than MADlib and 2.0–4.7\(\times \) faster than Bismarck, when data is stored on HDD and SSD. This is due to the eliminated data shuffling time. For example, for the clustered yfcc dataset on HDD, CorgiPile can converge in 16 min, whereas Shuffle Once in Bismarck needs 50 min to shuffle the dataset and another 15 min to execute the first epoch (to converge). That is, when CorgiPile has converged, Shuffle Once is still shuffling the data. Similar observations hold for other datasets such as criteo and epsilon. Moreover, data shuffling using ORDER BY RANDOM() in PostgreSQL, as implemented by Shuffle Once in MADlib/Bismarck, requires 2\(\times \) disk space to generate and store the shuffled data. Therefore, CorgiPile is both more efficient and requires less space.

MADlib is slower than Bismarck because it performs more computation on some auxiliary statistical metrics and has a less efficient implementation [28]. Moreover, for high-dimensional dense datasets such as epsilon and yfcc, MADlib’s LR cannot finish even a single epoch within 4 h, due to some expensive matrix computations on a metric named stderr.Footnote 7 MADlib’s SVM implementation does not have this problem and can finish its execution on high-dimensional dense datasets. In addition, MADlib currently does not support training LR/SVM on sparse datasets such as the criteo dataset.

Table 3 The final testing accuracy of Shuffle Once (SO) and CorgiPile
Fig. 9
figure 9

The convergence rates of LR and SVM with different shuffling strategies, for feature-ordered datasets

Fig. 10
figure 10

The average per-epoch time of SGD with Bismarck (No Shuffle), CorgiPile, and CorgiPile with single buffer in PostgreSQL, for clustered datasets on HDD and SSD. It shows that CorgiPile is up to 11.7% slower than the fastest No Shuffle

7.2.2 Convergence rate comparison

For all datasets inspected, the gap in final testing accuracy between Shuffle Once and CorgiPile is below 1%, as shown in Table 3. We attribute this to the fact that CorgiPile yields good data randomness in each epoch of SGD (Sect. 4.2). No Shuffle results in the lowest accuracy when SGD converges, as illustrated in Fig. 7. The Block-Only Shuffle baseline, where we simply omit tuple-level shuffle in CorgiPile, achieves higher accuracy than No Shuffle but lower accuracy than Shuffle Once. The reason is that Block-Only Shuffle can only yield a partially random order, and for clustered data the tuples in each block can be all negative or all positive.

Since MRS Shuffle and Sliding-Window Shuffle are not available in the current MADlib/Bismarck, we use our own implementations (in PyTorch) to compare their convergence rates. Figure 8 shows the convergence rates on clustered datasets for all strategies, where Sliding-Window, MRS, and CorgiPile all use the same buffer size (10% of the whole dataset). As shown in Fig. 8, Sliding-Window Shuffle suffers from lower accuracy, whereas MRS Shuffle achieves accuracy comparable to Shuffle Once only on epsilon and yfcc and suffers on the other datasets. We further run these strategies on the feature-ordered datasets, with results in Fig. 9. Although No Shuffle, MRS, and Sliding-Window achieve higher accuracy on the feature-ordered datasets, they still exhibit gaps compared to Shuffle Once and CorgiPile on the higgs, susy, and criteo datasets. Only on epsilon (with synthetic features [59]) and yfcc (with image-extracted features) do they achieve a convergence rate similar to Shuffle Once and our CorgiPile. We observe similar results for mini-batch SGD, as detailed in Sect. 7.2.5.

7.2.3 Per-epoch overhead

To study the overhead of CorgiPile, we compare its per-epoch execution time with the fastest No Shuffle baseline, as well as the single-buffer version of CorgiPile, as shown in Fig. 10. We make the following three observations.

  • For small datasets with in-memory I/O bandwidth, the average per-epoch time of CorgiPile is comparable to that of No Shuffle.

  • For large datasets with disk I/O bandwidth, the average per-epoch time of CorgiPile is up to \(\sim \)1.1\(\times \) that of No Shuffle, i.e., it incurs at most an additional 11.7% overhead, due to buffer copy and tuple shuffle.

  • By using double-buffering optimization, CorgiPile can achieve up to 23.6% shorter per-epoch execution time, compared to its single-buffering version.

The above results show that CorgiPile with the double-buffering optimization introduces limited overhead (at most 11.7% longer per-epoch execution time) compared to the best No Shuffle baseline.

Fig. 11
figure 11

The effects of buffer size and block size on CorgiPile

Fig. 12
figure 12

The end-to-end execution time of LR and SVM using mini-batch SGD (batch_size = 128) in PostgreSQL, for clustered datasets on SSD

7.2.4 Sensitivity analysis

We next study the effects of different buffer sizes and block sizes for CorgiPile.

The effects of buffer size. Figure 11a reports the convergence behavior of CorgiPile on the two largest datasets with different buffer sizes: 1%, 2%, and 5% of the dataset size. We see that CorgiPile only requires a buffer size of 2% to maintain the same convergence behavior as Shuffle Once. With a 1% buffer, it only converges slightly slower than Shuffle Once, but achieves the same final accuracy. On the other hand, as discussed in previous sections, Sliding-Window Shuffle and MRS Shuffle achieve a much lower accuracy even when given a much larger buffer (10%).

The effects of block size. We vary the block size in \(\{\)2 MB, 10 MB, 50 MB\(\}\) on the large criteo and yfcc datasets. Figure 11b shows that the per-epoch time decreases as the block size increases from 2 MB to 50 MB, due to the higher I/O bandwidth (throughput). However, the time difference between 10 MB and 50 MB is limited (under 10%), because 10 MB already achieves the highest possible I/O bandwidth (130 MB/s on HDD). Since the I/O performance of CorgiPile depends on the speed of randomly accessing blocks, a key question is how to choose an appropriate block size. In practice, as illustrated in Fig. 4, we recommend that users choose the smallest block size that achieves I/O bandwidth similar to a sequential read on their devices. To assist CorgiPile users with this task, we further developed a tool that explores the relationship between block size and disk I/O bandwidth via profiling, which is available at [60].
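The idea behind such profiling can be sketched as follows; this is an illustrative script (not the released tool), with a placeholder dataset path, that measures the bandwidth of random block-sized reads for several candidate block sizes (for meaningful numbers, the OS page cache should be cleared between runs):

import os, random, time

def random_read_bandwidth(path, block_size, n_reads=64):
    """Read n_reads random block-aligned chunks and return the achieved MB/s."""
    file_size = os.path.getsize(path)
    n_blocks = max(1, file_size // block_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_reads):
            offset = random.randrange(n_blocks) * block_size
            os.pread(fd, block_size, offset)          # one random block-sized read
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_reads * block_size / elapsed / 2**20

if __name__ == "__main__":
    for mb in (2, 10, 50):                            # candidate block sizes
        bw = random_read_bandwidth("/path/to/dataset.bin", mb * 2**20)
        print(f"{mb} MB blocks: {bw:.0f} MB/s")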

In the previous experiments, we focused on the standard SGD algorithm, which updates the model per tuple. Since it is also common to use mini-batch SGD, we implemented mini-batch SGD for CorgiPile, Shuffle Once, No Shuffle, and Block-Only Shuffle, using our in-DB operators in PostgreSQL. We compare these shuffling strategies only based on our PostgreSQL implementations, since MADlib and Bismarck currently do not support mini-batch SGD for linear models.

Fig. 13
figure 13

The convergence rates of LR and SVM using mini-batch SGD (batch_size = 128), for clustered datasets

Fig. 14
figure 14

The convergence rates of LR and SVM using mini-batch SGD (bs = 128), for feature-ordered datasets

7.2.5 Mini-batch LR and SVM models

We first run LR and SVM using mini-batch SGD on the clustered datasets. Figure 12 illustrates the end-to-end execution time of these two models in PostgreSQL on SSD. The result is similar to that of the standard SGD. Our CorgiPile achieves convergence rate and accuracy comparable to Shuffle Once while converging 1.7–3.3\(\times \) faster. Other strategies such as No Shuffle and Block-Only Shuffle suffer from either lower converged accuracy or a lower convergence rate.

In comparison with other shuffling strategies, Figs. 13 and 14 show the convergence rates of different shuffling strategies with batch_size = 128, for clustered and feature-ordered datasets, respectively. We observe that CorgiPile (as well as the best Shuffle Once baseline) often significantly outperforms Sliding-Window Shuffle and MRS Shuffle, in terms of convergence rate and/or model accuracy, on both clustered and feature-ordered datasets. This result reveals that CorgiPile also works for mini-batch SGD, whereas other shuffling strategies are suboptimal.

Fig. 15
figure 15

The end-to-end time of linear and softmax regression in PostgreSQL using different batch sizes (bs = 1 and bs = 128), for clustered datasets on SSD

7.2.6 Linear regression and softmax regression models

Apart from LR/SVM on binary-class datasets, users may also want to train ML models on continuous and multi-class datasets in the database. Thus, we further implemented linear regression for continuous datasets and softmax regression (i.e., multinomial logistic regression) for multi-class datasets, based on our in-DB operators inside PostgreSQL. Figure 15 shows the end-to-end execution time of linear regression on the continuous YearPredictionMSD clustered dataset [34] and softmax regression on the 10-class mini8m clustered dataset [34], with different batch sizes on SSD. CorgiPile again achieves convergence rate and model accuracy (i.e., the coefficient of determination \(R^2\) for linear regression) similar to Shuffle Once, but converges 1.6–2.1\(\times \) faster.

7.3 Evaluation with deep learning system

CorgiPile is a general data shuffling strategy for any SGD implementation. To understand its impact on deep learning systems and workloads, we implement the CorgiPile strategy as well as the other strategies in PyTorch and compare them using deep learning models for image classification. In the following, we first evaluate the end-to-end performance and convergence rate of CorgiPile on the ImageNet dataset. We then study the convergence rate of CorgiPile in detail.

Fig. 16
figure 16

The convergence rates of ResNet50 with different data shuffling strategies, for the clustered ImageNet dataset. Note that the original ImageNet dataset is clustered by the labels. TopN refers to the Top-N accuracy

Fig. 17
figure 17

The convergence rates of deep learning models with different data shuffling strategies and batch sizes, for the clustered 10-class cifar-10 image dataset

7.3.1 Performance comparison

To evaluate the performance of CorgiPile in PyTorch, we train ResNet50 on ImageNet, which has 1.3 million images in 1,000 classes. We run this experiment using multi-process CorgiPile with 8 GPUs and 16 CPU cores in our cluster. We evaluate two different block sizes (5 MB and 10 MB, with about 50 and 100 images per block, respectively), as our cluster reads data in terms of 4 MB+ blocks. The batch size is set to 512 images, so each process performs SGD computation on \(512/8 = 64\) images per batch. The buffer size of each process is 1.25% of the whole dataset, so the total buffer size across all processes is 10% of the whole dataset. The number of data loading threads per process is set to two, as we have twice as many CPU cores as GPUs. The learning rate is initialized to 0.1 and decayed every 30 epochs with a multiplicative factor of 0.1.

Figure 16 illustrates the end-to-end execution time of the ResNet50 model on the large ImageNet dataset, using different shuffling strategies. We report both the Top-1 and Top-5 accuracy. From Fig. 16a, b, we can observe that CorgiPile converges 1.5\(\times \) faster than Shuffle Once, and the converged accuracy of CorgiPile is similar to that of Shuffle Once. The main reason for the slowness of Shuffle Once is that it needs about 8.5 h to shuffle the large (\(\sim \)150 GB) ImageNet dataset and store the shuffled copy in our cluster, by randomly accessing the raw images and merging them into large binary files [54], since the Lustre file system does not allow users to store/access small raw image files [61]. Lustre is widely used in HPC clusters and has been used by many of the top supercomputers and large multi-cluster sites.Footnote 8 In contrast, CorgiPile eliminates this long data shuffling time. The second reason is that CorgiPile has limited per-epoch overhead. Although CorgiPile incurs block shuffle and tuple shuffle overhead, the per-epoch time of CorgiPile with 5 MB or 10 MB blocks is only \(\sim \)15% longer than that of the fastest No Shuffle baseline, similar to what we observed in the previous experiments on PostgreSQL. Recall that CorgiPile reads data in terms of blocks, whose I/O performance is comparable to sequential reads on a block-based parallel file system.

Fig. 18
figure 18

The convergence rates of deep learning models with different data shuffling strategies and batch sizes, for the feature-ordered 10-class cifar-10 image dataset

We further compare the convergence rates in Fig. 16c, d. We can see that the convergence rates of CorgiPile with 5 MB/10 MB block sizes are comparable to that of Shuffle Once. Although CorgiPile with a 10 MB block size has a lower convergence rate than Shuffle Once in the first 30 epochs, it catches up in the following epochs and converges to similar accuracy.

7.3.2 Convergence rate comparison

To compare CorgiPile with other data shuffling strategies, we train deep learning models (VGG19 and ResNet18) on the cifar-10 image dataset using a single GPU. The cifar-10 dataset contains 50,000 training images in 10 classes, and we use both the clustered and the feature-ordered versions of the cifar-10 dataset.

Figure 17 illustrates the convergence rates of the VGG19 and ResNet18 models with different batch sizes (64 and 128) on the clustered cifar-10 dataset. The buffer size is \(10\%\) of the whole dataset and the block size is set to 100 images per block. The figure shows that CorgiPile achieves convergence rate and accuracy comparable to the Shuffle Once baseline, whereas other strategies suffer from lower accuracy due to the partially random order of the shuffled tuples. Specifically, the Sliding-Window Shuffle used by TensorFlow performs better only than No Shuffle, and suffers from a large (50%+) accuracy gap compared to Shuffle Once and CorgiPile.

We further repeat the experiments on the feature-ordered cifar-10 dataset, and Fig. 18 presents the results. It again shows that the convergence rate of CorgiPile is comparable to Shuffle Once, whereas No Shuffle and MRS Shuffle suffer from lower accuracy or a lower convergence rate. Only Sliding-Window Shuffle achieves a convergence rate similar to Shuffle Once and CorgiPile.

The above results indicate that CorgiPile achieves both good statistical efficiency and good hardware efficiency for deep learning models on non-convex optimization problems. When integrated into PyTorch, CorgiPile is 1.5\(\times \) faster than the Shuffle Once baseline on the large ImageNet dataset in our experiments.

8 Related work

Stochastic gradient descent (SGD). SGD is broadly used in machine learning to solve large-scale optimization problems [62]. It admits a convergence rate of O(1/T) for strongly convex objectives and \(O(1/\sqrt{T})\) for the general convex case [63, 64], where T refers to the number of iterations. For non-convex optimization problems, an ergodic convergence rate of \(O(1/\sqrt{T})\) is proved in [64], and the convergence rate is O(1/T) (e.g., [40]) under the Polyak-Łojasiewicz condition [65]. In the analyses of the above cases, the common assumption is that data is sampled uniformly and independently with replacement in each epoch. We refer to SGD methods based on this assumption as vanilla SGD.

Data shuffling strategies for SGD. In practice, full-shuffle (random-shuffle) SGD is a more efficient way of implementing SGD [22]: in each epoch, the data is reshuffled and iterated over one by one without replacement. Empirically, random-shuffle SGD is also observed to converge much faster than vanilla SGD [38,39,40]. In Sect. 3, we empirically studied the state-of-the-art data shuffling strategies for SGD, including Epoch Shuffle, No Shuffle, Shuffle Once, Sliding-Window Shuffle [17], and MRS Shuffle [5]. Our empirical study shows that Shuffle Once achieves a good convergence rate but suffers from low performance, whereas the other strategies suffer from low accuracy. In addition, there has been previous work on bi-sampling [66], which has been used in the context of online aggregation [67]. Bi-sampling first selects/samples pages using Bernoulli sampling and then shuffles/samples the tuples inside each page. Unlike CorgiPile, it does not shuffle tuples across pages, which makes it similar to the Block-Only Shuffle baseline used in our experimental evaluation.

In-DB ML. Previous work [4,5,6,7,8,9,10,11,12, 14, 15, 25,26,27, 68, 69] has intensively discussed how to implement ML models on relational data, such as linear models [6,7,8], linear algebra [9, 12, 26], factorization models [10], neural networks [25,26,27], and other statistical learning models [11], using batch gradient descent (BGD) or SGD, over joins or self-defined matrices/tensors, etc. The most common way of integrating ML algorithms into an RDBMS is to use user-defined aggregate functions (UDAs). The representative in-DB ML tools are Apache MADlib [4, 13] and Bismarck [5], which use PostgreSQL’s UDAs to implement SGD, and leverage SQL LOOP (Bismarck) or a Python driver (MADlib) to implement iterations. Recently, DB4ML [28] proposed another approach called iterative transactions to implement iterative SGD/graph algorithms in a DB. However, it still uses/assumes the same Shuffle Once strategy as Bismarck/MADlib. Since the source code of DB4ML has not been released yet, we only compare with MADlib and Bismarck.

Scalable ML for distributed data systems. In recent years, there has been active research on integrating ML models into distributed database systems to enable scalable ML, such as MADlib on Greenplum [70], Vertica-ML [71], Google’s BigQuery ML [72], and Microsoft SQL Server ML Services [73]. Another trend is to leverage big data systems to build scalable ML models based on different architectures, e.g., MPI [74, 75], MapReduce [76,77,78], Parameter Server [79,80,81], and decentralization [82, 83]. Recent work has also started to discuss how to integrate deep learning into databases [84, 85]. Our CorgiPile is a general data shuffling strategy for SGD and has been integrated into PostgreSQL and PyTorch. We believe that CorgiPile can potentially be integrated into more of the above distributed data systems.

9 Conclusion

We have presented CorgiPile, a novel data shuffling strategy for efficient SGD computation on top of block-addressable secondary storage systems. It adopts a two-level hierarchical shuffle mechanism that avoids the computation and storage overhead of full data shuffling while retaining a convergence rate similar to that of SGD with a full data shuffle. We provide a theoretical analysis of the convergence behavior of CorgiPile and further integrate it into both PostgreSQL and PyTorch. Our experimental evaluation demonstrates both the statistical and hardware efficiency of CorgiPile compared to state-of-the-art in-DB ML and deep learning systems.