
1 Introduction

In recent years, machine learning has gained great popularity as a powerful technique for big data analysis. In particular, Gradient Boosted Decision Tree (GBDT) [6] is a widely used machine learning technique for analyzing massive data with diverse features and sophisticated dependencies [17]. GBDT has been applied in various application areas, such as drug discovery [24], particle identification [18], image labeling [16] and automatic detection [8].

GBDT is an ensemble machine learning model that trains multiple decision trees sequentially. A decision tree is a binary tree with a split condition at each internal node and a target value at each leaf. In each iteration, GBDT fits a new tree to the residual errors of the current ensemble in order to predict the hidden relationships or values. Consequently, GBDT spends most of its time learning decision trees, and finding the best split points is the key hotspot [9]. In addition, GBDT suffers from irregular memory accesses, which makes it challenging to achieve optimal performance on emerging many-core processors such as Sunway [7].

Equipped with Sunway SW26010 processors, Sunway TaihuLight is the first supercomputer to reach a peak performance of over 125 PFLOPS [4]. The Sunway processor adopts a many-core architecture with 4 Core Groups (CGs), each of which consists of a Management Processing Element (MPE) and 64 Computation Processing Elements (CPEs) [7]. Each CPE has a 64 KB manually-controlled Local Device Memory (LDM). The Sunway many-core architecture also provides DMA and register communication for efficient memory access and communication on CPEs.

In this paper, we propose an efficient GBDT implementation for the Sunway many-core processor, whose unique architectural designs make it an attractive target for accelerating GBDT. The hotspot of GBDT can be divided into two parts: 1) sorting all the feature values before computing the gains; and 2) computing the gain for every possible split. To speed up the hotspot, we partition the data into finer granularities such as blocks and tiles to enable efficient data access on CPEs. To improve the sorting performance, we divide the CPEs into multiple roles to pipeline the segmenting, sorting and merging computations with better parallelism. We evaluate the optimized GBDT implementation, swGBDT, on representative datasets and demonstrate its superior performance compared to other implementations on Sunway.

Specifically, this paper makes the following contributions:

  • We propose a memory access optimization mechanism that partitions the data into different granularities such as blocks and tiles, in order to leverage the LDM and register communication for efficient data access on CPEs.

  • We propose an efficient sorting algorithm on Sunway by segmenting and sorting the data in parallel and then merging the sorted sequences. During the sorting and merging, we divide the CPEs into multiple roles for pipelining the computation.

  • We implement swGBDT and evaluate its performance on representative datasets by comparing with the serial implementation on the MPE and parallel XGBoost on the CPEs. The experimental results show speedups of 4.6\(\times \) and 2\(\times \), respectively.

This paper is organized as follows. We give a brief introduction to the GBDT algorithm and the Sunway architecture as background in Sect. 2. Section 3 describes our design methodology for swGBDT. Section 4 presents the implementation details of swGBDT. In Sect. 5, we compare the performance of swGBDT with both a serial GBDT implementation and parallel XGBoost on synthesized and real-world datasets. Related work is presented in Sect. 6 and we conclude in Sect. 7.

2 Background

2.1 Sunway Many-Core Processor

The Sunway SW26010 many-core processor is the primary unit of the Sunway TaihuLight supercomputer. Figure 1 illustrates the many-core architecture within a single Core Group (CG) of SW26010; a single SW26010 processor contains four CGs. The peak double-precision performance of a CG is up to 765 GFLOPS, while its theoretical memory bandwidth is 34.1 GB/s. Each CG comprises a Management Processing Element (MPE), 64 Computation Processing Elements (CPEs) organized in an \(8\times 8\) array and 8 GB of main memory. The MPE, whose structure is similar to that of mainstream processors, is in charge of task scheduling, while the CPEs are designed for high computing throughput, each with a 16 KB L1 instruction cache and a 64 KB programmable Local Device Memory (LDM). There are two ways to move data between main memory and a CPE's LDM: DMA and global load/store (gld/gst). For contiguous memory accesses, DMA offers much higher bandwidth than gld/gst. The SW26010 architecture also provides an efficient and reliable register communication mechanism between CPEs in the same row or column of the array, which has even higher bandwidth than DMA.

Fig. 1. The many-core architecture of a Sunway core group.

2.2 Gradient Boosted Decision Tree

Gradient Boosted Decision Tree (GBDT) was developed by Friedman [6]. The pseudo-code of the GBDT algorithm is presented in Algorithm 1 [20]. The training of GBDT involves values from multiple instances under different attributes, and GBDT has several hyperparameters: the number of trees N, the maximum tree depth \(d_{max}\) and the validation threshold of split points \(\beta \). To store the dataset, the sparse format [20] reduces memory cost by storing only the non-zero values, instead of the values of all attributes of all instances as in the dense format. We use the sparse format in swGBDT.

Moreover, as shown in Algorithm 1, when the loss function is the mean squared error, GBDT trains the decision trees iteratively on the residual errors. During each iteration, in order to find the best split points, which is the bottleneck of GBDT, the algorithm searches for the maximum gain within each attribute; this yields a preliminary split point that is appended to the set P, and the best split point is finally extracted from P using the validation threshold \(\beta \). The primary operations in searching for the best split points are therefore gain computation and sorting. The gain over all instances of one attribute is derived from Eq. 1, where \(G_L\) and \(G_R\) are the sums of the first-order derivatives of the loss function in the left and right node respectively, and \(H_L\) and \(H_R\) are the corresponding sums of the second-order derivatives. The first-order and second-order derivatives are computed from Eq. 2 and Eq. 3 respectively, where E is the loss function, set to the mean squared error in swGBDT.

$$\begin{aligned} gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda } + \frac{G_R^2}{H_R + \lambda } - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda }\right] \end{aligned}$$
(1)
$$\begin{aligned} g_i = \frac{\partial E}{\partial y_i} \end{aligned}$$
(2)
$$\begin{aligned} h_i = \frac{\partial ^2 E}{\partial y_i^2} \end{aligned}$$
(3)
Algorithm 1 (pseudo-code figure)
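To make Eqs. 1-3 concrete, the following C sketch (our illustration, not code from the paper; the function names and the flat loop are assumptions) computes the per-instance derivatives for the mean squared error loss and the gain of a candidate split from the left-side sums and the node totals:

    #include <stddef.h>

    /* Eqs. 2 and 3 for the squared-error loss E = 1/2 * (pred - label)^2:
     * g_i = pred_i - label_i, h_i = 1. */
    void compute_g_h(const float *pred, const float *label, size_t n,
                     float *g, float *h) {
        for (size_t i = 0; i < n; i++) {
            g[i] = pred[i] - label[i];   /* first-order derivative  */
            h[i] = 1.0f;                 /* second-order derivative */
        }
    }

    /* Eq. 1 for one candidate split, given the left-side sums G_L, H_L and
     * the node totals G, H (so G_R = G - G_L and H_R = H - H_L). */
    float split_gain(float GL, float HL, float G, float H, float lambda) {
        float GR = G - GL, HR = H - HL;
        return 0.5f * (GL * GL / (HL + lambda)
                     + GR * GR / (HR + lambda)
                     - G * G / (H + lambda));
    }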

2.3 Challenges for Efficient GBDT Implementation on Sunway Processor

In order to implement GBDT algorithm efficiently on Sunway, there are two challenges to be addressed:

  1.

    How to leverage the unique many-core architecture of Sunway for effective acceleration. Unlike random forest, where the trees are independent of each other, the computation of each tree in GBDT depends on the result of the previous tree, which prevents tree-level parallelism. Therefore, we need a more fine-grained parallel scheme to fully utilize the CPEs.

  2.

    How to improve the efficiency of memory access during GBDT training. The large number of random memory accesses during GBDT training leads to massive gld/gst operations with high latency, and the resulting poor locality deteriorates the performance of GBDT. Therefore, we need a better way to improve memory access efficiency.

3 Methodology

3.1 Design Overview

In this paper, data partitioning and CPE division are used to reduce memory access time through prefetching. For data partitioning, as shown in Fig. 2, we first divide the data evenly into blocks according to the number of CPEs participating in the computation, and then divide each block into tiles according to the available space of each CPE's LDM. While a CPE computes on one tile, DMA is used to prefetch the next tile; this double buffering hides the data access latency. When multi-step memory accesses or accesses to multiple arrays are needed simultaneously (e.g., computing \(A[i] = B[i] + C[i]\) requires accessing arrays A, B and C at the same time), we divide the CPEs into data cores called loaders and computing cores called savers. Loaders prefetch data and send it to savers through register communication for computation. Moreover, since sorting is the most time-consuming operation, we propose an efficient sorting method: we first divide the data to be sorted evenly into 64 segments and sort them separately on the 64 CPEs to achieve the maximum speedup, and then merge the 64 sorted segments by assigning the CPEs different roles and levels and using register communication. Each 128-bit message transferred by register communication is divided into four 32-bit parts as shown in Fig. 3.
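As an illustration of the message layout (a sketch under our own naming, not the paper's code), a 128-bit register message can be viewed in C as four 32-bit fields, for example two key-value pairs:

    #include <stdint.h>

    /* One 128-bit register-communication message viewed as four 32-bit parts.
     * Which parts hold keys, values or flags follows the formats of Fig. 3;
     * the field names here are only illustrative. */
    typedef union {
        uint32_t part[4];                      /* four 32-bit parts          */
        struct {
            int32_t  key0;  uint32_t val0;     /* first key-value pair       */
            int32_t  key1;  uint32_t val1;     /* second key-value pair      */
        } kv;                                  /* key = -1 marks end of data */
    } reg_msg_t;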

Fig. 2. The illustration of bistratal array blocking.

3.2 Data Prefetching on CPE

When a large array ARR of length N participates in the computation, we first partition it evenly into K blocks (normally 64; 32 when the CPE division described in Sect. 3.1 is needed), so that each CPE processes one block. Because the processing time of each element of ARR is normally the same, this static partitioning achieves load balance. Each CPE then divides its block into tiles according to its usable LDM size. If a tile is too small, more DMA transactions are needed; if it is too large, the tile cannot fit into the LDM. We therefore use the equation \(T = \tfrac{M}{\sum _{0}^{n-1}P_i}\) to determine the tile size, where T denotes the number of elements per tile, M denotes the usable LDM space, n denotes the number of arrays involved in the task and \(P_i\) denotes the element size of each array. Because DMA is asynchronous, no further computation is needed after the request is issued, so it can be overlapped with computation. We thus use double buffering to hide the DMA time. At the beginning, the CPE loads the first tile, issues a DMA request to prefetch the next tile and then starts computing. Every time it finishes computing one tile, the next tile has already been prefetched by DMA, so the CPE issues a new DMA transaction for the next uncached tile and starts computing on the cached tile. This is how our double buffering works.
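The following C sketch illustrates this double buffering. It is a simplified illustration: dma_get_async() and dma_wait() are hypothetical wrappers standing in for the Sunway asynchronous DMA intrinsics, TILE is the tile size derived from the LDM budget, and the block length is assumed to be a multiple of TILE.

    #include <stddef.h>

    #define TILE 512   /* elements per tile, chosen from the usable LDM space */

    /* Hypothetical wrappers (assumptions): an asynchronous main-memory -> LDM
     * transfer and a wait for the single outstanding transfer. */
    void dma_get_async(void *ldm_dst, const void *mem_src, size_t bytes);
    void dma_wait(void);
    void compute_tile(const float *tile, size_t n);

    void process_block(const float *block, size_t block_len) {
        static float buf[2][TILE];                 /* double buffer in LDM     */
        size_t ntiles = block_len / TILE;          /* assume an exact multiple */

        dma_get_async(buf[0], block, TILE * sizeof(float));   /* load tile 0   */
        for (size_t t = 0; t < ntiles; t++) {
            dma_wait();                            /* tile t is now resident   */
            if (t + 1 < ntiles)                    /* prefetch tile t+1 ...    */
                dma_get_async(buf[(t + 1) & 1], block + (t + 1) * TILE,
                              TILE * sizeof(float));
            compute_tile(buf[t & 1], TILE);        /* ... while computing tile t */
        }
    }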

In the training process of GBDT, we face computing tasks like \(C[i] = func(A[B[i]])\) that need multi-step memory access. Because of the gld operation in the second step, this memory access has high latency and low bandwidth; moreover, having all CPEs access memory at the same time overloads the memory controller. In these cases we use the CPE division mode: half of the CPEs serve as data cores called loaders and the other half as computing cores called savers (they also save the final results to main memory). Each saver is paired with exactly one loader. The multiple roles of CPEs and the data communication are shown in Fig. 4. A loader first uses the data partitioning scheme to prefetch tiles from array B, then uses gld to fetch the values A[B[i]], and finally sends them to its saver by register communication using the message format in Fig. 3(a). The saver computes C[i], stores it into a buffer and writes the results to main memory with a DMA request whenever a buffer is filled.
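A minimal sketch of this loader/saver pairing is shown below (our illustration; reg_send()/reg_recv(), dma_get()/dma_put() and func() are assumed placeholders for the register communication, DMA and user computation, a loader and its saver are assumed to share one row, and the range length is assumed to be a multiple of TILE):

    #include <stddef.h>
    #include <stdint.h>

    #define TILE 512

    typedef struct { uint32_t idx; float val; } pair_t;   /* fits one message  */

    /* Assumed placeholders for the hardware primitives and the computation.   */
    void   dma_get(void *ldm_dst, const void *mem_src, size_t bytes);
    void   dma_put(void *mem_dst, const void *ldm_src, size_t bytes);
    void   reg_send(pair_t m);          /* to the paired CPE in the same row   */
    pair_t reg_recv(void);
    float  func(float x);

    /* Loader: DMA a tile of indices B, gather A[B[i]] with gld, forward pairs. */
    void loader(const float *A, const int *B, size_t lo, size_t hi) {
        int idx[TILE];
        for (size_t base = lo; base < hi; base += TILE) {
            dma_get(idx, B + base, TILE * sizeof(int));      /* contiguous: DMA */
            for (size_t i = 0; i < TILE; i++) {
                pair_t m = { (uint32_t)(base + i), A[idx[i]] };  /* random: gld */
                reg_send(m);
            }
        }
    }

    /* Saver: receive pairs, compute C[i] = func(A[B[i]]), flush full buffers.  */
    void saver(float *C, size_t lo, size_t hi) {
        float out[TILE];
        size_t filled = 0, dst = lo;
        for (size_t i = lo; i < hi; i++) {
            pair_t m = reg_recv();
            out[filled++] = func(m.val);
            if (filled == TILE) {                   /* write back one full tile */
                dma_put(C + dst, out, TILE * sizeof(float));
                dst += TILE; filled = 0;
            }
        }
    }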

Fig. 3. The message formats used in register communication.

3.3 Sort and Merge

Sorting large arrays is the main hotspot. To make full use of the 64 CPEs and maximize parallelism, we first divide the array evenly into 64 segments; each CPE sorts one segment, yielding 64 sorted sequences \(A_0, A_1, A_2, \ldots , A_{63}\), which are then merged to obtain the final result. As shown in Fig. 5, each round merges the sequences pairwise: 32 sorted sequences remain after the first round, 16 after the second, and so on, so 6 rounds are needed to obtain the final result. Because unmerged data may otherwise be overwritten, as shown in Fig. 6, the data source and destination of every merging round must differ. This means at least two passes of memory reads and writes are needed: reading from the source and writing to a temporary location, then reading from the temporary location and writing back to the source. Without data reuse through register communication, each merging round requires one pass of memory reads and writes (reading the data to be merged and writing the merged result back), i.e., 6 passes for 6 rounds of merging, which leads to a large amount of unnecessary data movement. In this paper, we divide the CPEs into different roles and levels and use register communication to reduce the number of memory read/write passes from six to two. To this end, the 6 rounds of merging are divided into two steps. The first step contains only the first round and writes the merged data to a temporary location. The second step contains the remaining 5 rounds and writes the final result back to the source location. Because the CPEs are divided into different roles and levels and form a pipeline, the intermediate merging results are transmitted by register communication to the CPEs performing the next round instead of being written back to memory, so only one pass of memory reads and writes is needed. The formats of Fig. 3(b) and (c) are used for register communication.
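The resulting schedule can be summarized in a few lines (a sketch of the round structure only, not the paper's code):

    /* 64 sorted segments are merged pairwise over 6 rounds.  Step one is
     * round 1 (source -> temporary buffer); step two is rounds 2-6, where
     * round 2 reads the temporary buffer, rounds 2-5 pass their results from
     * CPE to CPE through register communication, and only round 6 writes the
     * single merged sequence back to the source. */
    void merge_schedule(void) {
        int remaining = 64;
        for (int round = 1; round <= 6; round++) {
            remaining /= 2;                          /* 32, 16, 8, 4, 2, 1      */
            int reads_memory  = (round == 1) || (round == 2);  /* source / temp */
            int writes_memory = (round == 1) || (round == 6);  /* temp / source */
            (void)reads_memory; (void)writes_memory; /* rounds 3-5: registers only */
        }
        (void)remaining;
    }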

Fig. 4. The multiple roles of CPEs and data communication.

In the first step, the CPEs are divided into two types, loaders and savers. Each loader corresponds to one saver in the same row, so register communication can be performed directly. Every loader reads two sequences with the prefetching method described in Sect. 3.1 and sends the merged data to its saver through register communication. The roles and data stream are similar to the left part of Fig. 4, except that in merging no gld is needed and all data is fetched by DMA. Since all keys to be sorted are non-negative integers, a loader sends a message with the key field set to −1 to its saver as an end-of-transmission flag after all its data has been fetched, merged and sent. Each saver holds a double buffer and stores the data received from its loader into it. Whenever one half of the double buffer is full, the saver writes it back to memory with a DMA request and switches to the other half for receiving data. When a saver receives a message whose key field is −1, indicating the end of transmission, it writes all remaining data in its double buffer back to memory and finishes.

In the second step, the CPEs are divided into three types: loaders, savers and workers. Part of the CPEs' hierarchical topology, division and data stream is shown in the right part of Fig. 4. Workers are assigned to levels according to the data flow: the workers that receive data directly from loaders are at the lowest level, the worker that sends data directly to the saver is at the highest level, and the levels of the others increase along the data stream. In Fig. 4, \(W_n\) denotes a level-n worker. Loaders read data of two different sequences from memory, merge them and send the result to the lowest-level workers through register communication. Each worker receives data from two different lower-level workers, merges it and sends the result to a higher-level worker through register communication. There is only one highest-level worker; it sends the merged data to the saver instead of another worker, and the saver writes the result back to memory. As before, a key field of −1 is used as the end flag.

Fig. 5. The illustration of sequence merging.

Fig. 6. The illustration of read-write conflict (OD denotes the original data and T denotes the temporary memory).

4 Implementation

In this section, we present the implementation details of swGBDT, especially focusing on the gain computation and sorting process which are the major hotspots of GBDT. Moreover, we also describe the communication scheme for CPEs in detail.

4.1 Processing Logic of Gain Computation

As shown in Algorithm 2, to find the best gains we need to consider every possible split and compute its gain according to Eq. 1. In the equation, \(G_L\) and \(G_R\) are the sums of g over the instances on the left and right side respectively, and \(H_L\) and \(H_R\) are the corresponding sums of h; g and h are computed by Eq. 2 and 3. Naturally, every feature value is a possible split point, but not all instances have every feature, so instances missing the feature used by a split can be handled in two ways. One is to place them on the left side, i.e., to assume their feature values are smaller than the split point; the other is to place them on the right side, i.e., to assume their feature values are larger than the split point.

Algorithm 2 (pseudo-code figure)

Through a prefix sum operation, we already know, for every possible split, the sum of g/h over the instances that have the feature used by the split; the sum of g/h of the missing instances has also been calculated. Thus we can easily obtain the left-side sums for both kinds of division by simple addition. The sum of g/h of the father node, which is the sum over all its instances, is also known from the previous iteration, so the sums for the other side can be obtained by a simple subtraction. All of these values are therefore used as inputs of the algorithm. In the algorithm, we first obtain the index of the node to be split and fetch the \(father\_gh\) value with that index. Then we calculate the gains with the missing instances on the left and on the right, respectively; we keep only the larger one, keeping the original value if the missing instances are on the left or taking the opposite if they are on the right. Since fetching the fatherGH and missingGH values both involve two-step memory accesses, and the memory location of the second step cannot be predicted because it depends on the result of the first step, the data cannot easily be loaded into the LDM by DMA. This means high-latency gld is needed. To reduce the performance loss, we divide the CPEs into loaders and savers: loaders load the possible splits into the LDM using DMA, fetch the fatherGH and missingGH values with gld, and finally send the data to their savers with register communication. Savers receive the data, compute the gains and write them back to memory using DMA.
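A sketch of this per-split computation on the saver side is given below (our illustration, reusing the split_gain helper sketched in Sect. 2.2; it simply takes the larger of the two gains instead of the sign encoding used in Algorithm 2, and all variable names are assumptions):

    /* Gain of one candidate split when the instances missing the feature are
     * placed on the left or on the right; the prefix sums over the present
     * instances (GL_present, HL_present), the sums over the missing instances
     * and the father-node totals are assumed to be precomputed inputs. */
    float best_gain_for_split(float GL_present, float HL_present,
                              float G_missing,  float H_missing,
                              float G_father,   float H_father, float lambda) {
        float gain_missing_left  = split_gain(GL_present + G_missing,
                                              HL_present + H_missing,
                                              G_father, H_father, lambda);
        float gain_missing_right = split_gain(GL_present, HL_present,
                                              G_father, H_father, lambda);
        return gain_missing_left > gain_missing_right ? gain_missing_left
                                                      : gain_missing_right;
    }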

4.2 Processing Logic of Sorting

For sorting, we split the whole sequence to be sorted evenly into \(A_0, A_1, A_2, \ldots , A_{63}\); each element consists of two parts, a key and a value. We use a key-based radix sort. As the key is a 32-bit integer, the time complexity is \( O(\lceil 32/r \rceil \times n)\), where r is the number of bits of the radix, i.e., r bits are processed in each pass. The larger r is, the lower the time complexity. However, because of the ceiling operation, if r is set to a value \(r_0\) that is not a factor of 32, the time complexity is the same as for the largest factor of 32 smaller than \(r_0\). The factors of 32 are 1, 2, 4, 8, 16 and 32, and the 64 KB LDM can hold at most 16384 32-bit integers. If r were 16, \(2^{16}=65536\) buckets would be needed, exceeding the LDM even if each bucket held only one element. Therefore r can only be 8, and four passes are needed to finish sorting. Because every CPE sorts independently, we obtain 64 internally ordered sequences \(B_0, B_1, \ldots , B_{63}\) that are not ordered with respect to each other, so a merging operation is needed to get the final result.
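A plain C sketch of the per-CPE sort follows (our illustration of a 4-pass LSD radix sort with an 8-bit digit; on a real CPE the buffers would have to fit within the 64 KB LDM):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t key; uint32_t val; } kv_t;   /* key-value element */

    /* LSD radix sort with r = 8: four stable counting-sort passes over the
     * four bytes of the 32-bit non-negative key. */
    void radix_sort_r8(kv_t *a, kv_t *tmp, size_t n) {
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[256] = {0};
            for (size_t i = 0; i < n; i++)                  /* histogram digits */
                count[(a[i].key >> shift) & 0xFF]++;
            size_t pos[256], acc = 0;
            for (int d = 0; d < 256; d++) { pos[d] = acc; acc += count[d]; }
            for (size_t i = 0; i < n; i++)                  /* stable scatter   */
                tmp[pos[(a[i].key >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof(kv_t));
        }
    }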

For merging, we use the loader-saver mode to divide the CPEs in the first step. For a stable sort, as shown in Algorithm 3, the \(i^{th}\) loader reads the data from \(B_{2i}\) and \(B_{2i+1}\) and merges them. We can view the two sequences as two queues: the queue holding data from \(B_{2i}\) is called \(q_0\) and the queue holding data from \(B_{2i+1}\) is called \(q_1\); reading a tile means that the data of the tile is enqueued into the corresponding queue. The keys at the heads of the two queues are compared continually: \(q_1\) dequeues its head element only when its key is smaller than that of \(q_0\), or when \(q_0\) is empty with no more data to be enqueued; otherwise \(q_0\) dequeues its head element. The dequeued element is sent to the corresponding saver by register communication. The saver stores the received data in its buffer and writes the data to main memory whenever the buffer is filled.
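This comparison rule can be written as a plain two-pointer merge (a host-side sketch of the ordering rule, reusing the kv_t element type from above; on the CPEs each output element is sent to the saver by register communication rather than stored directly):

    /* Stable merge of two sorted runs: q1's head is taken only when its key is
     * strictly smaller than q0's head or q0 is exhausted, so equal keys from
     * the earlier run B_{2i} always come out before those from B_{2i+1}. */
    void merge_two_runs(const kv_t *q0, size_t n0,
                        const kv_t *q1, size_t n1, kv_t *out) {
        size_t i = 0, j = 0, k = 0;
        while (i < n0 || j < n1) {
            if (j < n1 && (i == n0 || q1[j].key < q0[i].key))
                out[k++] = q1[j++];        /* strictly smaller key from q1 */
            else
                out[k++] = q0[i++];        /* ties go to q0 (stability)    */
        }
    }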

Algorithm 3 (pseudo-code figure)
Algorithm 4 (pseudo-code figure)

In the second step, we use the loader-worker-saver mode to divide the CPEs. Because the receiver of a register communication cannot identify the sender, a sender flag would have to be carried in the message if more than one sender were in the same row or column as the receiver. However, a message is only 128 bits long and a key-value pair occupies 64 bits, so adding a sender flag would allow only one pair per message and lower the efficiency. We therefore arrange the communication so that each CPE receives data from only one same-row CPE and one same-column CPE. For a stable sort, the carefully designed communication pattern ensures that the data received from the same-column CPE belongs to an earlier sequence than the data received from the same-row CPE. More specifically, as shown in Algorithm 4, the \(i^{th}\) worker of a level receives the previous level's \((2i)^{th}\) intermediate result from the same-column CPE and the \((2i+1)^{th}\) from the same-row CPE, and sends the merged data (the \(i^{th}\) intermediate result of this level) to the \(\lfloor i/2 \rfloor ^{th}\) CPE of the next level. Under this design, the data is sent to the same-column CPE when \(i \bmod 2 = 0\) and to the same-row CPE otherwise. Moreover, since the read buffer of register communication is itself a queue that is cleared after reading, no extra queues are needed for merging. Loaders and savers work as in the first step.
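The routing rule of this step amounts to simple index arithmetic, sketched below (an illustration of the scheme described above, not the paper's code):

    /* Worker i of a level merges the previous level's 2i-th result (received
     * from its same-column neighbour) with the (2i+1)-th result (received from
     * its same-row neighbour) and forwards the merged stream to worker i/2 of
     * the next level; even i send along the column, odd i along the row. */
    typedef struct { int next_worker; int send_via_column; } route_t;

    route_t route_of(int i) {
        route_t r;
        r.next_worker     = i / 2;          /* floor(i / 2)                    */
        r.send_via_column = (i % 2 == 0);   /* i mod 2 == 0 -> same-column CPE */
        return r;
    }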

4.3 Synchronization Among CPEs

Since the CPE senders and receivers perform register communication according to the index of the array segment to be written or read, and all communications are one-to-one, no explicit synchronization mechanism is required. Furthermore, because no sender flags are needed, the whole 128 bits of each message carry usable data. Our method can therefore make full use of the register communication bandwidth and improve communication performance.

5 Evaluation

5.1 Experiment Setup

Datasets. To evaluate the performance of our swGBDT, we use 6 datasets from LIBSVM Data [2] and 4 synthesized datasets named dataset1–4. The details of the datasets are shown in Table 1.

Table 1. The datasets for evaluation.

Evaluation Criteria. We conduct our experiments on a CG of the Sunway SW26010 processor. We compare the performance of swGBDT with a serial implementation on the MPE and parallel XGBoost [3] on the CPEs. The serial implementation is the naive implementation of our GBDT algorithm without using the CPEs. We port the popular open-source implementation of XGBoost for parallel execution on the CPEs (with the LDM used for better performance). In our experiments, we set the tree depth to 6 and the number of trees to 40. All experiments run in single precision.

5.2 Performance Analysis

We use the average training time of a tree for comparison and use the MPE version as the baseline. The results are shown in Fig. 7, Fig. 8 and Table 2; swGBDT is clearly the best on all datasets. Compared to the MPE version, swGBDT reaches an average speedup of 4.6\(\times \) and a maximum of 6.07\(\times \) on SUSY. Compared to XGBoost, it achieves an average speedup of 2\(\times \) and a maximum of 2.7\(\times \). The advantage of swGBDT comes from the CPE division, which reduces the memory access time.

Fig. 7. The performance of swGBDT and XGBoost on real-world datasets.

Fig. 8. The performance of swGBDT and XGBoost on synthesized datasets.

Table 2. The training results for swGBDT and XGBoost.

5.3 Roofline Model

In order to analyse the efficiency of our implementation, we apply the roofline model [22] to swGBDT on a CG of the Sunway processor. Given a dataset with m instances, n features and nnz non-zero values stored in CSC format, let \(n\_split\) be the number of possible splits in each training round. Let Q, W and I represent the amount of data accessed from memory, the number of floating point operations and the arithmetic intensity [23], respectively. The calculations of W, Q and I are shown in Eq. 4, 5 and 6, respectively.

$$\begin{aligned} W = 19.5*nnz + 37*n\_split \end{aligned}$$
(4)
$$\begin{aligned} Q = 22*nnz + 32.5*n\_split \end{aligned}$$
(5)
$$\begin{aligned} I = \dfrac{W}{Q \times 8\,\mathrm{bytes}} = 0.125 + \dfrac{1.8\,n\_split - nnz}{70.4\,nnz + 104\,n\_split} \end{aligned}$$
(6)

In our experiments, for most of the datasets \(n\_split\) is about 0.9 of nnz. In this situation \(I = 0.1288\), while the ridge point of the Sunway processor is 8.46, so the bottleneck of GBDT is clearly memory access. The version without memory access optimization (the MPE version) has \(I = 0.108\). Our optimization therefore increases the arithmetic intensity by about 20%.
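For reference, substituting \(n\_split = 0.9\,nnz\) into Eq. 6 reproduces the value \(I = 0.1288\) quoted above:

$$\begin{aligned} I = 0.125 + \dfrac{1.8 \times 0.9\,nnz - nnz}{70.4\,nnz + 104 \times 0.9\,nnz} = 0.125 + \dfrac{0.62}{164} \approx 0.1288 \end{aligned}$$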

5.4 Scalability

To achieve better scalability, when n CGs are used we divide the features evenly into n segments; the \(i^{th}\) CG only stores and processes the \(i^{th}\) feature segment. Each CG computes its own \(2^{depth}\) best splits, and then the \(2^{depth}\) best splits overall are determined, where depth is the current depth of the tree. As shown in Fig. 9, we use up to 4 CGs on a processor to evaluate the scalability of swGBDT. Compared to one CG, we reach average speedups of 8\(\times \), 11.5\(\times \) and 13\(\times \) when scaling to 2, 3 and 4 CGs, respectively.

6 Related Work

6.1 Acceleration for Gradient Boosted Decision Tree

To improve the performance of the GBDT algorithm, on one hand, some recent research modifies the GBDT algorithm itself for acceleration. LightGBM [9] accelerates the time-consuming gain estimation process by excluding instances with small gradients and bundling mutually exclusive features, which reduces computation. Later, Biau et al. [1] optimize GBDT by combining it with Nesterov's accelerated descent [15] for parameter updates. On the other hand, researchers have been porting GBDT to novel accelerators such as GPUs. Mitchell and Frank [14] move the tree construction of XGBoost [3] entirely to the GPU to reach higher performance. Besides, Wen et al. [20] develop GPU-GBDT, which improves the performance of GBDT through dynamic allocation, data reuse and run-length encoding compression. GPU-GBDT was further optimized into ThunderGBM [21] on multiple GPUs, which incorporates new techniques such as efficient search for attribute IDs and approximate split points. However, these implementations do not target the Sunway architecture, and no efficient GBDT implementation has been designed to leverage the unique architectural features of Sunway for better performance.

6.2 Machine Learning on Sunway Architecture

Many machine learning applications have been designed for the Sunway architecture since its appearance. Most previous work focuses on optimizing neural networks on Sunway. Fang et al. [5] implement swDNN, which accelerates convolutional neural networks (CNNs) on SW26010 through systematic optimization of loop organization, blocking mechanisms, communication and instruction pipelines. Later, Li et al. [10] introduce swCaffe, which is based on the popular CNN framework Caffe, and develop topology-aware optimizations for synchronization and I/O. Liu et al. [13] propose an end-to-end deep learning compiler on Sunway that supports ahead-of-time code generation and automatically optimizes tensor computation.

Fig. 9. The scalability of swGBDT.

Moreover, researchers have optimized numerical algorithms that are kernels of machine learning applications on the Sunway architecture. Liu et al. [12] adopt a multi-role assignment scheme for CPEs, a hierarchical partitioning strategy for matrices and a CPE cooperation scheme through register communication to optimize Sparse Matrix-Vector Multiplication (SpMV). The multi-role assignment and CPE communication schemes are also utilized by Li et al. [11], who develop an efficient sparse triangular solver (SpTRSV) for Sunway. Furthermore, Wang et al. [19] improve the performance of SpTRSV on Sunway through a producer-consumer pairing strategy and a novel Sparse Level Tile layout. These studies inspired our acceleration of the GBDT algorithm on the Sunway architecture.

7 Conclusion and Future Work

In this paper, we present swGBDT, an efficient GBDT implementation on the Sunway processor. We propose a partitioning method that divides the CPEs into multiple roles and partitions the input data into different granularities such as blocks and tiles to achieve better parallelism on Sunway. This partitioning scheme also mitigates the high latency of random memory access through data prefetching on the CPEs using DMA and register communication. The experimental results on both synthesized and real-world datasets demonstrate that swGBDT achieves better performance than the serial implementation on the MPE and parallel XGBoost on the CPEs, with average speedups of 4.6\(\times \) and 2\(\times \), respectively. In the future, we would like to extend swGBDT to run on CGs across multiple Sunway nodes in order to support the computation demand of GBDT at even larger scales.