1 Introduction

The rise of heterogeneous computing has brought new challenges, such as the integration of co-processor hardware in data centers and the development of new parallel programming paradigms [1]. One such example is the use of co-processors in High-Performance Computing (HPC), specifically Graphics Processing Units (GPUs). These devices not only offer higher performance than traditional CPUs, but are also considered more energy-efficient and have a higher peak performance density.

In order to effectively leverage the capabilities of these heterogeneous systems, new programming paradigms have been developed, such as CUDA [2] for NVIDIA GPUs, and OpenCL [3, 4], which supports a wide variety of devices, from multicore CPUs to DSPs, GPUs, and FPGAs. SYCL [5, 6] is an abstraction layer built on the efficiency and portability of OpenCL, and follows a single-source programming model to improve both programming productivity and performance portability [7, 8, 9].

Because SYCL is a higher-level programming model, there are numerous implementations available, such as ComputeCpp from Codeplay, triSYCL from Xilinx, hipSYCL (also known as Open SYCL) from Universität Heidelberg, and DPC++ from Intel. DPC++ is included in the oneAPI toolkit, an initiative that aims to provide a single API to program multiple devices. Moreover, oneAPI includes libraries optimized for multiple devices, such as oneMKL for linear algebra, oneDAL for machine learning, and oneDPL, which adds additional functionality to the language.

The rapid growth of data has spurred the use of data analysis techniques to extract common patterns. The adoption of machine learning methods has also motivated the development of optimized libraries to improve productivity, such as the open-source libraries Scikit-learn, Spark-MLlib, and CuML. In the realm of unsupervised methods, the k-means technique is one of the simplest yet most powerful ways to cluster a wide range of data sources. The popular k-means clustering algorithm groups objects by assigning each object to the cluster with the nearest mean. It has a wide range of applications in fields such as data mining, astronomy, medicine, and information retrieval [10, 11].

Although portability is crucial for using the wide variety of devices available on the market, there is currently no efficient and portable solution that allows the adoption of k-means in a real-world context. Existing libraries such as Scikit-learn are not fully competitive because they do not support accelerators, while libraries that do support heterogeneous computing, such as CuML or oneDAL, are tied to a single vendor. In this paper, we evaluate a new implementation of the k-means algorithm developed in SYCL on various GPU and CPU devices from different vendors (NVIDIA and Intel). In addition to studying portability, we also show how the algorithm must be adapted to each device type. A comparative study of the performance, the exploitation of computational resources, and the optimization of the memory hierarchy is also carried out.

The rest of the paper is organized as follows. Section 2 presents relevant background on SYCL and the state of the art of k-means implementations. Section 3 describes the main aspects of the k-means algorithm, and Sect. 4 details how the algorithm was developed. Section 5 describes the process of porting CUDA code to SYCL and introduces the SYCL implementations. In Sect. 6, we describe the main experimental conditions, such as the work environment, datasets, and code parameters used. Section 7 discusses the main results obtained. Finally, Sect. 8 summarizes the conclusions reached.

2 Background

This section is divided into two parts. In the first part, we introduce SYCL and discuss the main available implementations. In the second part, we review the current state of the art in k-means clustering.

2.1 SYCL in a nutshell

SYCL is a standard (SYCL 2020) developed and maintained by the Khronos Group, similar to other standards such as OpenMP (e.g. 4.5, 5.1) or OpenCL (e.g. 2.1, 3.0) [5, 12, 13]. Its main purpose is to enable developers to use any ISO C++ compiler (e.g. GCC, Clang, NVCC, ICC) supporting at least C++17, and to use C++ lambdas to encapsulate device kernels. SYCL does not aim to replace other parallel models or backends (e.g. CUDA, HIP, OpenCL) but rather to complement them. Since all these models are C++-compatible, SYCL uses C++ lambdas to extend the native API of different backends. For instance, when allocating memory on an NVIDIA GPU, a SYCL memory allocation automatically triggers a native CUDA allocation in the background. SYCL can thus be seen as an instance of the facade design pattern, serving as a front-facing interface to other backends [14].
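As a minimal illustration of this facade behavior (a sketch of ours, not code from any of the implementations discussed), the following SYCL snippet allocates and fills device memory through Unified Shared Memory; on an NVIDIA GPU driven by a CUDA backend, the allocation maps to a native CUDA allocation underneath:

    #include <sycl/sycl.hpp>

    int main() {
      // Select a GPU; with DPC++ on an NVIDIA device this goes through the CUDA backend
      sycl::queue q{sycl::gpu_selector_v};
      // USM device allocation: the runtime issues the backend's native allocation
      float *data = sycl::malloc_device<float>(1024, q);
      q.fill(data, 0.0f, 1024).wait();  // executed natively on the selected device
      sycl::free(data, q);
      return 0;
    }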

Up to this point we have addressed only the SYCL standard; however, it is crucial to recognize that SYCL does not have a single implementation. In this paper, we tested two SYCL 2020 implementations. The most feature-rich implementation is Intel DPC++, which not only conforms to the SYCL 2020 standard but also includes other custom features [15]. DPC++ is a compiler-based implementation, meaning that it is integrated with the new Intel compiler, which is based on and forked from the Clang/LLVM project. It is important to remark that the DPC++ compiler is open source, although Intel also offers a commercial alternative available in the oneAPI toolkits, which include additional tools such as profilers and optimized libraries. While custom features are present in DPC++, we opted not to employ them in our study, with the aim of ensuring the portability of our developments.

The DPC++ compiler is responsible for compiling SYCL code and performs the usual front-end, middle-end, and back-end stages. In the front-end phase, the compiler separates the host code from the device code, while in the middle-end phase it transforms the device code into an intermediate representation known as LLVM IR. The back-end stage then compiles the LLVM IR into the device’s native code and combines everything into a final file called a “fat binary”. This compiler can generate a final binary that can run on multiple devices, including multi-vendor GPUs or even FPGAs, as described in [16, 17].

The other noteworthy implementation is Open SYCL, also known as hipSYCL, which is library-based: it provides a C++ library and relies on third-party compilers. It is generally recommended to use it with the regular Clang/LLVM compiler, which was designed to support CUDA, HIP, OpenMP, and OpenCL source code [18, 19]. However, hipSYCL also supports other compilers such as NVIDIA nvc++; this limits the targets to CPUs and NVIDIA GPUs, but allows NVIDIA GPUs to be driven by their native NVIDIA compiler [17].

2.2 K-means related work

Focusing on k-means for heterogeneous computing, it is worth mentioning the inclusion of the k-means algorithm in the Rodinia benchmark collection [20], which compiles implementations written in CUDA, OpenMP, and OpenCL. Recently, the paper [21] extended the analysis of the performance and portability of k-means by migrating an OpenCL implementation to a SYCL implementation, using both Intel CPUs and integrated GPUs. Although this work concludes that the SYCL version significantly improves productivity, with a 51% reduction in the number of source code lines, and achieves performance similar to the Rodinia CUDA counterpart, its performance is significantly lower than that of other optimized CUDA-based implementations [22, 23].

Furthermore, the “YinYang” version [24] optimizes the naive k-means algorithm by reducing the number of clusters examined in the first phase of the algorithm. This solution showed excellent results on multi-core CPUs. Later, the multi-core k-means (MKM) proposal by Böhm et al. [25] outperformed other proposals by pinning data blocks to CPU cores, parallelizing outer loops with OpenMP, and vectorizing the inner loop with intrinsic SIMD instructions. In the GPU scope, Kruliš et al. [22] presented a CUDA solution that addresses many aspects of the CUDA architecture, such as memory bandwidth, cache limits, and workload dispatching. The Kruliš et al. CUDA k-means implementation can be considered the state of the art.

From a heterogeneity perspective, He-Vialle et al. [23, 26] present a fine-tuned k-means algorithm for both CPU and GPU architectures. In this paper, we use He-Vialle’s implementation as a reference for two main reasons: (1) it provides implementations for heterogeneous computing, based on OpenMP for multicore CPUs and CUDA for GPUs, and (2) on NVIDIA GPUs it offers performance equivalent to Kruliš’s implementation, and even slightly better in some cases [23].

3 The k-means algorithm

The k-means algorithm is a widely used clustering method that computes the Euclidean distance as a heuristic to group the input samples. The main steps of the algorithm are summarized in Algorithm 1. The k-means algorithm takes an input dataset, where each entry is considered a point in an n-dimensional space. The algorithm initially selects the k centroids using a heuristic or a random selection [27].

Algorithm 1 The k-means algorithm

The iterative process of the k-means algorithm consists of two phases. In the first phase, each n-dimensional point is classified by computing the Euclidean distance to every cluster centroid over all dimensions; each point is then assigned to the cluster that minimizes this distance. In the second phase, the centroid values are recalculated based on the histogram of cluster assignments from the previous step. In more detail, the per-dimension sums of the points assigned to each cluster are computed, and the new centroids are obtained by taking the mean values. For the sake of clarity, the main phases (1) and (2) of Algorithm 1 are referred to as assign clusters and update, respectively, in the rest of the paper.
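To make the two phases concrete, the following is a minimal serial C++ sketch of one iteration (an illustration of ours using an AoS layout and illustrative names; it is not the code of any implementation discussed in this paper):

    #include <cstddef>
    #include <limits>

    // One k-means iteration: (1) assign clusters, (2) update centroids.
    void kmeans_iteration(const float *pts, float *cent, int *label,
                          std::size_t n, std::size_t nd, std::size_t k,
                          float *sums, unsigned *counts) {
      // (1) assign clusters: nearest centroid by squared Euclidean distance
      for (std::size_t i = 0; i < n; ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t c = 0; c < k; ++c) {
          float dist = 0.0f;
          for (std::size_t d = 0; d < nd; ++d) {
            const float diff = pts[i * nd + d] - cent[c * nd + d];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; label[i] = (int)c; }
        }
      }
      // (2) update: per-cluster sums, then means become the new centroids
      for (std::size_t c = 0; c < k * nd; ++c) sums[c] = 0.0f;
      for (std::size_t c = 0; c < k; ++c) counts[c] = 0;
      for (std::size_t i = 0; i < n; ++i) {
        ++counts[label[i]];
        for (std::size_t d = 0; d < nd; ++d)
          sums[label[i] * nd + d] += pts[i * nd + d];
      }
      for (std::size_t c = 0; c < k; ++c)
        if (counts[c] > 0)
          for (std::size_t d = 0; d < nd; ++d)
            cent[c * nd + d] = sums[c * nd + d] / counts[c];
    }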

The algorithmic complexity of the k-means algorithm is \(O(n \times n_d \times k)\), where n corresponds to the number of points or elements with \(n_d\) dimensions to be classified into the k clusters. Note that n is the dominant term, since typically \(n_d \ll n\) and \(k \ll n\).

4 Implementation

4.1 General overview

As introduced in Sect. 2, the He-Vialle implementation [23] can be considered one of the most efficient k-means solutions for heterogeneous computing. This implementation comprises two versions: (1) an OpenMP-based code for CPU devices, and (2) a CUDA-based program for NVIDIA accelerators. These versions have notable coding differences in order to exploit the main architectural features of the target devices. We have chosen this implementation as the starting point for our SYCL port because of its good performance on heterogeneous architectures. It is worth noting that, although the resulting SYCL implementation can run on both CPU and GPU architectures, its performance varies significantly between the two device types. The porting process must take into account the main hardware differences, restrictions, and features of each device (e.g., memory hierarchy or SIMD instructions) in order to optimize resource usage.

The first noticeable disparity between the implementations concerns the memory layout, which can be summarized as AoS versus SoA:

  • AoS, or Array-of-Structures, where the rows of the input matrix represent points of the n-dimensional space, while the columns hold the dimensional values.

  • SoA, or Structure-of-Arrays, an inverted layout where the rows are tied to the dimensional values and the columns represent the points.

It is remarkable that, while the AoS scheme successfully exploits the hardware features of CPUs, SoA fits the CUDA counterpart better, just as in He-Vialle’s implementation.
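In terms of flat arrays, the two layouts reduce to the following indexing (a sketch of ours; the function names are illustrative):

    #include <cstddef>

    // n points, nd dimensions each; point i, coordinate d.
    float coord_aos(const float *data, std::size_t n, std::size_t nd,
                    std::size_t i, std::size_t d) {
      (void)n;
      return data[i * nd + d];  // AoS: one point's coordinates are contiguous
    }                           //      (CPU-friendly, easy to vectorize per point)

    float coord_soa(const float *data, std::size_t n, std::size_t nd,
                    std::size_t i, std::size_t d) {
      (void)nd;
      return data[d * n + i];   // SoA: coordinate d of consecutive points is
    }                           //      contiguous (coalesced GPU accesses)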

The other major difference lies in the way the assign clusters and update kernels are computed on CPU and GPU devices. These aspects are detailed in the following sections.

As expected, the performance of the GPU and CPU versions is not uniform. The SYCL code ported from the CUDA version, which runs well on NVIDIA GPUs, exhibits reduced performance on Intel GPUs due to architectural differences. On an NVIDIA GPU, the CUDA core serves as the low-level compute unit, while on the Xe architecture (the commercial name of the Intel GPU architecture), the Execution Unit (EU) resembles a conventional CPU core, with features such as branch predictors, simultaneous multithreading (SMT), and SIMD capabilities [28].

Although the main adaptations of the SYCL codes are presented in the next subsections, we would also like to highlight that the source code developed in this research is freely available in a public repository.

4.2 Assign cluster routine

The assign clusters routine corresponds to the first step in the iterative process. It entails classifying each point into the best matching cluster. The most common metric used to evaluate the similarity between data points and centroids is the pairwise Euclidean distance; the minimum distance determines the cluster assignment.

Listing 1 shows the SYCL code of the assign clusters method for NVIDIA GPUs. This routine corresponds to the straightforward port of the original He-Vialle implementation (see Sect. 4). Lines 1–2 define the number of blocks and the number of threads per block. Line 4 submits the kernel to the queue previously associated with the target device (e.g. an NVIDIA GPU). Line 12 distributes the workload among the blocks and threads of the selected device. The remaining lines are the kernel code. Since this implementation uses the SoA memory layout, the intrinsic data parallelism is automatically exploited by warps (line 24) through the Single Instruction Multiple Threads (SIMT) model.

Listing 1 SYCL code of the assign clusters method for NVIDIA GPUs
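For readers without access to the repository, the following is a simplified reconstruction of such a kernel (our sketch, not the verbatim Listing 1, so the line numbers cited above do not map onto it; all names are illustrative):

    #include <limits>
    #include <sycl/sycl.hpp>

    // Assign-clusters kernel in SYCL with a SoA layout.
    void assign_clusters(sycl::queue &q, const float *points,
                         const float *centroids, int *labels,
                         size_t n, size_t dims, size_t k) {
      const size_t wg = 128;                           // threads per work-group (block)
      const size_t global = ((n + wg - 1) / wg) * wg;  // padded global range
      q.parallel_for(sycl::nd_range<1>{global, wg}, [=](sycl::nd_item<1> it) {
        const size_t i = it.get_global_id(0);
        if (i >= n) return;
        float best = std::numeric_limits<float>::max();
        int best_c = 0;
        for (size_t c = 0; c < k; ++c) {
          float dist = 0.0f;
          for (size_t d = 0; d < dims; ++d) {
            // SoA: consecutive work-items read consecutive points (coalesced)
            const float diff = points[d * n + i] - centroids[d * k + c];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; best_c = (int)c; }
        }
        labels[i] = best_c;
      });
    }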

Figure 1 summarizes the SYCL execution flow of the assign clusters implementation for both Intel CPUs and Intel GPUs. Note that the original He-Vialle CPU version, based on OpenMP, distributes the workload among the CPU threads so that each thread computes a single input data point. Our SYCL version, in contrast, distributes groups of consecutive points (packages) among the threads to exploit the memory bandwidth efficiently. This particular improvement significantly increases performance on both CPU and GPU devices. A second remarkable difference from the original He-Vialle implementation lies in explicit SIMD exploitation. Whereas in the CUDA version SIMD/SIMT exploitation is generated directly by the compiler by means of warps, the SYCL compiler is not able to do this automatically. We would like to point out that the AoS scheme can exploit vectorization across the input data dimensions. For the explicit vectorization, SYCL defines a special type associated with SIMD lanes (variables v_pts, v_mean and result in step 2). This type requires the SIMD width (1, 2, 4, 8 or 16) and the data type. It is worth remarking that this hand-tuned vectorization implies some extra code: first, the bulk of the positions is processed in SIMD fashion (step 2), and then the remaining iterations are computed in scalar form (the last for loop in step 3). In order to increase the instruction-level parallelism (ILP), the common technique of blocking is applied: the main loop is split by a factor B, where B is determined by the microarchitecture throughput (the inner for loop in step 2 of Fig. 1). In particular, the optimal configuration for CPUs and Intel GPUs is \(B=2\) [29].

Fig. 1 Diagram of the assign clusters kernel for CPU and Intel GPUs
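The explicit vectorization pattern just described can be sketched as follows (an illustration of ours with an assumed width of 8; v_pts, v_mean and result mirror the names in Fig. 1, the remaining names are ours):

    // Squared distance between one point and one centroid using sycl::vec.
    inline float squared_distance_vec(const float *point, const float *centroid,
                                      size_t dims) {
      constexpr int W = 8;                        // SIMD width: 1, 2, 4, 8 or 16
      sycl::vec<float, W> result(0.0f);           // per-lane partial sums
      size_t d = 0;
      for (; d + W <= dims; d += W) {             // step 2: SIMD body
        sycl::vec<float, W> v_pts, v_mean;
        for (int l = 0; l < W; ++l) {             // gather W lanes
          v_pts[l]  = point[d + l];
          v_mean[l] = centroid[d + l];
        }
        const sycl::vec<float, W> diff = v_pts - v_mean;
        result += diff * diff;
      }
      float dist = 0.0f;
      for (int l = 0; l < W; ++l) dist += result[l];  // horizontal sum
      for (; d < dims; ++d) {                     // step 3: scalar remainder
        const float diff = point[d] - centroid[d];
        dist += diff * diff;
      }
      return dist;
    }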

4.3 Update routine

The second k-means stage consists of updating the clusters with the newly calculated assignments. For this purpose, the algorithm computes the average of the points assigned to the same cluster. This process is commonly broken into two sub-steps to improve efficiency. The first sub-step can be seen as a reduction that computes the sum of all the points belonging to each cluster. The second sub-step, calculate-mean, performs a simple mean for each cluster; the result becomes the new centroids for the next iteration. Since the calculate-mean step is trivial to code and presents no performance issues, it is not detailed in this paper, although it can be seen in the public repository.
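For completeness, a minimal sketch of what the calculate-mean sub-step amounts to (our illustration over a SoA layout; sums, counts and centroids are illustrative names, not the repository code):

    // Every (cluster, dimension) pair divides its accumulated sum by the
    // cluster population to produce the new centroid coordinate.
    void calculate_mean(sycl::queue &q, const float *sums,
                        const unsigned *counts, float *centroids,
                        size_t k, size_t dims) {
      q.parallel_for(sycl::range<2>{k, dims}, [=](sycl::id<2> idx) {
        const size_t c = idx[0], d = idx[1];
        if (counts[c] > 0)
          centroids[d * k + c] = sums[d * k + c] / counts[c];
      }).wait();
    }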

Regarding the reduction kernel, our code was adapted and later ported to SYCL. The original He-Vialle implementation uses dynamic/nested parallelism for the update routine. However, this scheme is not fully supported by SYCL. Although this dynamic parallelism can be approximated with SYCL’s hierarchical parallelism feature, the resulting performance is much lower than with the common “parallel_for” scheme [15]. Consequently, we developed an adapted CUDA version of He-Vialle’s code that removes and flattens the nested parallelism, and this adapted CUDA version was finally ported to SYCL. Apart from this aspect, the rest of the SYCL code computes k-means in the same way as He-Vialle’s implementation.
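To illustrate the two schemes, the following fragments (ours, with q, n_groups and wg assumed to be defined elsewhere) contrast SYCL hierarchical parallelism with the flat scheme we adopted:

    // Hierarchical parallelism: nested work-group / work-item scopes, the
    // closest SYCL analogue to CUDA dynamic parallelism (slower in practice).
    q.submit([&](sycl::handler &h) {
      h.parallel_for_work_group(sycl::range<1>{n_groups}, sycl::range<1>{wg},
                                [=](sycl::group<1> g) {
        g.parallel_for_work_item([&](sycl::h_item<1> it) {
          // per-work-item body
        });
      });
    });

    // Flattened alternative used in our port: a single plain index space.
    q.parallel_for(sycl::nd_range<1>{n_groups * wg, wg},
                   [=](sycl::nd_item<1> it) {
      // same body, indexed by it.get_global_id(0)
    });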

Focusing on the reduction kernel, all the implementations developed are based on package execution, in the same way as the assign clusters kernel. Each thread computes the reduction for a set of points of similar size and synchronizes with the other threads to obtain the final reduction. The way the threads synchronize differs depending on the chosen target. In the CUDA version, the threads of a block first load data into shared memory, then a certain number of threads perform private reductions, and finally the reduction value is updated globally by means of the atomicAdd instruction.

In contrast, the CPU implementation is based on two stages: (1) summing private package points and then (2) performing a global synchronization. Additionally, the Intel GPU implementations include an extra step [30]. Figure 2 reflects the Intel GPU kernel, which follows the same scheme as on the CPU devices. It starts by (1) packaging points and spreading them to each thread, loading data into the local memory (shared by all work-items in a work-group); then the reduction is performed in three steps to avoid unnecessary synchronizations: (2) a private reduction, in which work-items do not interact with each other; (3) a work-group reduction, in which the work-items of a work-group perform a tree reduction of all private reductions; and finally, (4) the first work-item of each work-group globally merges the work-group reduction with an atomic_fetch_add. Moreover, the sum operation was implemented with explicit vectorization. More details can be seen in the code repository.
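A condensed sketch of this pattern is shown below (ours, reducing a single scalar for brevity, whereas the real kernel accumulates per-cluster, per-dimension sums; we also use the SYCL 2020 group algorithm reduce_over_group in place of the hand-written tree reduction of step 3):

    #include <sycl/sycl.hpp>

    void reduce_sum(sycl::queue &q, const float *values, float *global_sum,
                    size_t n, size_t pkg, size_t wg_size) {
      const size_t items = (n + pkg - 1) / pkg;  // one package per work-item
      const size_t global = ((items + wg_size - 1) / wg_size) * wg_size;
      q.parallel_for(sycl::nd_range<1>{global, wg_size},
                     [=](sycl::nd_item<1> it) {
        const size_t i = it.get_global_id(0);
        // (1)-(2) private reduction over this work-item's package of points
        float priv = 0.0f;
        size_t end = (i + 1) * pkg;
        if (end > n) end = n;
        for (size_t p = i * pkg; p < end; ++p) priv += values[p];
        // (3) reduction of all private values within the work-group
        const float group_sum =
            sycl::reduce_over_group(it.get_group(), priv, sycl::plus<float>());
        // (4) the first work-item of each work-group merges globally
        if (it.get_group().leader()) {
          sycl::atomic_ref<float, sycl::memory_order::relaxed,
                           sycl::memory_scope::device,
                           sycl::access::address_space::global_space>
              sum(*global_sum);
          sum.fetch_add(group_sum);
        }
      });
    }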

It should be noted that, while some state-of-the-art reduction proposals rely on tree-based reduction steps to avoid the overhead associated with atomic operations when the first cache levels are not fully coherent, this is not necessarily the best approach on Intel architectures. A comparative analysis of several reduction proposals can be found in [30], including those based on atomic operations and tree-based approaches, as well as the most efficient proposal, referred to as multiblock-interleaved-vector, which we adapt for use in this research.

Fig. 2 Diagram of the reduction kernel for CPU and Intel GPUs

5 Adapting He-Vialle’s implementation to SYCL

The He-Vialle implementation defines different parallel schemes to run on either CPUs (via OpenMP) or CUDA architectures. Rather than completely rewriting the implementation, we undertook the task of porting their version to SYCL while striving to retain its core design as far as feasible.

The porting process starts by creating a common baseline code, which minimizes the code divergence among the different devices. This baseline version is in charge of parsing input arguments, loading the input dataset, and measuring execution times. Since He-Vialle’s code is tightly tied to their implementation, we decided to refactor and adapt it to create the baseline counterpart.

At this point, we could have automatically ported the CUDA code to SYCL. However, in the current state of SYCL, and because of the use of dynamic parallelism, the code is not suitable for direct porting. Dynamic parallelism can be thought of as nested parallelism, where each thread is able to launch other threads. Since SYCL does not support dynamic parallelism, we could try to approximate it using SYCL hierarchical parallelism, but the resulting performance is severely degraded [31]. Hence, instead of trying to find an equivalent SYCL feature to fill the gap, which would probably introduce implementation differences, we simply decided to re-adapt the original CUDA code and flatten the parallelism. The resulting code, which we call Adapted He-Vialle from now on, is still CUDA-based but removes the original dynamic parallelism feature. The Adapted He-Vialle version can now be ported to SYCL, and it also serves as a baseline for the comparison between the original He-Vialle and SYCL implementations.

To port the Adapted He-Vialle kernels to SYCL, we used the SYCLomatic tool available in the oneAPI toolkit [32].

Additionally, we also ported He-Vialle’s CPU implementation, written in OpenMP, to SYCL. However, this code was rewritten manually, since no porting tool from OpenMP to SYCL exists. The rewrite expresses the parallelism as in OpenMP, but includes other exploitation features such as breaking the workload into packages and introducing explicit vectorization. For the Intel GPU devices, we adapted the CPU code in order to obtain the maximum performance (see Sect. 4).

Once we had developed a well-optimized SYCL implementation for each architecture, we proceeded to derive an implementation that balances code performance with portability. The goal of this implementation (named “SYCL-port” in the text) was not to attain the highest possible performance, but rather to rank second across all tested architectures. More code details can be found in the project repository.

Table 1 sums up the four k-means implementations that will be tested. He-Vialle’s OpenMP version runs on CPUs, while their GPU implementation, written in CUDA, only works on NVIDIA GPUs. The Adapted He-Vialle version was derived from the original by removing the dynamic parallelism feature, and serves to measure the performance degradation with respect to the original code. Concerning the SYCL version, the code runs on CPUs (SYCL-CPU), Intel GPUs (SYCL-iGPU), and NVIDIA GPUs (SYCL-nGPU), but the kernels differ: the CPU kernels were adapted from OpenMP, the Intel GPU kernels were adapted from the SYCL CPU version, and the NVIDIA GPU kernels were ported from Adapted He-Vialle with the SYCLomatic tool. Writing custom kernels reduces the portability of the code (\(\approx 70\%\) of the code remains unchanged) but increases the performance on each architecture. In addition, we have included the “SYCL-port” version mentioned earlier, which enhances the portability of the application but may result in reduced performance. Lastly, we would like to point out that even though we developed optimized kernels for each device, all the SYCL versions can run on all the devices mentioned (e.g. SYCL-nGPU can run on the CPU or the Intel GPU and vice versa).

Table 1 K-means implementations used in the experimentation. SYCL-CPU stands for the SYCL implementation optimized for the CPU, while SYCL-nGPU and SYCL-iGPU refer to the versions optimized for NVIDIA and Intel GPUs

6 Results

6.1 Work environment

Regarding the devices used, Table 2 summarizes their main features. The Intel i7-10700 is the multicore CPU used to run the corresponding simulations, while the Intel Arc A770 and the NVIDIA RTX 3090 are the selected GPUs, with peak performances of 16.7 and 38.6 TFLOPS, respectively.

Table 2 Technical specifications of the work environment used

6.2 Environment configuration

Table 3 displays the main configuration with regard to the software and compilers employed, all running under Ubuntu 20.04. First, to run on the NVIDIA GPU, we used CUDA 11.7 for the nvcc compiler, NVIDIA HPC SDK 11.7 for the nvc++ compiler, and LLVM 11.4 to compile with Clang. On the other side, we used the icpx and dpcpp compilers included in oneAPI 2022.3 to run on the CPU and the Intel GPU. Finally, to run SYCL on NVIDIA GPUs, we used the non-commercial version of the dpcpp compiler (Intel Clang/LLVM). We also included hipSYCL, which supports the nvc++ and Clang compilers as toolchains. We would finally like to remark that both Intel LLVM and hipSYCL use CUDA 11.7 and LLVM 11.4 underneath. Throughout the experimentation phase, the -O3 compiler flag was set.

Execution time is the metric used to evaluate the implementations developed. In order to gain valuable insight into potential issues in the code, we used the profilers Intel Advisor for the CPU and Intel VTune for the Arc GPU, both included in the oneAPI toolkit. When profiling NVIDIA GPUs, we selected the NVIDIA Nsight tools, available in the CUDA toolkit.

Table 3 Compilers used for each implementation. Intel Clang is the open-source version of the dpcpp compiler included in oneAPI

6.3 Dataset description

The experiments were carried out with two datasets. The first is synthetic and was created with a script found in the repository. This dataset has 1 M random points and 500 dimensions per point. Its purpose is to guide the tuning of the optimal parameters described in the following sections (package size and SIMD width).

In order to have a real-world dataset for the final experiment, we chose the US Census 1990 dataset from the UCI Machine Learning Repository. This dataset contains 2,458,285 instances with 68 categorical attributes, where the instances are the points and the attributes are the dimensions in our model. The original dataset (USCensus1990raw) collects data from the 1990 USA census, but the final one is a simplified and discretized version of the raw data.

6.4 Choosing kernel package size

Sections 4.2 and 4.3 described the implementation of both the assign clusters and update kernels as well as their main programming features. In this section, we present the tuning of parameters such as the package size and the corresponding performance impact. The package size parameter refers to the number of points of the dataset that will be processed by each work-item. To determine the optimal parameters, each experiment was repeated ten times to account for variability. This allowed us to verify that the standard deviation was negligible, so variability is not shown in the charts.

Regarding NVIDIA GPUs and the three implementations considered, we would like to point out that the only kernel that supports packaging is the update kernel. Regarding the He-Vialle [23] implementation, we can confirm that the same package sizes achieve the best performance in our particular work environment (128 points per package, and four dimensions per package). Likewise, a package size of one hundred also fits the He-Vialle CPU implementation well. Hence, we decided to keep the original configuration figures.

In addition, Fig. 3 shows the performance evolution of the assign clusters kernel when increasing the number of packages in the SYCL CPU version. The experiment uses as input a dataset of 1 M points with one hundred dimensions, clustered into 256 clusters with five iterations until convergence. We have noticed that the optimal number of packages is proportional to the number of cores available in the CPU. The best performance corresponds to a selection of sixteen packages. In more detail, increasing the number of packages beyond sixteen has no effect on the reduction kernel. According to these results, the best configuration on the CPU devices is sixteen packages for the assign clusters kernel and sixteen for the reduction kernel.

Fig. 3 Run-time (s) by varying the number of packages in the SYCL CPU assign clusters kernel. Variation was less than 0.01

Concerning the Intel GPU configuration, Fig. 4 shows the behaviour when varying the number of packages in the reduction kernel. For this device type, the number of packages was set according to the number of subslices available in the GPU. Note that “subslice” is Intel argot for a group of EUs that share memory and synchronization mechanisms. As can be seen in Fig. 4, increasing the number of packages above thirty-two reduces performance. In consonance with the CPU behaviour, the assign clusters kernel improves as the number of packages increases. Hence, the best configuration on the Arc GPU corresponds to thirty-two packages for the reduction kernel and one hundred and twenty-eight for the assign clusters kernel.

Fig. 4 Run-time (s) by varying the number of packages in the SYCL Intel GPU update kernel. Variation was less than 0.001

6.5 Choosing kernel SIMD width

This subsection evaluates the impact of the SIMD width on the Intel CPU and GPU. We only consider SYCL in this study because it is the only programming model used that supports an explicit vectorization scheme. Since the other implementations use automatic vectorization, no hand-tuned configurations are applicable.

Figure 5 presents the impact of the SIMD width on CPU devices, varying the number of dimensions from 10 to 500 in the assign clusters kernel. We can observe that increasing the dimensions makes a greater degree of parallelism available, so exploiting the SIMD width yields higher speedups. However, we do not observe a significant difference when using the SIMD feature in the reduction kernel. Thus, we set the SIMD vector width to eight, which achieves the best execution times.

Fig. 5 Run-time (s) by varying the number of dimensions and SIMD width in the SYCL CPU assign clusters kernel. Variation was less than 0.1

Analogously, Fig. 6 shows the effect of the SIMD width on the Intel GPU, whose maximum SIMD length is eight. In this experiment, we noticed that increasing the SIMD lanes above two does not yield a significant improvement. Although the scalability is limited, we select a four-wide SIMD configuration as the best one.

Fig. 6 Run-time (s) by varying the number of dimensions and SIMD width in the SYCL Intel GPU assign clusters kernel. Variation was less than 0.01

Finally, even though we do not show the results for the update kernel, we would like to point out that its scalability does not change greatly when increasing the SIMD width, since the main bottleneck is related to the synchronization stage. Hence, the SIMD width was set to eight lanes for the CPU and four lanes for the GPU, which offers slightly (\(\approx 2\%\)) better performance.

6.6 Adjusting architectural parameters for “SYCL-port” version

This section outlines the process of selecting certain parameters for ad-hoc tuning in relation to specific architectural features. We evaluated the optimal configuration for both the assign clusters and update kernels.

First of all, it is important to note the impact of the work-item and work-group configuration, as mentioned in a previous section (see Sect. 4.3), due to resource sharing at the work-group level. To maximize the use of each architecture’s resources, it is essential to choose the most suitable work-group size. Therefore, we discuss the process of selecting an optimal work-group size in detail.

Figure 7 shows the results of executing the update kernel using a synthetic dataset containing one million points, one hundred dimensions, and two hundred and fifty-six clusters. When running the kernel on the CPU, we observed that increasing the group size above eight significantly reduced performance. This is because the CPU used has eight cores, and work-groups are assigned to threads mapped onto cores. As a result, increasing the number of threads per core places greater stress on the physical resources available.

From a GPU perspective, it is important to note the architectural disparity between NVIDIA and Intel GPUs. Whereas on an NVIDIA GPU the execution units or CUDA cores are grouped into Streaming Multiprocessors or SMs (128 CUDA cores per SM in the Ampere architecture [33]), on an Intel Arc GPU [28] the execution units (EUs), each with 8\(\times\) fp32 SIMD capacity, are grouped into sub-slices. Figure 7 highlights these aspects: the best execution times on the RTX are observed for a group size of 128 (128 CUDA cores per SM), while on the Arc A770 the most efficient choice is also a group size of 128 (16 EUs \(\times\) 8 fp32 SIMD lanes). We would like to emphasize that the configuration that delivers optimal performance can be readily automated for each architecture, taking the aforementioned features into account.
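Such automation could be driven by device queries, as in the following sketch (ours; the heuristic constants mirror the values found experimentally for our devices and are otherwise illustrative):

    #include <algorithm>
    #include <sycl/sycl.hpp>

    size_t pick_work_group_size(const sycl::device &dev) {
      const size_t max_wg =
          dev.get_info<sycl::info::device::max_work_group_size>();
      if (dev.is_cpu())
        return std::min<size_t>(8, max_wg);   // one work-group per core (8-core i7-10700)
      return std::min<size_t>(128, max_wg);   // matches the SM width (RTX 3090) and
                                              // 16 EUs x 8 fp32 lanes (Arc A770)
    }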

We did not include the assign clusters kernel in the results, as its numbers did not differ significantly from those of the update kernel. Therefore, we used the same work-group size selected for the update kernel in our subsequent experiments.

Fig. 7 K-means update kernel by varying the work-group size on the i7-10700, Arc A770 and RTX 3090

7 Discussion

This section presents the results of the implementations and discusses the main issues found. For the following results, we fixed the number of convergence iterations to 20 for the CPU and the Intel GPU, while for the NVIDIA GPU it was fixed to 100, given the brief execution time it takes to finish. To obtain the results, we repeated each experiment ten times; the bars displayed represent the sample average.

7.1 Comparing with He-Vialle’s implementation

This section compares the performance of the original He-Vialle implementation against our SYCL implementations optimized for the CPU and the NVIDIA GPU.

Figure 8 displays the execution times achieved on the Intel Core i7 CPU. The simulation was repeated multiple times, increasing the number of desired clusters from 4 to 256. We would like to emphasize that our SYCL implementation (SYCL-CPU) outperforms or performs similarly to the He-Vialle counterpart for clusters of size 4, 64, and 256 (Intel LLVM and hipSYCL), but falls short for clusters of size 16. The main factor contributing to the improved performance of the SYCL code is the use of explicit vectorization and of additional packages to process the input data in the assign clusters kernel. However, running the code compiled with hipSYCL yields poor results. This can be attributed to the fact that hipSYCL translates SYCL code into OpenMP, as opposed to Intel LLVM, which translates it into OpenCL [34]. Consequently, the overall performance is considerably degraded, particularly for larger cluster counts.

Fig. 8 Run-time (s) by each implementation and compiler on the Intel Core i7-10700

On the NVIDIA GPU side, Fig. 9 shows the results obtained by running the CUDA and SYCL implementations. For the sake of clarity, the chart’s x-axis is set to log scale. The first notable observation is that the performance of our “Adapted He-Vialle” code is similar to the original He-Vialle’s. However, there is a slight difference between them, with “Adapted He-Vialle” performing on average 7.7% worse. When analyzing the results, we notice that the assign clusters kernel performs similarly in both versions, but the reduction kernel degrades the final results in the CUDA adaptation. This suggests that the overhead of developing a SYCL version is related to the lack of support for dynamic parallelism in SYCL.

To investigate this further, we compiled the code with the nvc++ and clang compilers; since neither of them supports dynamic parallelism, we only present results for the “Adapted He-Vialle” implementation. With clang, the performance remains similar to nvcc. However, nvc++ greatly reduces the performance. This is likely due to lower cache hit rates and increased cache access time, which is discussed in depth in Sect. 7.3.

Fig. 9 Run-time (s) by each CUDA or SYCL implementation and compiler on the NVIDIA RTX 3090

Regarding the SYCL implementation (SYCL-nGPU), it presents a severe downgrade with respect to the “Adapted He-Vialle” version. A deep analysis using the NVIDIA profiler reveals that the degradation is produced by issues observed in the assign clusters kernel. The lower cache hit rate has a direct impact on the application’s overall performance: a higher cache hit rate means the processing unit has to wait fewer cycles to receive the requested data [35]. We found that the cache hit rate drops from 81% in “Adapted He-Vialle” compiled with nvcc to 24% in SYCL-nGPU (a more detailed analysis is performed in Sect. 7.3). The same concern was also found in the “Adapted He-Vialle” code compiled with nvc++. From the SYCL perspective, we tested the code using different compilers: Intel LLVM, hipSYCL (clang), and hipSYCL (nvc++). All executables report runtimes similar to “Adapted He-Vialle” compiled with nvc++. However, we would like to note that under similar conditions, the SYCL version (“SYCL-nGPU” with hipSYCL and nvc++) runs on average \(\times 1.15\) faster than the CUDA version of “Adapted He-Vialle” compiled with nvc++.

7.2 Comparing SYCL implementations among devices

This section evaluates the performance achieved by the different SYCL implementations, namely SYCL-CPU, SYCL-iGPU, and SYCL-nGPU, on various devices. Since SYCL code can theoretically run on any device, we compare the performance achieved on the Intel Core i7-10700, NVIDIA RTX 3090, and Intel Arc A770 devices used in this study. In addition, we show the behavior of the SYCL-port version, which performs better than the other non-fine-tuned counterparts. The key focus of this section is to provide a comprehensive analysis of the performance of the different SYCL implementations on various devices.

Figure 10 displays the execution times observed with the SYCL implementations running on the Core i7 CPU. Even though it is possible to run the SYCL code targeting the other devices, observations reveal a significant drop in performance, with an average degradation of 66% and 57% for the SYCL-iGPU and SYCL-nGPU implementations, respectively, compared to the optimal performance achieved on their designated architectures. Regarding the SYCL-port version, the average performance difference with respect to SYCL-CPU was 46%, making it the best performing non-tuned implementation.

Fig. 10 Run-time (s) by each SYCL implementation on the Intel Core i7-10700

Concerning the NVIDIA RTX 3090 GPU, Fig. 11 shows the execution times for all the versions. We notice that the ad-hoc version, denoted SYCL-nGPU, is on average 72% faster than SYCL-CPU and SYCL-iGPU. In this case, the performance achieved by SYCL-port is quite similar to SYCL-nGPU.

Fig. 11 Run-time (s) by each SYCL implementation on the NVIDIA RTX 3090

Figure 12 displays the performance variation of the different SYCL implementations on the Arc GPU. As expected, the SYCL-iGPU implementation outperforms the rest of the implementations. However, compared to the SYCL-nGPU version, which is optimized for NVIDIA GPUs, the performance loss is only 26%; for the SYCL-port version this gap is reduced to only 16%. The CPU implementation on this device is on average 43% slower than the ad-hoc counterpart, due to the impact of the update kernel, which cannot fully leverage the higher degree of parallelism of GPU devices.

Fig. 12 Run-time (s) by each SYCL implementation on the Intel Arc A770

7.3 Profiling implementations

One last comparison is addressed in Figs. 13 and 14, which show the time distribution when different implementations and devices are evaluated. The plots split the time between the assign clusters kernel and the update kernel.

Focusing on the Intel Core i7 behaviour (see Fig. 13), the primary factor affecting execution time is the assign clusters kernel, which accounts for up to 92% of the total execution time in He-Vialle’s implementation and an average of 88% in our SYCL versions. It is important to note that both implementations are compute-bound due to the high parallelism level of this routine [36]. When using the SYCL-nGPU or SYCL-iGPU implementations, the update kernel acts as the bottleneck, reaching 36% and 60% of the overall time, respectively. While the SYCL-port solution can reduce the overall computation time, it does not address the bottlenecks found in the SYCL-nGPU version.

Fig. 13 Kernel time profiling for the Intel Core i7 and Arc A770 and their leading implementations. Results obtained from the same dataset, setting 20 iterations and 256 clusters

Figure 14 compares the kernel performance of each implementation on the NVIDIA GPU. First of all, we noticed insignificant time differences between the original He-Vialle and our “Adapted He-Vialle” (compiled with nvcc). However, when compiling “Adapted He-Vialle” with nvc++, the assign clusters time rises to up to 93% of the total. This is mainly produced by the inefficient use of the memory hierarchy (an L1 hit rate of 81% with nvcc versus 24% with nvc++). Among the SYCL implementations, the SYCL-nGPU version also presents the same compiler-related issue.

Lastly, we also included the results of the SYCL-CPU, SYCL-iGPU and SYCL-port implementations on the RTX 3090. In this case, the most time-consuming kernel is assign clusters, which is shared by both implementations (SYCL-CPU and SYCL-iGPU), and it obtains a speedup of \(\times 0.35\). Moving to the update kernel, the SYCL-CPU implementation obtains \(\times 0.03\) while SYCL-iGPU obtains \(\times 0.4\). These figures reflect the fact that using common kernels across different devices does not leverage the device architecture. Given that the majority of the SYCL-port code is similar to the SYCL-nGPU code, the same considerations reasonably apply to it.

Fig. 14 Kernel time profiling for the NVIDIA RTX 3090 and its leading implementations. Results obtained from the same dataset, setting 100 iterations and 256 clusters

Due to the poor performance observed in the assign clusters kernel in some implementations, we attempted to identify the main constraints. The NVIDIA Nsight tools [37] can be used to detect bottlenecks. When comparing the “Adapted He-Vialle” version compiled with nvc++ and nvcc, a significant difference in Compute (SM) Throughput (20% vs 98% for the nvc++ and nvcc compilers, respectively) was observed. This degradation is primarily caused by poor exploitation of the memory hierarchy, resulting in a decrease in the L1/TEX hit rate and increased data transfer between the L2 and L1/TEX caches and between the device memory and the L2 cache. The Memory-Chart analysis can be used to understand in detail the volume of data transferred between all levels of the memory hierarchy.

We have noticed that the memory hierarchy exploitation differs depending on the implementation, the compiler, and the k-means configuration. Therefore, Fig. 15 analyzes the evolution of the volume of data transferred among memory levels for the original He-Vialle, the Adapted He-Vialle versions with the nvcc and nvc++ compilers, and the SYCL-nGPU implementation with the nvc++ compiler. As can be seen, the memory behaviour is almost the same when the problem has fewer than 16 dimensions, but it starts to diverge from this point.

Based on the previous results, one could conclude that our development known as “Adapted He-Vialle” is equivalent (at least in performance) to the original He-Vialle implementation, but offers lower performance in SYCL. It is worth noting that in Fig. 15 the number of transfers is nearly the same for the state-of-the-art version labeled “He-Vialle (nvcc)” and the adapted version labeled “Adapted He-Vialle (nvcc)”. However, the same version recompiled with nvc++, denoted “Adapted He-Vialle (nvc++)”, displays a significant increase in the number of transfers, which implies significant performance degradation. The figure also reveals that the cause of the suboptimal performance of SYCL implementations such as “SYCL-nGPU hipSYCL (nvc++)” is the higher number of memory transfers generated by the nvc++ compiler (note the identical behaviour of SYCL-nGPU hipSYCL (nvc++) and Adapted He-Vialle (nvc++)), rather than the programming model used.

Fig. 15 Total data moved from DRAM to the L2 and L1 caches. The number of iterations was fixed to one, the number of clusters to four, and the dimensions varied from 2 to 68

7.4 Performance portability

Performance portability, as defined in [38], requires that an application executes a particular problem correctly on different platforms. It is important to note that the application’s behavior and its performance depend on the problem being solved. Table 4 collects the performance portability \(\Phi\) as defined in [38]. The performance achieved by the different implementations is grouped into two categories: architectural efficiency and application efficiency. Architectural efficiency represents the ratio of the achieved performance to the theoretical hardware peak performance, while application efficiency represents the ratio of the achieved performance to that of the fastest implementation in any programming model.
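For reference, the metric \(\Phi\) of [38] is the harmonic mean of these efficiencies over the platform set \(H\):

\[
\Phi(a, p, H) =
  \begin{cases}
    \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a, p)}} & \text{if application } a \text{ is supported on all } i \in H,\\
    0 & \text{otherwise,}
  \end{cases}
\]

where \(e_i(a, p)\) is the efficiency (architectural or application) achieved by application \(a\) solving problem \(p\) on platform \(i\).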

Table 4 Performance portability of k-means for the different implementations on the three platform sets

Table 4 analyzes the efficiency ratios of both the ad-hoc implementations (“SYCL-CPU/SYCL-iGPU/SYCL-nGPU” in the table) and the tuned “SYCL-port” counterpart across different architectures. As can be seen, the results differ depending on the selected target. As in the previous sections, the ad-hoc version is significantly more efficient than its portable counterpart. Focusing on the CPU, despite the lower performance of the centroid update kernel, which reduces the overall efficiency, the SYCL-CPU version obtains architectural efficiency rates of up to 37% with higher cluster counts. On the GPU accelerators, although the “SYCL-port” version performed reasonably well across NVIDIA and Intel GPUs in comparison with the ad-hoc SYCL implementations, both codes suffer a significant drop in efficiency, where the update kernel, based on reduction operations, plays a crucial role in determining the final performance.

The efficiency of the ad-hoc and portable versions of the SYCL and CUDA-based implementations for NVIDIA GPUs was evaluated in more detail. We found that both the ad-hoc and portable versions of SYCL have low efficiency rates, similar to the optimized state-of-the-art He-Vialle version based on CUDA. Our observations indicate that the SYCL implementation using NVIDIA nvc++ yields even worse architectural and application efficiencies (0.02% and 19%, respectively) than the aforementioned implementations. Additionally, the He-Vialle implementation with the NVIDIA nvcc compiler also demonstrated poor performance, with architectural and application efficiency rates of 0.05% and 82%, respectively. The poor efficiency rates are due to (1) inefficient exploitation of the memory hierarchy by the compilers, as discussed in Sect. 7.3, and (2) the impact of the update kernels due to their memory-bound nature.

8 Conclusion

This paper presents a study on the implementation of the k-means algorithm in SYCL, a programming model that allows running on a wide variety of devices such as CPUs and multi-vendor GPUs. The motivation behind this work is the need for portable and efficient solutions for data analysis techniques, such as k-means, that can take advantage of the computational capabilities of modern heterogeneous devices.

To evaluate the performance of the SYCL k-means implementation, we compared several versions against the state-of-the-art He-Vialle implementation, which is able to run on both CPUs and NVIDIA GPUs. The SYCL implementation showed performance comparable to the original CUDA version on NVIDIA GPUs, but suffered from some compiler limitations related to memory hierarchy exploitation. When these limitations did not apply, SYCL was able to achieve the same performance as the original CUDA implementation in most scenarios.

Apart from the comparison with the He-Vialle implementation, we also analyzed the performance portability of SYCL on Intel/NVIDIA GPUs and CPUs. Our study revealed that, by accepting a minor decrease in performance (especially on GPUs), we were able to develop a portable version, denoted “SYCL-port” (auto-tuned to each architecture), that suits all architectures well and favors performance portability.

Overall, our results demonstrate the potential of SYCL for implementing efficient and portable solutions for data analysis techniques like k-means. However, although one of SYCL’s strengths is portability, favoring writing once and running anywhere, this paper emphasizes that achieving high performance equivalent to native implementations requires hand-tuned refactoring to efficiently exploit key architectural aspects such as the memory hierarchy, non-uniform parallelism patterns, and efficient synchronization mechanisms in the challenging reduction operations.

As future work, we plan to improve the SYCL k-means implementation to further reduce the performance gap with the original CUDA version, and to evaluate other device types such as AMD GPUs or other novel architectures such as asymmetric multicores.