1 Introduction

High-performance computing (HPC) systems increasingly adopt heterogeneous architectures that combine different kinds of processors. NEC SX-Aurora TSUBASA (SX-AT) is one such heterogeneous computing system, consisting of x86 processors and vector processors, called Vector Hosts (VHs) and Vector Engines (VEs), respectively (Yamada and Momose 2018). A VE is physically implemented as a PCI-Express card, similar to an accelerator such as a graphics processing unit (GPU). A VH, on the other hand, is responsible for executing the operating system (OS) and managing VEs. Within one compute node, each VH can manage multiple VEs. Such a node of VHs and VEs is called a Vector Island (VI). The hardware configuration of one VI is illustrated in Fig. 1. Since the latest-generation VE has a high memory bandwidth of 1.53 TB/s (Egawa et al. 2020), VEs are expected to achieve high sustained performance in executing memory-intensive scientific computations while using the standard x86 environment provided by the VH (Komatsu et al. 2018).

Fig. 1 The hardware configuration of a Vector Island (VI) equipped with multiple Vector Hosts (VHs) and Vector Engines (VEs). VHs and VEs are connected via PCIe links

Unlike other accelerators such as GPUs, a VE can execute an application as if the whole application were running on the VE. However, when an application running on a VE invokes a system call, the system call is implicitly forwarded to the VH and processed by the OS running on the VH. In addition to the VH's CPU time for handling system calls, other computing resources such as the VI's network bandwidth are shared by the VEs. Thus, on a large SX-AT system shared by many users, each VI is exclusively assigned to a job so as to avoid performance interference among jobs, which could occur when VHs are shared. For example, in the AOBA system installed at Tohoku University Cyberscience Center (Takizawa et al. 2023), multiple jobs do not usually share one VI, and one VI may co-execute multiple jobs only if each of the jobs uses a single VE in the VI. Under such an operation policy, a job does not necessarily use all VEs in the assigned VI, and some of the VEs thus remain unused during the job execution. Therefore, if multiple jobs are assigned to one VI so that more VHs and VEs are used for execution, the utilization of computing resources can be increased. However, multiple jobs running on a VI may simultaneously require the same computing resource. This is a so-called resource conflict, and it could cause severe performance degradation. For this reason, understanding the performance interference between multiple jobs running on a VI is an important technical issue for achieving high efficiency on SX-AT systems.

This paper first empirically investigates the performance interference between a VH and a VE when each of the two processors executes a different workload. We then discuss workload co-execution that improves resource utilization while reducing the risk of performance interference due to resource conflicts. Since monitoring system calls incurs considerable runtime overheads, this paper proposes a workload co-execution strategy based on a simple but practical approach to identifying a pair of VE and VH workloads that could cause severe performance interference. The proposed approach assumes that the CPU load of each workload is known in advance, as assumed in Xiong et al. (2018); for example, a job scheduler runs each job alone, without co-execution, the first time, and records its CPU load to determine whether the job can be co-executed next time. Under this assumption, this paper quantitatively discusses how accurately the CPU load can be used as an indicator to predict performance interference between VH and VE workloads. Evaluation results demonstrate that the system call frequency of a workload is a good indicator for identifying a pair of VH and VE workloads causing frequent resource conflicts, and thus for reducing the risk of severe performance degradation while improving resource utilization. The evaluation also clearly shows that, instead of monitoring system calls, the proposed approach can accurately predict whether a pair of VE and VH workloads causes severe performance interference by executing each workload alone in advance and measuring its CPU load. The novelty of this paper is to show the feasibility of this risk prediction based on the CPU load of each workload, without directly using the system call frequency. As a result of the risk prediction, the proposed approach can reduce the makespan without significantly increasing the turn-around time.

2 Resource conflicts on an SX-Aurora TSUBASA system

Although there are several accelerator-like programming models that use both VHs and VEs for one application, such as those in Takizawa et al. (2021) and Ke et al. (2021), this paper assumes that each workload is either a VH or a VE workload, meaning that a user process is executed using only either a VH or a VE. When either a VH or a VE is used for executing a workload, the other processor is idle, and this paper thus discusses whether another workload should be executed on the idle processor (co-execution). This section briefly reviews the resource conflicts among VH and VE workloads co-executing in one VI.

Fig. 2 User and kernel memory spaces of a VE process reside on different memory devices

Figure 2 illustrates how the user and kernel memory spaces are physically located on an SX-AT system. Each VE has its own memory devices to provide high memory bandwidth, and the memory hierarchy of a VE is physically different from that of a VH even if they are within the same VI. In principle, as with accelerators such as GPUs, VHs and VEs need to communicate via the PCI-Express link for exchanging data, which could be time-consuming. Therefore, on an SX-AT system, all data in the user memory space of a VE workload are physically stored on the VE-side memory device, and thus the VE workload can enjoy the VE’s high memory bandwidth while programmers do not need to care about the VH-VE data transfer.

When an application is running on a standard x86 Linux system, the application has a user memory space, which is logically different from the kernel memory space used by the OS. On the other hand, when a user memory space is assigned to an application running on a VE, unlike the standard system, the user memory space physically resides on the memory devices attached to the VE. However, even on an SX-AT system, the kernel memory space is located in the VH memory. Namely, when an application is running on a VE, its user memory space is not only logically but also physically isolated from the kernel memory space. System calls on the VE are forwarded to a dedicated process running on the VH, called a pseudo VE process, that invokes the corresponding system calls on the VH to call the OS kernel. Accordingly, when an application is running on a VE, it internally uses a VH within the VI.
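To make this mechanism concrete, the following is a minimal conceptual sketch in Python of the forwarding idea. It is not NEC's actual VEOS implementation; the process and function names are ours, and a real implementation exchanges requests over the PCI-Express link rather than an OS pipe. The VE side sends a system call request instead of trapping into a local kernel, and the VH-side handler invokes the real system call and returns the result.

```python
import os
import multiprocessing as mp

def pseudo_ve_process(conn):
    """VH-side stand-in for the pseudo VE process: it receives forwarded
    system-call requests and invokes the real system call on the VH OS."""
    while True:
        req = conn.recv()
        if req is None:          # shutdown request
            break
        name, args = req
        if name == "write":      # invoke the actual syscall on the VH
            result = os.write(*args)
        elif name == "getpid":
            result = os.getpid()
        else:
            result = -1
        conn.send(result)        # return value travels back to the VE side

if __name__ == "__main__":
    ve_end, vh_end = mp.Pipe()
    handler = mp.Process(target=pseudo_ve_process, args=(vh_end,))
    handler.start()

    # "VE side": instead of trapping into the local kernel, forward the call.
    ve_end.send(("write", (1, b"hello from the VE side\n")))
    print("bytes written:", ve_end.recv())

    ve_end.send(None)
    handler.join()
```

This sketch illustrates why every system call of a VE workload consumes CPU time on the VH: each request is always serviced by a VH-resident process.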

Moreover, if a VH and a VE within one VI each execute a different application, the VH and VE workloads share the VH while having their own memory spaces, which are logically and physically isolated from each other, as shown in Fig. 2. In this case, if the VE workload invokes a system call, the system call request is forwarded to the pseudo VE process, and the VH workload is then context-switched to the pseudo VE process so that the VH core can handle the system call request from the VE workload. Since the pseudo VE process spends CPU time handling the system call request, the execution of the VH workload is delayed, degrading the VH performance. Conversely, if the VH workload cannot immediately be switched to the pseudo VE process for any reason, the system call from the VE workload might be delayed, degrading the VE performance. In this way, VH and VE workloads may compete for the same computing resources, such as the VH's CPU time, the VH's memory bandwidth, network bandwidth, and file access. Therefore, the VH and VE workloads can affect each other's performance, which is referred to as inter-process performance interference.

Performance interference is expected to occur primarily when a workload on either the VH or the VE intensively uses shared computing resources. For example, suppose that a memory-intensive workload is running on one of the VH cores. If the memory bandwidth is exhausted, the memory access latency of the pseudo VE process increases, and thus the system call from the VE workload is delayed, degrading the VE performance. As in research dealing with performance interference on a single processor (Xiong et al. 2018), to maximize the benefits of concurrency while keeping the overall performance degradation under control, it is necessary to quantitatively clarify the characteristics of applications that cause conflicts. As one major kind of resource conflict, it is known that the total execution time of a workload increases due to access conflicts on the file system (Aceituno et al. 2021), which is one of the shared computing resources. If such a root cause of performance interference is known in advance, it becomes possible to schedule jobs so as to avoid resource conflicts among them.

3 Preliminary evaluation of performance interference by workload co-execution

3.1 Evaluation setup

In this work, we experimentally investigate the effect of co-executing various VH and VE workloads on their performance, and identify the combinations of VH and VE workloads that cause severe performance interference on SX-AT. The execution time of each workload is adjusted to be almost the same. In the evaluation, the popular benchmarks Himeno (Himeno 2001), IOR (Shan et al. 2008), Intel MPI (Intel Corporation 2018), STREAM (McCalpin 1995), b_eff (Rabenseifner et al. 2001), MiniAMR (Sasidharan and Snir 2016), and HPL (Petitet et al. 2018) are first used as VH and VE workloads for a general discussion of performance interference. After that, we further examine performance interference with small benchmark programs that intensively use only particular computing resources, such as CPU time, memory bandwidth, file I/O, and the network. Each benchmark program is compiled for both a VH and a VE, and executed using all cores in the processor. The specifications of the system used in the following evaluations are listed in Table 1.

Table 1 Hardware configuration of NEC SX-Aurora TSUBASA A300-8

3.2 Interference evaluation results

Fig. 3 Slowdown rates of benchmarks running on the VH when running another benchmark on the VE

Fig. 4 Slowdown rates of benchmarks running on the VE when running another benchmark on the VH

Fig. 5 System call frequency of each benchmark

We evaluate the changes in execution time when a VH and a VE within one VI co-execute the benchmark programs, expecting that performance interference will increase the execution time. Figure 3 shows the increase in execution time of each VH benchmark program, referred to as the slowdown rate, while changing the combination of VH and VE workloads. Similarly, Fig. 4 shows the slowdown rate of each VE benchmark program. The system call frequency of each benchmark program is shown in Fig. 5.

Comparing Figs. 3 and 4, we can see that, when workloads are executed simultaneously on a VH and a VE, the overall performance of the VH tends to degrade more than that of the VE. This is because computational resources of the VH, such as CPU time and memory bandwidth, are consumed not only by the VH workload but also by the pseudo VE process handling system calls from the VE workload. The performance degradation of the VE is smaller than that of the VH because the computational resources of the VE are allocated exclusively to each VE workload. In this case, the slowdown at co-execution of VH and VE workloads is due to the overhead of handling system call requests from the VE workload. In practice, for most scientific and technical computing applications, the system call overhead is not significant because a large part of the total execution time is spent executing the kernel loop. Therefore, these findings indicate that co-execution of VH and VE workloads is a promising approach to improving system utilization without severe performance degradation, except in some extreme cases.

Figure 3 shows that the execution time of every VH workload increases when the IOR benchmark is run on the VE. This performance degradation of the VH workloads is caused by resource conflicts when the IOR benchmark is executed on the VE side. Similarly, Fig. 4 shows that, when the IOR benchmark is executed on the VE side, the execution time of the VE workload increases regardless of which benchmark is executed on the VH side. This means that performance degradation occurs for both the VH and VE workloads when the IOR benchmark runs as a VE workload, mainly because the IOR benchmark frequently invokes system calls to measure file I/O performance. On the VH cores, the VH workload and the pseudo VE process run concurrently, and the pseudo VE process frequently becomes active to handle system call requests from the VE workload, so that neither process can fully utilize the VH resources. This slows both the execution of the VH workload and the handling of the VE workload's system call requests, which manifests as performance degradation for each benchmark.

Figure 5 shows the frequency of system call invocations for each benchmark. The graph shows that the IOR benchmark invokes system calls for file I/O operations much more frequently than the other benchmarks. Thus, as discussed in Sect. 2, frequent context switches on the VH CPU to handle system calls from the IOR benchmark running on the VE can hinder the workloads running on the VH from consuming CPU time and other shared computing resources. In addition, VH workloads running in MPI parallel are generally more sensitive to conflicts on computing resources than VH and VE workloads that do not. This is because the loads among MPI processes become unbalanced when some of the VH cores used for computation also execute pseudo VE processes while other cores do not, resulting in long delays at synchronization points such as collective communication.

Fig. 6 Slowdown rates of VH workloads when changing system call frequencies and types on the VE

Fig. 7 Changes in elapsed time of VE workloads when changing the system call frequency and type on the VE

To analyze in more detail the performance degradation caused by system call invocations, we developed a micro-benchmark that invokes typical system calls at arbitrary intervals. The micro-benchmark is executed on a VE while another typical benchmark is executed simultaneously on a VH. In this work, we evaluate how the performance degradation caused by resource conflicts varies with the type and frequency of system calls by repeatedly invoking a set of system calls at fixed time intervals.
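The micro-benchmark itself is compiled for the VE; the following Python sketch only illustrates its structure under our assumptions (the file path, buffer size, and chosen system call are arbitrary): one system call is issued repeatedly, and the sleep interval controls the system call frequency.

```python
import os
import time

def syscall_microbench(interval_s, duration_s, path="/tmp/microbench.dat"):
    """Repeatedly invoke a file-I/O system call at a fixed interval so that
    the system-call frequency (1 / interval_s) is an experiment parameter."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    buf = b"x" * 4096
    calls = 0
    end = time.monotonic() + duration_s
    try:
        while time.monotonic() < end:
            os.write(fd, buf)        # the system call under test
            calls += 1
            time.sleep(interval_s)   # controls the invocation frequency
    finally:
        os.close(fd)
        os.unlink(path)
    return calls

if __name__ == "__main__":
    # e.g., roughly 100 write() calls per second for 10 seconds
    n = syscall_microbench(interval_s=0.01, duration_s=10)
    print(f"invoked {n} system calls")
```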

We now discuss the evaluation results on the performance degradation caused by conflicts while varying the types and frequencies of system calls made by co-executing workloads. Figures 6 and 7 show the slowdown rates (the increase in execution time) of the standard benchmark on the VH and the micro-benchmark on the VE, respectively. As in Figs. 3 and 4, these figures show, for each combination of benchmarks, the slowdown rate of each benchmark relative to its execution alone.

When the micro-benchmarks are executed on the VE side, the execution times of both the VH workload and the VE workload increase with the frequency of system call invocations. On the other hand, when the micro-benchmarks are executed on the VH side, the VE workload is affected only by IOR, which invokes system calls frequently, as shown in Nunokawa et al. (2022). These results clearly show that the frequency of system calls is correlated with the performance degradation that occurs when VH and VE workloads are executed concurrently. System calls may cause context switches, switches between kernel and user modes, and accesses to shared computing resources on the VH side. The higher the frequency of system call invocations, the more frequently the corresponding pseudo VE process runs on the VH, and thus the more pronounced the performance degradation due to conflicts on the VH's computing resources becomes. For the VH workload, the resource conflicts on the VH delay the overall processing. For the VE workload, the resource conflicts on the VH delay the system call handling by the pseudo VE process running on the VH, which increases the execution time.

The results in this section indicate that the frequency of system calls can be used to determine whether a workload will degrade the performance of other concurrently executed workloads.

4 A co-execution strategy with runtime identification of conflicting workload pairs

This section proposes a co-execution strategy based on runtime identification of conflicting workload pairs. In the proposed strategy, if a new workload is predicted not to conflict with another workload already running, the two workloads are co-executed. Therefore, the key is how to predict whether a pair of workloads causes resource conflicts.

In SX-AT, the latency of a system call from a VE workload becomes longer if the VH is intensively used by another workload, because the pseudo VE process takes a long time to handle the system call while sharing the VH with the intensive VH workload. Thus, as discussed in Sect. 3, the system call frequency of a workload can be a good indicator for predicting whether it could degrade the performance of another workload. However, monitoring a target process from another process, suspending the target process at every system call to obtain the system call type, and then resuming the target process incurs a huge runtime overhead. It is therefore difficult to monitor every system call at runtime, and we need a more practical approach to identifying, at runtime, a pair of VE and VH workloads that causes performance interference.

Executing the pseudo VE process to handle a system call from a VE workload increases the CPU load of the VH, referred to as the VH load. If the system call frequency of a VE workload is high, the total VH load of the pseudo VE process and the co-executing VH workload could reach 100%, meaning that the performance demand on the VH exceeds its actual performance. On the other hand, if the system call frequency of a VE workload is low, the total VH load also remains low. Accordingly, the VH load can approximate the system call frequency of a VE workload, and it can easily be obtained by executing the VE workload once without co-execution.

The proposed approach assumes that each of the VE and VH workloads has already been executed once without co-execution to obtain its average VH load. Then, the compute intensity of a workload pair, \(C_{\textrm{all}}\), is defined by

$$\begin{aligned} C_{\textrm{all}} = U_{\textrm{VH}} + U_{\textrm{VE}}, \end{aligned}$$
(1)

where \(U_{\textrm{VH}}\) and \(U_{\textrm{VE}}\) are the VH loads of co-running VH and VE workloads, respectively.

If \(C_{\textrm{all}}\) exceeds 100%, the VH is obviously overloaded and the performance degrades. However, it is unclear how the performance changes when \(C_{\textrm{all}}\) is smaller than 100%. In this work, the VH load of each workload is retrieved from a performance counter value. Since enabling a performance counter could cause a large runtime overhead, the performance counter value is retrieved at a fixed interval, e.g., 1 s. The VH loads of the VH workloads, as well as of the pseudo VE processes for the VE workloads, are listed in Tables 2, 3, 4, and 5.

Table 2 VH loads of standard benchmarks on VH
Table 3 VH loads of standard benchmarks on VE
Table 4 VH load of micro-benchmarks on VH
Table 5 VH load of micro-benchmarks on VE
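As an illustration of this fixed-interval sampling, the sketch below estimates the CPU utilization of a Linux host from /proc/stat once per second and averages the samples. This is a stand-in we assume for explanation; the actual evaluation retrieves the VH load from a performance counter, and the /proc/stat arithmetic is only an approximation.

```python
import time

def read_cpu_times():
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = [int(v) for v in f.readline().split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait
    return idle, sum(fields)

def average_vh_load(duration_s, interval_s=1.0):
    """Sample the CPU utilization every interval_s seconds and return the average (%)."""
    samples = []
    idle0, total0 = read_cpu_times()
    for _ in range(int(duration_s / interval_s)):
        time.sleep(interval_s)
        idle1, total1 = read_cpu_times()
        busy = 1.0 - (idle1 - idle0) / (total1 - total0)
        samples.append(100.0 * busy)
        idle0, total0 = idle1, total1
    return sum(samples) / len(samples)

if __name__ == "__main__":
    print(f"average VH load: {average_vh_load(10):.1f}%")
```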

Figures 8 and 9 show the slowdown rate in execution time and the total VH load, i.e., \(C_{\textrm{all}}\), for each pair of VH and VE workloads. The number in each cell indicates the slowdown rate, and the color indicates the total VH load. Figure 10 shows the relationship between the slowdown rate and the total VH load for every pair in Figs. 8 and 9. The correlation ratio between the two values is 0.738, indicating a strong positive correlation. These results show that the total VH load, \(C_{\textrm{all}}\), correlates with the performance degradation due to resource conflicts on the VH. The performance gradually degrades as the total VH load exceeds about 50%, and the degradation becomes remarkable when the total VH load exceeds 100%. Recall that each physical core of the VH in Table 1 works as two logical cores. Thus, when the total VH load is 50%, all the physical cores are in use, while all the logical cores are fully occupied when the total VH load reaches 100%. When the total VH load is less than 50%, the performance degradation due to resource conflicts is not significant. Since the system call frequency is generally small for scientific computations, the execution mostly proceeds on the VE alone, and it is unlikely that the VH becomes too busy for the pseudo VE process to handle system calls. Accordingly, co-execution of VH and VE workloads is a promising way to increase the system throughput if we can avoid co-executing a workload pair whose \(C_{\textrm{all}}\) is large.

Fig. 8 The slowdown rate and total VH load for each pair of a standard benchmark on the VH and a micro-benchmark on the VE

Fig. 9 The slowdown rate and total VH load for each pair of a standard benchmark on the VE and a micro-benchmark on the VH

Fig. 10 The relationship between the slowdown rate and the total VH load

The results above show that the system call frequency can be approximated by the total VH load, \(C_{\textrm{all}}\), to identify a workload pair causing severe performance degradation. A co-execution strategy is needed to decide whether a workload should be co-executed when either the VH or the VE becomes idle. If a new workload is expected to cause resource conflicts with the workload already running, the proposed co-execution strategy decides not to co-execute the workload pair. Specifically, if \(C_{\textrm{all}}\) exceeds a threshold, severe resource conflicts are expected to occur. Therefore, the proposed conflict-aware co-execution strategy searches the remaining workload set for another workload with which \(C_{\textrm{all}}\) does not exceed the threshold. If such a workload exists, it is co-executed with the already running workload. Otherwise, the idle processor waits until the other processor finishes the current workload.
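The decision logic of the strategy can be summarized by the following sketch; the function names are ours, and a production job scheduler would integrate this check into its dispatching loop.

```python
def c_all(u_vh, u_ve):
    """Compute intensity of a workload pair (Eq. 1): the sum of the VH loads (%)."""
    return u_vh + u_ve

def pick_coexec_candidate(running_load, waiting_loads, threshold=100.0):
    """Return the index of the first waiting workload whose co-execution with
    the running workload keeps C_all within the threshold, or None to keep
    the idle processor waiting until the running workload finishes."""
    for i, load in enumerate(waiting_loads):
        if c_all(running_load, load) <= threshold:
            return i
    return None

# A VE workload whose pseudo VE process consumes 40% of the VH cannot be
# paired with a 70% VH workload under a 100% threshold, but a 55% one can.
print(pick_coexec_candidate(40.0, [70.0, 55.0]))  # -> 1
```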

5 Evaluation of conflict-aware workload co-execution

This section discusses the effects of conflict-aware workload co-execution in terms of makespan and turn-around time. Since the turn-around time is never improved by co-execution, we demonstrate that the proposed approach can shorten the makespan while increasing the turn-around time as little as possible.

5.1 Evaluation methodology

The following evaluation uses the benchmarks from Sect. 3. A workload set is generated by assuming that those benchmarks are executed one by one on both the VH and the VE, and then the makespan of the workload set and the turn-around time of each benchmark are calculated. The benchmarks to be executed on the VH and the VE are selected for co-execution based on either unconditional co-execution or the proposed conflict-aware co-execution.

As there are seven standard benchmarks and 12 micro-benchmarks, the number of possible permutations for each of the VH and the VE is 19!. Thus, the number of permutations for co-execution is \(19!\times 19!\approx 10^{34}\), and it is infeasible to actually execute all of them. Therefore, in this paper, we simulate the execution to discuss the effects of conflict-aware co-execution. In the following evaluation, we use unconditional co-execution and the proposed conflict-aware co-execution with different threshold values. The threshold values used in the evaluation are 50%, 100%, 150%, and \(\infty\); the threshold of \(\infty\) corresponds to unconditional co-execution, where the next workload immediately starts running whenever the preceding workload finishes.

The VH loads measured in advance are used to calculate \(C_{\textrm{all}}\) and thereby predict whether severe resource conflicts will occur, while the slowdown rates measured in advance determine the simulated execution times. If either the VH or the VE becomes idle, the idle processor can start executing a workload only if the new workload does not cause resource conflicts with the workload already running on the other processor, which is checked by testing whether \(C_{\textrm{all}}\) exceeds the threshold. When both the VH and the VE become idle and the next workload pair is expected to cause severe resource conflicts, either the VH or the VE workload is executed first: the VH workload is executed if the number of remaining VH workloads is larger than the number of remaining VE workloads, and vice versa.

The simulation procedure is as follows. First, each pair of VH and VE workloads is executed to measure the slowdown rates, and each workload is also executed alone to measure its VH load. Then, for each of the VH and the VE, a workload set executing the 19 benchmarks one by one in a random order is organized, assuming the execution time of each benchmark is 100 s. After that, the makespan and turn-around time are calculated based on the measurements. Suppose, for example, that the slowdown rates are 1.4 and 1.2 for the VH and VE workloads, respectively; in the case of their co-execution, the VH and VE workloads then take 140 and 120 s, respectively. 120 s after the simulation start, the VE becomes idle, and the co-execution strategy might start executing one of the subsequent VE workloads, or decide to keep the VE idle for 20 s to prevent severe performance degradation. In this way, workload co-execution is simulated until all VH and VE workloads have finished. These steps are repeated 10,000 times to calculate the average makespan and turn-around time.
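The following sketch outlines one possible implementation of such a simulation under simplifying assumptions of ours: a pair's slowdown rate is applied only while the two workloads actually overlap, time advances in fixed steps, and the tie-breaking rule described above is omitted. All names and the example numbers are illustrative, not the exact simulator used in the paper.

```python
BASE = 100.0  # execution time of every workload when run alone (s)

def simulate(vh_queue, ve_queue, vh_load, ve_load, slowdown, threshold, dt=1.0):
    """Simulate co-execution of two workload queues. vh_load/ve_load map a
    workload name to its VH load (%); slowdown maps a (vh, ve) name pair to
    the pair's (vh_rate, ve_rate). Returns the makespan and per-workload
    turn-around times (from workload start to finish)."""
    vh_queue, ve_queue = list(vh_queue), list(ve_queue)
    run = {"VH": None, "VE": None}   # side -> [name, remaining work, start time]
    turnaround, t = {}, 0.0
    while vh_queue or ve_queue or run["VH"] or run["VE"]:
        # start a workload on each idle side if the pair is predicted safe
        for side, queue in (("VH", vh_queue), ("VE", ve_queue)):
            if run[side] or not queue:
                continue
            other = run["VE" if side == "VH" else "VH"]
            my_load = (vh_load if side == "VH" else ve_load)[queue[0]]
            other_load = 0.0 if other is None else \
                (ve_load if side == "VH" else vh_load)[other[0]]
            if my_load + other_load <= threshold:   # C_all check (Eq. 1)
                run[side] = [queue.pop(0), BASE, t]
        # advance time; co-running workloads progress more slowly
        for side in ("VH", "VE"):
            job = run[side]
            if job is None:
                continue
            other = run["VE" if side == "VH" else "VH"]
            if other is None:
                rate = 1.0                          # running alone
            else:                                   # co-running: apply slowdown
                pair = (job[0], other[0]) if side == "VH" else (other[0], job[0])
                rate = slowdown[pair][0 if side == "VH" else 1]
            job[1] -= dt / rate
            if job[1] <= 0.0:
                turnaround[(side, job[0])] = t + dt - job[2]
                run[side] = None
        t += dt
    return t, turnaround

# Hypothetical two-workload example matching the slowdown rates in the text.
makespan, tat = simulate(["a"], ["b"], vh_load={"a": 60.0}, ve_load={"b": 30.0},
                         slowdown={("a", "b"): (1.4, 1.2)}, threshold=100.0)
print(makespan, tat)
```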

5.2 Evaluation results and discussions

The co-execution strategies are evaluated in terms of the average turn-around time, the worst turn-around time, and the makespan. The average turn-around time is calculated for each of the VH and VE workloads. If every workload is executed alone, there is no resource conflict and each workload takes exactly 100 s to execute. If all VH and VE workloads are executed sequentially without co-execution, it takes \(100\times 19\times 2=3800\) seconds to execute all the workloads, i.e., the makespan is 3800 s. By introducing workload co-execution, the makespan becomes shorter because the execution time of one workload is partially overlapped with that of a co-executing one, while the turn-around time of each workload might become longer due to resource conflicts. Accordingly, there is a trade-off between the turn-around time and the makespan. The proposed conflict-aware co-execution strategy is expected to find a good trade-off point by adjusting the threshold value.

Figures 11, 12, and 13 show the changes in the average turn-around time, makespan, and worst turn-around time when adjusting the threshold. These figures visualize the summary statistics with box plots, in which the whiskers indicate the range of the data distribution, the central mark on each box indicates the median, and the bottom and top edges indicate the 25th and 75th percentiles, respectively. In the figures, "noplan" indicates unconditional co-execution, which can be considered a special case of the proposed strategy with the threshold of \(\infty\).

Fig. 11 Increase rate of average turn-around time by enabling workload co-execution

Fig. 12 Increase rate of makespan by enabling workload co-execution

Fig. 13 Increase rate of worst turn-around time by enabling workload co-execution

Figure 11 shows that the average turn-around time with the threshold of 50% is almost the same as that without co-execution, and the average turn-around time with the threshold of 150% is almost the same as that with unconditional co-execution. When the threshold is set to 100%, the average turn-around time is slightly longer than that without co-execution but much shorter than that with unconditional co-execution. Figure 12 shows that workload co-execution can shorten the makespan unless the threshold is as small as 50%. Even with the threshold of 50%, the makespan becomes slightly shorter, indicating that some workloads are still co-executed. Figure 13 shows that, if the threshold is too large, e.g., 150%, the proposed strategy cannot prevent a conflicting workload pair from being co-executed, and thus the worst turn-around time could be almost the same as that with unconditional co-execution. On the other hand, the proposed strategy with a threshold of 50% or 100% can reliably avoid co-execution of conflicting workload pairs and keep the worst turn-around time short. The results in Figs. 11, 12, and 13 demonstrate that each performance metric of the proposed strategy becomes closer to that of sequential execution as the threshold becomes smaller, and closer to that of unconditional co-execution as the threshold becomes larger. It is also demonstrated that, with the threshold set to 100%, which is intuitively optimal, the proposed strategy can reduce the makespan while preventing a significant increase in the turn-around time.

In this paper, the execution time of every workload is set to 100 s, and thus the negative effect of resource conflicts between such short-running workloads is not critical. In practice, however, a severely conflicting workload pair could be co-executed for a longer time, lowering the system throughput during the co-execution. It is thus important to predict conflicting workload pairs and avoid co-executing them. The proposed strategy with the threshold of 100% can reduce the makespan by 28.5% on average while keeping the worst turn-around time within 1.55 times that without co-execution. Therefore, we conclude that the proposed strategy can improve the makespan without critically increasing the worst turn-around time.

Although the threshold value of 100% seems intuitively reasonable for checking whether the VH is overloaded, the behavior can also be tuned for an individual system and/or system operation policy by adjusting the threshold value. For example, if user experience in terms of turn-around time is more important than system throughput, the threshold value should be set smaller so as to reduce the risk of resource conflicts. Since a smaller threshold makes it harder to co-execute workloads, the reduction in makespan becomes smaller, meaning that the proposed approach becomes less beneficial in terms of makespan reduction. As mentioned above, the makespan reduction with the threshold of 150% is not significant in this paper because of the short-running workloads; in practice, however, it could be significant at the cost of an increased turn-around time. Thus, a larger threshold could be beneficial if system throughput is important. One idea for automatically finding an optimal threshold would be to use machine learning to adjust the threshold value so as to improve a given optimality metric. Since the optimality of system operation could change dynamically, reinforcement learning models would be appropriate for adapting to such changes. In addition to incorporating the proposed approach into a job scheduler, automatic tuning of the threshold value will be discussed in our future work.

6 Related work

In Sect. 2, we reviewed how SX-AT adopts a heterogeneous configuration. Therefore, the challenges discussed in improving the computational efficiency of standard CPU-GPU systems might also provide useful insights for improving the SX-AT efficiency.

Due to its massive thread parallelism, a GPU workload running on a standard CPU-GPU system is likely to saturate shared hardware resources, such as memory and network bandwidths. Hence, Kayiran et al. (2014) have proposed a platform to control the performance trade-off between CPU and GPU workloads. The platform can dynamically determine the GPU concurrency level so as to maximize the system performance by considering system-wide memory and network conflict information as well as the state of the GPU cores.

Another related study introduces a runtime framework for scheduling each of multiple users’ OpenCL tasks to its optimal device, either a GPU or a CPU on a CPU-GPU system (Wen and O’Boyle 2017). The runtime framework uses a performance prediction model based on machine learning at runtime to select optimal devices.

Zhu et al. (2017) have proposed scheduling algorithms and power prediction models for co-executing workloads while considering the impact on power consumption as well as on other shared resources.

There are many other studies on oversubscription (Aceituno et al. 2021), where each CPU is used for the concurrent execution of multiple workloads. However, most of these studies do not assume a heterogeneous computing system consisting of different types of processors. On the other hand, studies on job scheduling and resource allocation for heterogeneous computing systems usually focus on whether a CPU or a GPU is used to execute each job (Alsubaihi and Gaudiot 2017), and those existing approaches cannot directly be applied to SX-AT, where a pseudo VE process runs on the VH to control the VE and shares the VH resources with VH workloads.

Several researchers have evaluated the performance of SX-AT and reported on various scientific applications (Egawa et al. 2020; Komatsu et al. 2018), VH-VE offload programming (Ke et al. 2021; Takizawa et al. 2021), and I/O performance (Sasaki et al. 2021). However, no report quantitatively evaluates the performance interference when VH and VE workloads coexist. We believe that we are the first to discuss the concurrent execution of VH and VE workloads based on quantitative performance evaluation results. In our previous paper (Nunokawa et al. 2022), we empirically investigated the relationship between the system call frequency and the performance interference between VH and VE workloads. Since it is impractical for a job scheduler to monitor the system call frequency of each job, this paper extends the previous one by additionally discussing a more practical way of predicting severe performance interference, i.e., using the VH's CPU load to approximate the system call frequency of a VE workload. As discussed in the previous section, the CPU load can properly approximate the system call frequency and hence reduce the risk of co-executing a pair of VH and VE workloads causing severe performance interference.

7 Concluding remarks

This paper has experimentally investigated the performance interference between a VH and a VE when each of the two processors executes a different workload. Based on the findings, a conflict-aware co-execution strategy has been proposed.

The evaluation results clearly demonstrate that the system call frequency of a workload is a good indicator for predicting whether the workload will affect the performance of another co-executing workload. It is also worth considering the number of cores in use, because performance interference could be mitigated if some VE cores remain unused when co-executing VH and VE workloads. These experimental results will be helpful for identifying a combination of workloads causing frequent resource conflicts, and thus for reducing the risk of performance interference between co-executing workloads on an SX-AT system.

The superiority of the proposed co-execution strategy has been demonstrated by calculating the turn-around time and makespan for randomly-generated workload sets. Although the system call frequency is a good indicator, monitoring system calls induces significant runtime overheads. Hence, the VH load is used to approximate the system call frequency. Based on the total VH load of VH and VE workloads, a conflicting workload pair can properly be identified. As a result, the proposed strategy can reduce the makespan without significantly increasing the turn-around time if the threshold value is reasonably configured.

This paper has demonstrated the feasibility of co-execution risk prediction by assuming that each workload is first executed alone once. To improve the prediction accuracy, however, it would be effective to run the same workload several times to gather statistical information. A workload must be executed alone whenever there is no co-executable workload, and thus the CPU loads of multiple runs can be recorded to improve the statistical accuracy. Moreover, even if a misprediction happens, it can be recorded to avoid resource conflicts in the future. More advanced management of such a performance database will be discussed in our future work. As discussed in Sect. 5, it would also be interesting to adopt a reinforcement learning model for automatically and adaptively tuning the threshold value of the proposed approach.