1 Introduction

The even distribution of work in a parallel system is a principal challenge for achieving optimal efficiency, as uneven distribution (load imbalance) leads to underutilized hardware. On the one hand, the application or its runtime needs to distribute work evenly across threads (or processes), e.g., by OpenMP loop scheduling. On the other hand, the operating system (OS) scheduler can amplify or mitigate load imbalances by scheduling application threads among cores/hardware threads. OS scheduling decisions are most impactful if there are more execution threads (including application threads and background tasks) than processor cores.

All supercomputers on the TOP500 list use Linux (footnote 1) and therefore its Completely Fair Scheduler (CFS) [13]. OS-level load balancing operations such as context switches and thread migrations can be costly and influence application performance. Moreover, application characteristics can also influence the OS scheduler actions regarding load balancing. The relation between OS-level and application thread-level scheduling and load balancing is key to increasing performance and utilization of today’s complex and extremely large supercomputers but has not yet been explored. In this work, we investigate this interaction and pursue answers to three specific research questions presented in Sect. 4.

We selected applications with different load imbalance characteristics to create distinct scenarios for the OS scheduler. The applications contain loops parallelized and scheduled with OpenMP. We modified the loop schedule clauses to use LB4OMP [12], a load balancing library that provides dynamic and adaptive loop self-scheduling techniques in addition to the standard options. To measure the Linux OS scheduler events during the execution of parallel applications, we use lo2s [11], a lightweight node-level tool that monitors applications and the OS.

This work makes the following contributions: 1) An in-depth investigation of the interaction between OS- and application-level scheduling and load balancing and its influence on system and application performance. 2) The exposure of opportunities for performance improvement by bridging OS- and application-level schedulers. Overall, the presented results pave the way for cooperation between these levels of scheduling.

This work is organized as follows. Section 2 reviews the related literature while Sect. 3 contextualizes the scheduling considerations relevant to this work. The approach proposed to study the interaction between scheduling levels is described in Sect. 4. The experimental design and performance analysis are discussed in Sect. 5. The conclusions and directions for future work are presented in Sect. 6.

2 Related Work

Earlier work has exclusively studied the performance of different application thread-/process-level scheduling techniques [7, 12] or the relation between application thread-level and application process-level scheduling [8, 15, 16].

OS noise was investigated in different scenarios and with different experimentation strategies [1, 19, 20]. It was evaluated as a single “composed component” that consists of scheduling, kernel events, OS services, and maintenance tasks. In this work, we break this noise into its constituents and assess the influence of the Linux OS scheduler on the performance of multithreaded applications.

Wong et al. showed that local scheduling decisions in Linux with CFS are efficient [21]. Bouron et al. showed that the FreeBSD ULE scheduler achieves comparable performance to Linux CFS [6]. While CFS is in general efficient, Lozi et al. [14] show that its implementation is still updated when bugs are found.

In contrast to the above-cited literature, this work investigates the influence of the Linux CFS on the performance of several multithreaded applications and quantifies this interaction. Through selected specific kernels, benchmarks, and mini-apps that exhibit varied load imbalance characteristics (which we also modify by employing various OpenMP scheduling techniques) we observe their effects on the OS scheduler’s behavior.

3 Scheduling in Context

3.1 Linux OS Scheduling

To rebalance work between different CPUs (footnote 2), the Linux kernel uses scheduler domains, which “mimic[...] the physical hardware” [5]. Figure 1 shows the scheduler domains of the dual socket system used in the experiments. The domains include SMT (Symmetric Multi-Threading), MC (Multi-Core), NODE, and NUMA (twice). While SMT includes all logical CPUs of a single processor core, MC includes all cores with a common LLC (Last Level Cache), i.e., one core complex [2, Sect. 1.8.1]. NODE refers to all CPUs of a NUMA node. NUMA refers to properties of NUMA systems that are not matched by previous domains; in our example, the two NUMA domains represent the CPUs of one processor and all CPUs, respectively. Each domain also serves as a scheduling group in the higher level domain, e.g., the four SMT domains are scheduling groups in the MC domain.

Fig. 1. Scheduler domains of a dual socket AMD Epyc 7502 system. Four cores are a core complex, two core complexes define a NUMA node.

Bligh et al. [5] also state that work rebalancing is triggered at regular intervals. This is done by stopper threads, which exist per CPU (footnote 3). The threads visit their scheduler domains and try to find more loaded scheduling groups from which they steal work and enqueue it in their local group. The rebalancing interval is typically between one and hundreds of milliseconds, depending on the scheduler domain, and typically grows with the number of CPUs included in the domain. The exact values for each domain are listed as min_interval and max_interval in the domains entries in Linux’s debugfs. Linux can also reschedule tasks (in our case OpenMP application threads) when they change state, e.g., when they switch between active (TASK_RUNNING) and idle (TASK_SUSPENDED). To save energy, stopper threads can be turned off during idle periods on tickless kernels. Here, a “NOHZ balancer core is responsible [...] to run the periodic load balancing routine for itself and on behalf of [...] idle cores” [14, Sect. 2.2.2].

3.2 Application Thread-Level Scheduling

Application thread-level scheduling is used to improve the execution of a given application. For example, load-imbalanced loops with static scheduling lead to idle application threads and therefore possibly underutilized CPUs, while dynamic scheduling leads to fewer idle application threads.

The applications considered in this work exhibit various load imbalance characteristics and we modify their parallel loops to use techniques from the LB4OMP library [12]. We select six scheduling techniques representing distinct classes: static (static), work-stealing (static_steal), dynamic self-scheduling (SS, FAC2, GSS), and adaptive self-scheduling (AF).
1) static: straightforward compile-time parallelization as per the OpenMP standard; smallest to no scheduling overhead and high data locality, no load rebalancing during execution.
2) static_steal: LLVM’s implementation of static scheduling with work stealing; steal operations are costly and incur a loss of data locality.
3) dynamic,1 (SS) [17]: OpenMP standard-compliant. Each work request is assigned one loop iteration; best load balance at the cost of considerable scheduling overhead and severe loss of data locality.
4) guided (GSS) [18] and 5) FAC2 [10] are dynamic and non-adaptive scheduling techniques. Both assign large chunks of iterations early in the loop execution and gradually smaller chunks later, achieving load balancing at low scheduling overhead.
6) adaptive factoring (AF) [4] is a dynamic and adaptive scheduling technique. AF collects information about currently executing loop iterations to adapt the next chunk size accordingly and achieves load balancing at a possibly considerable scheduling overhead and loss of data locality.
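To illustrate how the non-adaptive techniques differ, the following minimal Python sketch lists the chunk sizes handed out for a loop of N iterations on P threads. It assumes the textbook formulas: one iteration per request for SS, ceil(R/P) for GSS, and batches of P chunks of size ceil(R/(2P)) for FAC2, where R is the number of remaining iterations. The function name chunk_sizes is hypothetical and the sketch is not the LB4OMP implementation; static_steal and AF are left out because their behavior depends on runtime measurements.

```python
import math


def chunk_sizes(n_iters, n_threads, technique):
    """Return the chunk sizes handed out for one loop (illustrative only)."""
    if technique == "static":
        # One contiguous block per thread, fixed before the loop executes.
        base, extra = divmod(n_iters, n_threads)
        sizes = [base + (1 if t < extra else 0) for t in range(n_threads)]
        return [s for s in sizes if s > 0]

    chunks = []
    remaining = n_iters
    while remaining > 0:
        if technique == "SS":
            size, batch = 1, 1  # one iteration per work request
        elif technique == "GSS":
            size, batch = math.ceil(remaining / n_threads), 1
        elif technique == "FAC2":
            # Half of the remaining work, split evenly into P chunks.
            size, batch = math.ceil(remaining / (2 * n_threads)), n_threads
        else:
            raise ValueError(f"unsupported technique: {technique}")
        for _ in range(batch):
            if remaining == 0:
                break
            chunk = min(size, remaining)
            chunks.append(chunk)
            remaining -= chunk
    return chunks


if __name__ == "__main__":
    for name in ("static", "SS", "GSS", "FAC2"):
        sizes = chunk_sizes(100, 4, name)
        print(name, len(sizes), "chunks, first:", sizes[0])
```

Fewer chunks mean fewer scheduling operations, while smaller chunks late in the loop leave less work for the last thread to finish alone. This is the trade-off between scheduling overhead and load balance described above.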

4 Interaction Between OS and Application Scheduler

The next two subsections present our proposed methodology to answer the following research questions.

[Figure a: research questions RQ.1, RQ.2, and RQ.3]

4.1 Quantifying OS Scheduler Influence on Application Performance

Quantifying the OS influence on application performance and core utilization is challenging. For instance, an application may issue system calls to read a file from the disk; the OS will do a context switch and may schedule other tasks while the data is being read from the disk. In this case the application is not delayed while it waits for the I/O. In other cases, the OS scheduler may decide to migrate an application thread to another core. The OS scheduler often makes such decisions to balance the system cores either globally via thread migrations or locally via context switches. However, these scheduling decisions may ultimately lead to load imbalance at the application level, i.e., context switches and thread migrations cause performance variability between application threads.

To quantify the OS scheduler’s influence on a specific application, we monitor and capture all events that occur while the OS schedules the application. The captured events are classified into: application events A, context switches C, idle events D, or other O. This classification allows us to quantify and compare the influence of individual event types in various scenarios (see Sect. 5).

Table 1. Symbol definitions

Table 1 summarizes the notation and metrics employed in this work. In particular, f(X) represents the influence of a specific type of event as the ratio, expressed as a percentage, of the aggregated duration of all events of that type on all cores to the parallel cost of the application. Events also have an indirect influence that is observable but not explicitly measurable. For instance, frequent context switches create locality loss and delay certain application threads. Thus, the influence of these context switches goes beyond their duration. To infer such indirect influence, we measure the rate of a specific type of event. For instance, abnormally high context switch rates may explain performance degradation (see Sect. 5.3).
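The two metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the parallel cost is taken as \(T_{c} = P \cdot T_{par}\) (number of cores times parallel execution time), the rate r(X) is taken as the event count divided by the parallel execution time, and all function names and the sample numbers are hypothetical.

```python
def parallel_cost(t_par, n_cores):
    """Assumed definition: parallel execution time times number of cores."""
    return t_par * n_cores


def influence(durations_per_core, t_c):
    """f(X): aggregated duration of all events of one type on all cores,
    as a percentage of the parallel cost t_c."""
    total = sum(sum(core) for core in durations_per_core)
    return 100.0 * total / t_c


def rate(n_events, t_par):
    """r(X): events of one type per second of execution (assumed)."""
    return n_events / t_par


if __name__ == "__main__":
    # Hypothetical run: 2 cores, 10 s of parallel execution time.
    t_c = parallel_cost(10.0, 2)
    f_a = influence([[4.0, 5.0], [6.0]], t_c)  # application events A
    f_d = influence([[0.5], [3.5]], t_c)       # idle events D
    print(f"f(A) = {f_a}%, f(D) = {f_d}%, 1 - f(A) = {100.0 - f_a}%")
    print(f"r(C) = {rate(40, 10.0)} context switches per second")
```

In this made-up run, 25% of the parallel cost is not spent in application events, and most of that share is idle time.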

4.2 Recording Linux OS Scheduling Events

We record Linux OS scheduling events with the lo2s performance measurement tool [11]. lo2s leverages the Linux kernel to collect occurrences of hardware and software events including two special context switch records: one for switching away from a process and one for switching into the new current process. These records are generated at the beginning and end of the scheduler context_switch implementation in kernel/sched/core.c. lo2s reads the buffers when reaching a watermark or at the end of monitoring and writes them to OTF2 [9] files.

In this work, we use one buffer for each of the system’s CPUs and register the tracepoint event sched:sched_migrate_task, which is generated by the Linux scheduler implementation of set_task_cpu in kernel/sched/core.c. This event is caused by different sources, e.g., when a program calls specific syscalls, or when a stopper thread or the swapper rebalances work. Each occurrence of this tracepoint translates to an OTF2 metric event that includes numeric values of the involved process and CPU ids as well as the priority. The OTF2 representation of the context switch events gives complete information about when which thread was (de)scheduled on any CPU. The information is recorded as calling-context enter and leave events where each calling-context represents the process that was (de)scheduled. This includes a calling-context for idle with a pid of 0. The full lo2s command used in this work is lo2s -a -t sched:sched_migrate_task.

From the recorded data, we extract information on how often a context switch was performed and how much time the context_switch implementation took in total in between the monitoring records. We specifically filter the (de)scheduling of application threads to collect how often and for how long they were scheduled. The same is true for the time spent idle (idle OS threads).
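The extraction step can be sketched as follows. This is a simplified illustration, not the actual post-processing: it assumes the trace has already been decoded into plain tuples (timestamp, cpu, kind, pid) with kind being "enter" or "leave", and it attributes the gap between a leave and the next enter on the same CPU to the context switch itself. The record layout and the function names are hypothetical, and no OTF2 reader API is used.

```python
from collections import defaultdict


def classify(pid, app_pids):
    """Map a scheduled pid to A (application), D (idle, pid 0), or O (other)."""
    if pid == 0:
        return "D"
    return "A" if pid in app_pids else "O"


def aggregate(records, app_pids):
    """Sum scheduled time per event class and count context switches."""
    totals = defaultdict(float)
    switches = 0
    entered = {}     # cpu -> (enter timestamp, pid)
    last_leave = {}  # cpu -> timestamp of the most recent leave
    for ts, cpu, kind, pid in sorted(records, key=lambda r: r[0]):
        if kind == "enter":
            if cpu in last_leave:
                # Time between switching away and switching in.
                totals["C"] += ts - last_leave.pop(cpu)
                switches += 1
            entered[cpu] = (ts, pid)
        elif cpu in entered:
            start, start_pid = entered.pop(cpu)
            totals[classify(start_pid, app_pids)] += ts - start
            last_leave[cpu] = ts
    return dict(totals), switches


if __name__ == "__main__":
    trace = [
        (0.0, 0, "enter", 100), (2.0, 0, "leave", 100),  # application thread
        (2.5, 0, "enter", 0), (3.5, 0, "leave", 0),      # idle
        (3.75, 0, "enter", 7), (4.0, 0, "leave", 7),     # other task
    ]
    print(aggregate(trace, app_pids={100}))
```

Dividing the resulting totals by the parallel cost would give the relative shares f(A), f(C), f(D), and f(O).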

5 Performance Results and Discussion

We test the proposed methodology (see Sect. 4) through a performance analysis campaign and offer quantitative answers to the three research questions. For the experiments, we use an application, a benchmark, and a kernel, which we execute on two types of compute nodes (Table 2 and Sect. 5.1), operated by distinct Linux kernel versions. All codes are compiled with Intel compiler version 2021.6.0. We employ 6 different application thread-level scheduling techniques from the LB4OMP library (Sect. 3.2). The LB4OMP library requires OpenMP and we use the LLVM OpenMP runtime library version 8.0. For all measurements regarding OS and application events, we use lo2s version v1.6.0 (Sect. 4.2). Each experiment configuration was repeated 5 times. The average c.o.v. of all measurements considering parallel execution time was 0.0133. The highest c.o.v. appears for NAS-BT.C executing on conway, active wait policy, notPin, and static scheduling technique: 0.0937. The majority of all other measurements are below 0.02 c.o.v.

5.1 Applications

Depending on their characteristics (memory-, compute-, and/or I/O-bound), applications may drive the OS scheduler to perform more scheduling operations (context switches, thread migration) than others. OS threads executing load-imbalanced applications will naturally idle more often than when executing well-balanced applications. These idle times can be exploited by the OS scheduler. Preemption or migration of threads can also decrease data locality. To investigate these aspects, we focus mainly on compute-bound applications with different characteristics regarding their load imbalance behavior.

Calculating the Mandelbrot set is a highly compute-bound kernel. We implement this kernel in a time-stepping fashion. The code comprises an outer loop enclosing three other loops that, when scheduled with static, present constant, increasing, and decreasing load imbalance characteristics across time steps. NAS-BT.C is a block tridiagonal solver for synthetic systems of nonlinear partial differential equations, part of the NAS OpenMP parallel benchmarks version 3.4.2 [3]. When NAS-BT.C is scheduled with static, it shows low load imbalance. SPH-EXA (footnote 4) is a Smoothed Particle Hydrodynamics (SPH) simulation framework. In this work, SPH-EXA simulates a Sedov blast wave explosion. This application exhibits both memory- and compute-bound characteristics and, when executed with static, results in mild load imbalance.

These applications comprise loops and employ OpenMP loop parallelization and scheduling to exploit hardware parallelism on shared memory systems. We modified their most time-consuming loops to use the OpenMP schedule(runtime) clause to call different scheduling techniques from LB4OMP  [12]. Specifically, all OpenMP loops in Mandelbrot were changed from no schedule clause (which defaults to static) to schedule(runtime). In NAS-BT.C, 12 out of 28 loops use the NOWAIT loop clause. The scheduling of these loops cannot be changed from the current static scheduling as the correctness in NAS-BT.C depends on those loops’ iterations executing in the predefined order. Therefore, we only modified the 3 most time-consuming loops: x_solve, y_solve, z_solve. For SPH-EXA, the 4 most time-consuming loops: IAD, findPeersMac, momentumAndEnergy, and updateSmoothingLength were modified, out of 16 OpenMP loops.

5.2 Design of Experiments

The experimental evaluation of the proposed methodology requires designing and performing experiments with several factors, each with multiple values. This yields a set of 960 factorial experiments, summarized in Table 2. N denotes the number of iterations of an application’s loop that we will schedule, S the number of time-steps, and P the number of system cores. In the following sections, we explore the interaction between OS-level and application thread-level scheduling using the metrics described in Sect. 4.1 and Table 1.

Table 2. Design of factorial experiments (960 experiments in total)
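As a plausibility check on the total in the caption, the following Python sketch enumerates one full factorial design built from the factors named in the text. The factor levels are an assumption: three exclusive workloads plus one collocated pair, six scheduling techniques, two wait policies, two pinning strategies, two systems, and five repetitions. The actual levels in Table 2 may be organized differently.

```python
from itertools import product

# Assumed factor levels, taken from the names used in the text.
workloads = ["Mandelbrot", "NAS-BT.C", "SPH-EXA", "Mandelbrot+NAS-BT.C"]
techniques = ["static", "static_steal", "SS", "GSS", "FAC2", "AF"]
wait_policies = ["passive", "active"]
pinning = ["Pin", "notPin"]
systems = ["ariel", "conway"]
repetitions = range(5)

experiments = list(
    product(workloads, techniques, wait_policies, pinning, systems, repetitions)
)

if __name__ == "__main__":
    print(len(experiments))  # 4 * 6 * 2 * 2 * 2 * 5
```

Under these assumptions, the enumeration contains 960 runs, which agrees with the total stated in the caption of Table 2.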

5.3 Influence of OS Scheduling Events on Application Performance

Here, we decouple OS- from application-related events to answer RQ.1. This allows us to investigate the direct impact of different OS scheduler-related operations on the parallel cost. Figure 2 shows several heat maps which represent the influence of OS scheduler-related events on the parallel cost of the different applications/configurations executing on the two systems. That is, the color of the cells represents \(1 - f(A)\).

The darker the shade of a cell, the larger the influence of non-application events on the parallel cost. Information about the cells’ annotations and the x and y axes can be found in the caption of Fig. 2. The title of each heat map identifies the application and system, and recalls the key characteristics of the application. The application thread-level scheduling techniques are ordered along the x axis according to their scheduling overhead [12], from the lowest (static) to the highest (SS).

In Fig. 2, one can observe that the largest OS influence is due to the time spent idle during the execution (5th row of annotations on the cells of the heat maps). Only results for experiments configured with the passive wait policy are shown in Fig. 2. The results for the active wait policy were omitted, as they show the same behavior for all applications and systems, with the influence of non-application events on the parallel cost close to \(0\%\). This is due to application threads never being allowed to become idle (they persist in busy wait at the end of the OpenMP loops), which prevents the OS from freeing cores. This phenomenon makes the idle times practically disappear, and other events, such as context switches, are significantly reduced.

The pinning strategies reveal a more general behavior. Unpinned executions increase the amount of time spent idle during the execution of the applications, indicating that the OS-level load balancing attempts (via thread migrations) end up increasing system idleness. One can observe that for the ariel system, the performance impact of not pinning application threads is lower than on conway. Since conway has almost twice as many cores as ariel, it is more likely that the OS load balancing operations (performed to preserve OS-level load balance) create short delays on application threads, which can induce or increase application-level load imbalance and thereby increase the amount of time spent idle during the applications’ execution. We discuss this phenomenon further in Sect. 5.4.

Fig. 2. Influence of OS scheduler-related events on the parallel cost of different applications/configurations executing on the two systems. The x axis shows the application thread-level scheduling techniques, while the y axis identifies the different configurations. The heat bar presents \(1 - f(A)\), which determines the heat map cells’ colors. The annotations of each cell show: 1st line, the relative time spent in application events f(A); 2nd line, the parallel cost \(T_{c}\); 3rd line, the application related events time \(t(A_i)\); 4th line, the relative time spent in context switch operations f(C); 5th line, the relative time spent in idle events f(D); 6th line, the relative time spent in other events f(O).

From Fig. 2, Mandelbrot executions with static scheduling show a large percentage of time spent idle during the executions (ariel Pin 8.23%, notPin 8.50% | conway Pin 9.04%, notPin 18.62%). This happens as the kernel itself is highly load imbalanced which creates several opportunities for the OS to schedule other tasks or, in this case, turn the cores idle. The load imbalance in Mandelbrot greatly affects the performance of the kernel. This can be noticed as all application thread-level scheduling techniques outperform static by improving application-level load balancing and also indirectly improving OS-level load balancing by reducing the time spent idle on the cores.

NAS-BT.C (see Fig. 2) makes for an interesting case as it starts with 19 short loops interconnected by the NOWAIT loop clause. These loops become slightly desynchronized during execution, which accumulates until the end of the loop barrier in every time step. This causes a significant amount of idle time. Although this desynchronization can create a considerable amount of time spent idle when the application is executed with static on all loops (ariel Pin 4.11%, notPin 5.75% | conway Pin 15.68%, notPin 21.09%), it does not translate into considerable application performance loss.

For SPH-EXA, one can observe that the direct influence on parallel cost from context switches f(C) and other events f(O) is very low (smaller than 1%) for all experiments. SPH-EXA (see Fig. 2) loops are executed numerous times within each time step. This makes the application threads encounter the OpenMP end loop barriers also numerous times for each time step, creating very frequent and short idle periods.

Fig. 3. Context switches per second, r(C), for SPH-EXA executing on both systems ariel and conway. The x axis shows the different scheduling techniques and the y axis identifies the different configurations. The heat bar shows the context switch rate, which determines the cells’ colors. The annotations in each cell show: 1st line, the actual context switches per second, r(C), represented by the cell’s color; 2nd line, the parallel cost \(T_{c}\). (Color figure online)

We experiment with SPH-EXA to demonstrate the indirect influence that excessive context switches can have on the performance of applications. Figure 3 shows the number of context switches per second, r(C), when SPH-EXA executed on both systems ariel and conway. Figure 3 includes the results for wait policy active to highlight the excessive context switches that were performed when SPH-EXA was executed with passive wait policy.

Using the frequency of context switches, one can infer their indirect influence on the performance of SPH-EXA. In Fig. 3, the executions configured with wait policy passive, on both systems, show more than \(10 \times \) as many context switches per second as those with wait policy active. This indirectly affects the performance of SPH-EXA, as not only the direct cost of context switches increases but also thread wake-up times and the loss of data locality. One can notice that by forcing the application threads to stay in a busy wait state at the end of the OpenMP loops (wait policy active), the OS does not preempt the threads as frequently, which lowers the context switches per second (Fig. 3) and the amount of time spent idle (Fig. 2).

Without wait policy active, the OS scheduler does not know the characteristics of the application being executed and keeps trying to free the cores that contain idle application threads even if the idle time is extremely short and rather frequent. Although wait policy active works as a solution for this case, it will prevent the OS from exploiting idle times, which is not ideal for collocation. A solution to this problem can be coordination and information exchange between application thread-level and OS-level scheduling where the application scheduler signals the OS scheduler that a few threads are in a busy wait state. Also, the OS can be notified that the application is executing exclusively. This would allow the OS to make more informed decisions. A common case for HPC systems is exclusive execution where, instead of making the cores idle, the OS could keep the application threads scheduled and only preempt them when there is something that is actually ready to execute.

The results in Fig. 2 and Fig. 3 allow an answer to RQ.1. The influence of OS scheduling events on the performance of applications manifests in different ways, depending on the application characteristics. Extremely compute-bound and load-imbalanced applications, such as Mandelbrot, can be significantly affected by OS scheduling events when load balancing at the OS level is allowed (unpinned threads). Results with unpinned threads show that the OS balancing the load across cores ends up aggravating load imbalance in NAS-BT.C and Mandelbrot, which increases idle time and directly increases the parallel cost of the application (see Fig. 2). Finally, applications with very frequent and short loops, such as SPH-EXA, can end up triggering the OS to perform overly frequent context switches to keep the system state updated with idle cores, which creates a significant overhead on the application execution.

5.4 Interaction Between OS- and Application-Level Scheduling

In this section, we first compare system level load imbalance and application performance for Mandelbrot to answer RQ.2 (Fig. 4).

To calculate the c.o.v. shown in Fig. 4, we consider each system core individually and measure the time spent in application-related events on each core, \(t(A_i)\). This time is then used to calculate the c.o.v. of the system cores considering only application-related events (see c.o.v. in Table 1). The active wait policy results were omitted from the figure as they always show a c.o.v. close to zero due to threads practically never becoming idle.
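The coefficient of variation over the per-core application times can be sketched as below. This is a minimal illustration assuming the usual definition (standard deviation divided by mean) and the population standard deviation; whether the measurements use the population or the sample form is not stated here, and the function name and sample values are hypothetical.

```python
import statistics


def cov(app_time_per_core):
    """c.o.v. of the per-core application times t(A_i): stddev / mean."""
    mean = statistics.fmean(app_time_per_core)
    return statistics.pstdev(app_time_per_core) / mean


if __name__ == "__main__":
    balanced = [10.0, 10.0, 10.0, 10.0]  # every core equally busy
    imbalanced = [5.0, 15.0]             # one core does three times the work
    print(cov(balanced), cov(imbalanced))
```

A value close to zero means that all cores spent nearly the same time in application events, while larger values indicate a higher system-level load imbalance.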

For Mandelbrot with pinned threads (Pin) in Fig. 4, executed with static, the application load imbalance directly translates to load imbalance across the system cores as the OS is not allowed to migrate application threads. Furthermore, all dynamic application thread-level scheduling techniques achieved similar performance and balanced the execution of Mandelbrot, achieving a c.o.v. close to zero. This means that by balancing the execution of pinned threads, the load on the system cores is directly balanced too.

The results for Mandelbrot with unpinned threads (notPin) in Fig. 4 allow an answer to RQ.2. During execution of Mandelbrot with static, the OS reduced system cores load imbalance by migrating overloaded threads to idle cores. However, the additional context switches and thread migration operations lowered Mandelbrot’s performance in comparison to executions with pinned threads. For example, the c.o.v. of the conway system with pinned threads (Pin) was 0.0824 and the parallel cost was 2756.48 s while the c.o.v. with unpinned threads (notPin) was 0.0468 and the parallel cost was 3071.69 s. In Sect. 5.3, fourth row of the heat maps at the top of Fig. 2, one can observe that the additional operations required to improve the system cores load balance end up increasing the amount of time spent idle during the execution.

Fig. 4. System cores load imbalance using the c.o.v. metric. The x axis shows the different scheduling techniques at application thread-level while the y axis identifies the configurations. The heat bar shows the c.o.v. for the heat maps. Cells with a red shade show higher c.o.v. than cells with a blue shade. Higher c.o.v. indicates a higher system level load imbalance. The annotations show: 1st line, the actual c.o.v. shown by the cell color; 2nd line, the parallel cost (\(T_{c}\)). (Color figure online)

One can observe in Fig. 4 that when the OS performs load balancing across cores (notPin) and the application (Mandelbrot) also performs load balancing with dynamic scheduling techniques, the resulting c.o.v. is higher than when the OS does not interfere (Pin). This indicates that simultaneous load balancing operations both in the application thread-level and the OS-level schedulers result in application performance loss and a higher system cores load imbalance (higher c.o.v.).

This phenomenon is explained by the fact that the OS scheduler is designed to achieve fairness for multiple applications which compete for resources. In contrast, application thread-level scheduling is a cooperative approach for all threads executing on resources with the common objective to keep the execution flow balanced and complete the work as fast as possible. The key issue here is that on most HPC systems, nodes are allocated exclusively, with only one application executing at any time on each node. This decreases the fairness requirement, making OS-level scheduling less competitive and increasing the need for it to be more cooperative. One must consider modifications to the Linux OS scheduler for HPC systems that allow the OS scheduler to receive information from other levels of scheduling, which would help avoid focusing on fairness when it is not needed.

To answer RQ.3, we evaluate whether the OS exploits system idleness to collocate Mandelbrot and NAS-BT.C on the nodes. We selected these applications as they show the highest (Mandelbrot) and lowest (NAS-BT.C) c.o.v. among all applications.

Figure 5 presents the influence of OS scheduler-related events on the parallel cost of Mandelbrot and NAS-BT.C when they execute concurrently. It shows that for both systems and Pin/notPin strategies, when the applications were executed concurrently and scheduled with static, the amount of idle time in the system decreased compared to the same configurations when the applications were executed exclusively (as shown in Fig. 2 results for Mandelbrot and NAS-BT.C, with static). This shows that the OS benefited from at least a portion of the idleness generated by Mandelbrot to schedule NAS-BT.C and vice versa (RQ.3).

Fig. 5. Influence of OS scheduler-related events on the parallel cost of Mandelbrot and NAS-BT.C when executing concurrently on the two systems. For these experiments, we consider application events as any event that is related to any of the two applications. The axes and annotations to the cells of the heat maps follow the same pattern as Fig. 2.

To confirm that the OS efficiently exploits system idleness (RQ.3), we analyze the executions with unpinned application threads (notPin), as the OS balances such threads across cores. As the OS can move threads, it should migrate NAS-BT.C threads to cores where the threads from Mandelbrot are underloaded. This is confirmed by the results in Fig. 5, which show that, with the exception of static, the executions with unpinned threads outperform executions with pinned threads on both systems. For example, the parallel cost of the applications executed on ariel with the GSS scheduling technique and pinned (Pin) threads was 4739.64 s, while for free threads (notPin), it was 4646.52 s. This confirms that the OS exploits system idleness (RQ.3) and shows that when there is competition in the system, performing load balancing at both OS- and application thread-level is advantageous (RQ.2).

6 Conclusion

This work investigates the interaction between OS-level and application thread-level scheduling to explain and quantify their precise roles in application and system performance. We distinguish OS-related events from application-related events and propose metrics to quantify the interaction between OS-level and application thread-level scheduling strategies and decisions. Through an extensive performance analysis campaign, we show that the interaction between OS and application scheduling significantly influences system load balance and application performance. We also expose collaboration points between OS- and application thread-level scheduling that can be leveraged to improve performance and load balancing decisions at the OS-level scheduling.

Future work will consider memory-bound applications and tuning of the Linux kernel parameters. Modifying the kernel to receive information about application scheduling decisions will also help coordinate scheduling and load balancing decisions at the OS level.