1 Introduction

High-performance computing (HPC) is essential for a variety of fields, including data analytics, scientific research, and complex computer simulations. Thorough monitoring and analysis are required to optimize cluster performance, allocate resources wisely, and manage workloads productively [1].

A high-performance computing grid is a distributed computing infrastructure that links computing resources and exploits their combined capacity to tackle complex problems. These resources may include clusters, supercomputers, servers, and even individual workstations. Because HPC grids are built to perform large-scale parallel processing jobs, a wide range of scientific, engineering, and research applications can benefit from them [2].

Resource management is of paramount importance in high-performance computing grid environments due to several critical factors that influence the efficiency, reliability, and overall success of computational tasks [3]. Efficient resource management ensures that computing resources, including processors, memory, storage, and network bandwidth, are used optimally. This optimization leads to better performance, faster job completion, and reduced idle time for hardware components. Resource management helps balance workloads across the HPC grid system, preventing individual nodes or clusters from becoming overloaded while others remain underutilized [4, 5]. This balance is crucial to maintaining the stability of the system and maximizing overall throughput.

PBS Pro and Slurm are both job scheduling and resource management systems commonly used in high-performance computing (HPC) environments, and both support multi-cluster configurations or HPC grids. Although they serve similar purposes, they have different architectures and approaches.

PBS supports peer scheduling between multiple PBS clusters. This means that PBS clusters can communicate with each other directly to schedule jobs [6]. Administrators can also set up PBS to schedule jobs to particular clusters according to their resource needs by using PBS Multi-sched partitions. Moreover, it is possible to configure PBS to route jobs to particular clusters according to their job type or other criteria using qlists and PBS routing queues [7].

Conversely, Slurm facilitates multi-cluster scheduling [8, 9] by connecting multiple clusters to a common database, or by having the databases of the different clusters communicate with each other. Additionally, Slurm clusters can merge into a single logical cluster using federation [10], which behaves like multi-cluster mode but with coherent job IDs across all clusters. The federation can then receive job submissions from users, just like a single cluster would. However, federation allows job arrays to run only on the origin cluster (i.e., the cluster to which jobs were submitted). Thus, we consider only Slurm multi-cluster, not federation, in this paper. Another feature, introduced by the authors of [11], is "hierarchy," which enables administrators to organize Slurm clusters hierarchically. This can be helpful when overseeing sizable clusters with several administrative tiers. "Partitions," another feature that Slurm provides [12], enable administrators to combine resources and designate them for particular users or groups. This can be helpful when managing resources on multi-cluster systems.

There are several benchmark tools available for assessing the grid performance of resource management tools, such as the NAS Grid Benchmarks (NGB) [13, 14] and GridBench [15]. NASA created the NAS (NASA Advanced Supercomputing) Grid Benchmarks (NGB) to assess the performance of HPC grid systems. In the HPC field, these benchmarks are frequently used to evaluate and contrast the computational capacities of various systems. They are designed to mimic the amount of computing required for different engineering and scientific applications [13, 14].

Another benchmark tool is known as GridBench, and it consists of a set of benchmark tests that simulate typical grid workloads. These benchmarks are intended to approximate performance metrics for different Grid setups, identify key elements influencing the overall performance of applications, and provide application developers with preliminary estimates of the expected application performance [15].

The recent white paper in [16] highlights PBS Pro, part of the Altair HPCWorks platform, as a superior solution to open-source job schedulers. Australia’s National Computational Infrastructure (NCI) chose PBS Pro, which demonstrated superior flexibility and reliability as a workload manager, providing confidence and comfort for a long-term partnership [17]. After over 20 years with the custom workload manager Cobalt, Argonne National Laboratory switched to PBS Pro due to its robustness, scalability, and commercial support [18]. Kyoto University acknowledged the significance of the highly customizable nature of PBS Pro as a crucial feature that enhances productivity in cluster management [19]. PUNCH Torino recognized that by transitioning to PBS Pro, the resulting infrastructure not only facilitates the efficient completion of critical engineering tasks but also liberates their team from the challenges associated with managing an on-premises data center, providing a streamlined and headache-free operational environment [20]. Additionally, the National Supercomputing Center (NSCC) in Singapore acknowledges that PBS Pro caters to both current requirements and future demands, simplifying tasks such as job submission and management and facilitating secure data management along with remote 3D visualization [21]. The Australian Bureau of Meteorology chose PBS Pro because it takes into consideration the criticality of a whole-system outage [22].

Recently, we built the Egyptian National HPC Grid (EN-HPCG), which is considered the first national implementation of the grid concept rather than the cluster model [23]. The primary goal of EN-HPCG is to unify High-Performance Computing (HPC) resources in Egypt, establishing a national, intelligent, and diverse HPC grid. This grid aims to connect various national research institutes and university-based HPC facilities. The current consortium involves three key participants: the Informatics Research Institute (IRI) at the City for Scientific Research and Technological Applications (SRTA-City), the Faculty of Post-graduate Studies for Nanotechnology at Cairo University, Sheikh Zayed Branch, and the Faculty of Science at Ain Shams University (ASU), under the supervision of the Academy of Scientific Research and Technology (ASRT). In this phase of the EN-HPCG grid, we utilize PBS Pro to leverage its benefits, including stable 24/7 support, the user portal, the admin portal, and a strong history of peer scheduling features.

Therefore, in this paper, we primarily evaluate popular resource management solutions, namely Slurm and PBS Pro, using the NGB benchmark on the Egyptian National HPC Grid (EN-HPCG). This study guides our decision to transition to the open-source Slurm workload management system in our EN-HPCG grid, aiming to minimize the cost of EN-HPCG sustainability. The paper is organized as follows: Sect. 2 illustrates the related work on using multi-clusters, Sect. 3 presents the resource management systems Slurm and PBS Pro, Sect. 4 describes the experimental setup, Sect. 5 illustrates the evaluation experiments and results, and Sect. 6 concludes the paper and presents future work.

Table 1 presents a comprehensive list of the abbreviations and terminologies used throughout this paper.

Table 1 List of abbreviations and terminologies

2 Related work

Many years ago, researchers delving into grid middleware laid the foundation for the sophisticated layer of software that today plays a pivotal role in managing the intricacies of computational grids. Managing a computational grid, which inherently comprises machines with various characteristics, poses a multifaceted challenge. Executing a job on a computational grid requires tasks such as establishing machine profiles, identifying allocated resources, evaluating specific work requirements, segmenting the job, and distributing the workload based on the available nodes and resources. To alleviate application developers from the intricacies associated with these actions, computational grids incorporate a software layer designed to conceal the complexities of this heterogeneous environment. Within this layer, various protocols and functions are imperative, providing support for various elements of the grid and facilitating adaptation to different operating systems, file systems, and communication protocols [24]. The middleware assumes a central role in computational grids, functioning across service, resource, and connectivity layers [25]. Comprising a synthesis of protocols, services, APIs, and SDKs [26], the middleware serves as a critical component that masks the hardware intricacies and communication complexities inherent in a grid environment. Leveraging a computational grid for diverse purposes becomes feasible only through the middleware layer’s ability to obscure these underlying complexities. Noteworthy among the prominent projects dedicated to middleware development for grids are:

  • Globus [27]: The Globus project, managed by the “Globus Alliance,” features collaborative efforts involving institutions like the Argonne National Laboratory and the Information Sciences Institute. This open-source toolkit serves as a facilitator for constructing computational grids and grid-based applications. Its capabilities extend beyond corporate, institutional, and geographic boundaries, ensuring seamless collaboration while preserving local autonomy.

  • Unicore [28]: The UNICORE project, supported by financial backing from the German Ministry of Education and Research and in partnership with entities such as ZAM and Deutcher, establishes a Java-based grid computing environment. This system ensures uninterrupted and secure entry into distributed resources, promoting smooth integration and efficient utilization.

  • Boinc [29]: BOINC, developed at the University of California, Berkeley, is designed to support research projects such as SETI@home, focusing on the search for extraterrestrial intelligence. This open-source platform is utilized in scientific endeavors spanning diverse fields, including astrophysics, chemistry, molecular biology, medicine, and climatology, harnessing the computational power of personal computers.

  • HTCondor [30]: HTCondor, also known as Condor, originated at the University of Wisconsin-Madison with the initial purpose of evaluating the advantages of intensive computing in campus research. This system excels in handling tasks related to computationally intensive or high-throughput computing (HTC). It is specifically designed to accommodate large-scale, fault-tolerant processing tasks that run for extended durations, spanning weeks to months.

Various articles and studies addressed optimization techniques and advancements in grid computing. The authors of [31] focused on optimizing Aurora middleware to reduce the computing environment overhead. Subsequent studies explored various aspects of grid computing, including QoS treatment in [32] and a multi-agent-based peer-to-peer network proposal in [33].

Further contributions to the enhancement of grid architecture were found in studies such as the development of an oriented grid for high-performance computing applications in [34] and improvements to the Globus middleware to increase throughput in bioinformatics applications in [35].

Various proposals and frameworks were presented, ranging from resource management in bio-grids [36] to integration of mobile computing with the grid in [37], interaction with running jobs in [38], and the application of peer-to-peer approaches in volunteer computing platforms in [39, 40]. Various articles covered different topics, including a new architecture for computational grids in [41], a comparison of grid and cloud computing in [42], web performance enhancement using the grid in [43], and a comparison of atmospheric data analysis models in cluster and computational grid environments in [44]. Middleware like Agent Team Works was detailed in [45], while the use of Hadoop in computational grids for a smart marketing model was proposed in [46].

Addressing the challenges posed by the complexity of developing grid applications and the limitations on the types of resources that can be discovered when jobs are used directly, the grid initiative introduced several technologies. Therefore, many alternatives were explored by the scientific and industrial communities to simulate multi-cluster concepts. Interactive methods were employed with the aim of creating a programming interface capable of defining both the computation process and the interaction with distributed resources. Most of the programming models used in this field and applied within user programs were decoupled from the work distribution process by the manual submission of job description files to a batch system scheduler such as HTCondor [47], Slurm [48], or PBS [7].

The authors of [49] described the Coffea-casa prototype analysis facility at the University of Nebraska-Lincoln in the United States. This facility used Dask for computation distribution. It incorporated dedicated resources allotted through FairShare using an HTCondor scheduler and was built on top of a local Kubernetes cluster. Another analysis facility prototype [50] was developed at Fermilab under the label "Elastic Analysis Facility." The analysis facility implementation presented in this study aimed to integrate the various geographic clusters at INFN with a novel scheduler-client connection system, following the same general direction as the prototypes just discussed.

Numerous examples of HPC grids at universities and scientific institutions across various locations can be found in non-federated approaches, such as [51,52,53,54,55], and in federated approaches, such as [56]. The existing literature highlights numerous challenges in constructing a robust architecture for the HPC grid. These challenges encompass aspects like heterogeneity, programmability, scalability, and the interoperability and coupling of high-performance architectures or networks of computing nodes. Hence, in this paper, as a first step, we focus our evaluation of the EN-HPCG grid on widely used homogeneous resource management solutions, namely Slurm and PBS Pro, employing the NGB benchmark. In future work, we will extend our focus to address additional challenges, including the integration of heterogeneous resource management, throughput improvement, and the implementation of smart job allocation strategies.

3 Resource management systems (Slurm and PBS Pro)

3.1 Slurm

Slurm, which stands for “Simple Linux Utility for Resource Management,” is an open-source cluster management and job scheduling system. It is widely used in high-performance computing (HPC) environments to allocate and manage computing resources such as CPU cores, memory, and GPUs efficiently across a cluster of interconnected computers.

Slurm provides a flexible and scalable framework for managing jobs and workflows on HPC systems. It allows users to submit and schedule jobs, monitor their progress, and control resource allocation. Slurm supports a variety of job types, including batch jobs, interactive jobs, and parallel jobs, making it suitable for a wide range of scientific and computational workloads.

One of the key features of Slurm is its ability to handle complex job dependencies and priorities. It supports job dependencies to ensure that certain jobs are executed only after their prerequisite jobs have been completed successfully. Additionally, Slurm allows users to specify job priorities, enabling important or time-critical jobs to be allocated resources ahead of lower-priority jobs.
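As a concrete illustration of job dependencies, the following Python sketch drives sbatch from a workflow script using the standard --parsable and --dependency=afterok options; the script names preprocess.sh and solve.sh are hypothetical placeholders, not artifacts of our testbed.

```python
import subprocess

def submit(script, *extra_args):
    """Submit a batch script with sbatch and return the numeric job ID.
    --parsable makes sbatch print only the job ID (and cluster, if any)."""
    out = subprocess.run(
        ["sbatch", "--parsable", *extra_args, script],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip().split(";")[0]

# Hypothetical scripts: solve.sh starts only after preprocess.sh succeeds.
pre_id = submit("preprocess.sh")
solve_id = submit("solve.sh", f"--dependency=afterok:{pre_id}")
print(f"submitted {pre_id}, then {solve_id} (held until {pre_id} completes successfully)")
```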

Slurm provides a command-line interface (CLI) for users to interact with the system and perform various tasks, such as submitting jobs, querying job status, and managing resources. It also offers a web-based graphical user interface (GUI) called "sview" to visualize and monitor the activity of the cluster.

In addition to job management, Slurm provides extensive accounting and reporting capabilities, allowing administrators to track resource usage, generate usage reports and enforce resource limits. It supports authentication and authorization mechanisms to ensure secure access to the system and resource allocation.

Slurm is highly configurable and can be customized to meet the specific requirements of different HPC environments. It is widely adopted in academic and research institutions, government laboratories, and industry settings to manage large-scale computational clusters and supercomputers.

Overall, Slurm plays a crucial role in optimizing resource utilization, improving job throughput, and facilitating efficient management of HPC systems, making it a popular choice for organizations working with computationally intensive workloads.

3.2 PBS

The lineage of the Portable Batch System (PBS) scheduler family can be traced directly back to the Unix-based Network Queuing System (NQS), the inaugural batch scheduler developed with NASA funding. PBS is a workload manager and high-performance computing (HPC) task scheduler designed to effectively manage and optimize the distribution of computing resources in demanding computing settings. Altair Engineering, a multinational technology company focused on data analytics, high-performance computing, and simulation, produces the commercial solution [6].

PBS Pro provides a robust framework for job submission, scheduling, and resource management in HPC clusters, supercomputers, and cloud environments. It allows organizations to maximize the utilization of their computing resources while ensuring fair and efficient job execution [57]. PBS Pro has the following key features:

  • Policy-based scheduling: PBS Pro employs a policy-based scheduler that allows administrators to define various policies for job prioritization, fair-share scheduling, and resource allocation.

  • Scalability: The system is designed to scale efficiently, support large-scale parallel processing, and accommodate the computational demands of modern scientific and industrial applications.

  • Flexibility in job types: It supports a wide range of job types, including parallel, serial, array, and checkpointing jobs, making it suitable for diverse scientific and engineering workloads.

  • Advanced job management: PBS Pro includes advanced features such as job arrays, custom resource configurations, and checkpointing mechanisms to enhance job management capabilities.

  • Resource monitoring: The system provides real-time monitoring of resources, allowing administrators to track system usage, diagnose issues, and optimize resource allocation.

  • Extensible architecture: PBS Pro has a modular and extensible architecture, allowing organizations to tailor the system to their specific needs and integrate it with other tools and applications.

  • Multi-cluster support: PBS Pro enables organizations to create a unified computing environment from distributed resources by managing multiple clusters. It uses peer scheduling, where the available resources in a local cluster are used first, and any remaining pending jobs are sent to the available connected clusters for execution. Once a job finishes, it is rolled back to the local cluster.

3.3 Feature comparison

Table 2 illustrates a comparison of Slurm and Altair PBS Pro features. Although Slurm is an open-source scheduler, PBS Pro is superior in that it supports Windows operating systems and has a graphical user interface portal. Windows support could be beneficial for software that is only available for Windows; however, containers could be a solution for running Windows-based software. Altair Access facilitates the seamless submission, monitoring, and visualization of HPC jobs on remote clusters, cloud infrastructure, and other computational resources through a web-based viewer. It also offers ongoing support and comprehensive documentation, ensuring users have access to assistance and resources for optimal utilization of the platform [58]. In addition, PBS Pro allows users to submit their jobs based on job type; the routing queue automatically routes each job to the particular cluster that provides the required software.

Table 2 Features comparison among job schedulers (Inspired by [57])

4 Experiments

4.1 Benchmark implementation

In this paper, to assess our EN-HPCG testbed, we employed the NAS Grid Benchmark (NGB) serial version 3.1 [14], a derivative of the NAS Parallel Benchmarks (NPB). NPB is a widely recognized benchmark suite used to evaluate HPC clusters [64, 65]. We had to exhaust the grid resources to evaluate the grid performance, so we used the Embarrassingly Distributed (ED) kernel of the NGB benchmark, which represents the important category of grid applications that entails numerous independent executions of the same program with different input parameters. Such experiments are frequently carried out at NASA using the Scalar Penta-diagonal (SP) solver as the flow solver [66]. There is no communication between any of the ED tasks. The ED kernel has several classes based on problem size, S, W, A, B, C, D, and E, with sizes of (9 x 1), (9 x 1), (9 x 1), (18 x 1), (36 x 1), (72 x 1), and (144 x 1), respectively. Figure 1 shows the data flow graph of the ED benchmark [67].

Fig. 1
figure 1

Data flow graph of ED benchmark [67]
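To make the structure of the ED kernel concrete, the following minimal Python sketch launches a set of fully independent solver runs with no inter-task communication, mirroring the data flow in Fig. 1. The binary name sp.A.x follows the usual NPB build convention and is an assumption here; in the real NGB each task also receives slightly different input parameters, which is omitted from this sketch.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

NUM_TASKS = 144  # ED class E consists of 144 independent SP-solver tasks

def run_task(task_id: int) -> int:
    # Each ED task is a self-contained solver execution; tasks never
    # exchange data (the per-task input perturbation of NGB is omitted).
    result = subprocess.run(["./sp.A.x"], capture_output=True, text=True)
    return result.returncode

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        codes = list(pool.map(run_task, range(NUM_TASKS)))
    print(f"{codes.count(0)} of {NUM_TASKS} tasks finished successfully")
```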

4.2 Slurm testbed settings

In this paper, we dedicate a part of the entire EN-HPCG grid to evaluating the Slurm scheduler in comparison with PBS Pro, which is already used in the Egyptian grid [23]. The grid testbed consists of two clusters, one located at the City of Scientific Research and Technological Applications (SRTA) and the other located in the Faculty of Science at Ain Shams University (ASU). The cluster at SRTA is called slurmcluster2 and the cluster at ASU is called asuslurm; the two are connected in Slurm multi-cluster mode [68], as shown in Fig. 2. In multi-cluster mode, different clusters can communicate with one another, and a job submitted to one cluster can be executed on another cluster if convenient.

Fig. 2
figure 2

Architecture of slurm multi-cluster testbed

The head node of slurmcluster2 (headnode) hosts the slurmctld service to control that cluster, while the head node of asuslurm (asuslrhd) hosts the slurmctld service to control the asuslurm cluster, as well as the slurmd service so that it can also act as a compute node.

The headnodevm hosts the MariaDB database server and the slurmdbd service, which provides accounting for both the slurmcluster2 and asuslurm clusters. headnodevm also hosts the chronyd service for time synchronization between all nodes in both clusters. Time synchronization is important in Slurm multi-cluster mode because the choice of which cluster a submitted job is allocated to depends on which cluster provides the earliest start time for the job. The common services on headnodevm enable slurmcluster2 and asuslurm to communicate in both multi-cluster mode and federation mode (i.e., all clusters can be viewed as a single cluster, but a job can still be executed on only one cluster). Future work can involve using separate database servers for each cluster in multi-cluster mode. A FreeIPA server is used to create the same user(s) across all clusters. NFS directories are shared for each user across the nodes of each cluster to simplify sharing resources (e.g., executable scripts and submission files). User authentication is done using the same Munge keys, installed in a common directory path on each cluster, to enable communication between the different clusters. Future work may involve using separate Munge keys for inter- and intra-cluster communication.
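The following hedged sketch shows how a job could be submitted to this testbed in multi-cluster mode from Python: with the standard --clusters option, Slurm places the job on whichever listed cluster offers the earliest expected start time, and squeue can query both clusters at once. The wrapper script name sp_task.sh is a hypothetical placeholder.

```python
import subprocess

CLUSTERS = "slurmcluster2,asuslurm"  # the two testbed clusters (Fig. 2)

def submit_to_grid(script: str) -> str:
    """Submit one job in multi-cluster mode. With --clusters, Slurm places
    the job on whichever listed cluster offers the earliest expected start
    time; --parsable prints "<jobid>;<cluster>"."""
    out = subprocess.run(
        ["sbatch", "--parsable", f"--clusters={CLUSTERS}", script],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

job = submit_to_grid("sp_task.sh")  # sp_task.sh is a hypothetical wrapper
print("placed:", job)

# The queues of both clusters can be inspected with a single command.
subprocess.run(["squeue", f"--clusters={CLUSTERS}"], check=True)
```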

All clusters have Slurm 23.11.0-0rc1 installed from the source repository, which was the latest Slurm version at the time of writing.

Node specifications of the SRTA and ASU clusters are shown in Table 3.

Table 3 SRTA and ASU nodes specifications

4.2.1 Benchmarking sp.A on different computational Slurm nodes

We ran sp.A on different computational nodes at SRTA and ASU to obtain reasonably accurate estimates of the execution time and memory requirements of each run (without Slurm). The estimated values are shown in Table 4.

Table 4 Execution time and memory requirements of sp.A on different computational hosts

However, it turned out that concurrent multiple executions of sp.A jobs under Slurm increased each job’s execution time, as shown in Table 5 for the average, maximum, and minimum execution times. We attribute the increase in job execution time to job arrivals and completions, which trigger the Slurm scheduler. When the scheduler is triggered, it may interrupt the execution of the currently running jobs to find a suitable allocation for the newly received jobs. Furthermore, performance degradation can occur when processes run on the same multi-core CPU and share resources such as last-level caches, memory controllers, system request queues, and prefetch bandwidth. This degradation could be as high as 200%, relative to the execution of processes in isolation [69]. This could also explain the difference in execution time between physicalnode01 (\(\simeq\) 46 s) and physicalnode02 (\(\simeq\) 71 s), as shown in Table 5.

Table 5 Average, minimum and maximum execution time of sp.A on different computational Slurm nodes
Fig. 3
figure 3

Architecture of PBS testbed

4.3 PBS Pro testbed settings

To evaluate the performance of PBS Pro compared to Slurm, we used the same physical machines. The PBS Pro testbed consists of one cluster at SRTA (whose head node is called head-node) and another at ASU (whose head node is called asuhd). The grid system consists of six main components, as shown in Fig. 3:

  • Altair PBS professional [70]: It is employed for the supervision and coordination of grid resources, functioning as a remote resource management system within the grid context. Altair PBS Professional adeptly oversees high-performance computing (HPC) tasks and orchestrates HPC workloads across the entire grid computing infrastructure. The scalability of PBS Professional enables seamless support for systems of varying sizes, ranging from clusters to extensive supercomputers. This ensures optimal utilization of both hardware and software investments, maximizing the benefits for users.

  • Altair access [58]: Users can submit, monitor, and visualize high-performance computing (HPC) jobs on remote clusters, clouds, and other resources through a Web-based viewer. Additionally, Altair Control [61] provides administrators with the ability to control, monitor and manage the configuration environment of an HPC cluster using a Web application.

  • Shared NFS storage: In the grid, each cluster has access to two NFS storage options: local and shared. The “home” directory is utilized by local users of the cluster to store their personal files. On the other hand, the "hpcshared" directory serves the purpose of storing and sharing all grid-installed software packages, making them accessible to all clusters within the grid. For optimal job execution speed, it is advisable to run applications from the home directory instead of the “hpcshared” directory.

  • FreeIPA DB system: FreeIPA serves as a comprehensive solution for globally managing users’ accounts and groups [71]. It provides straightforward installation and user-friendly command lines and web-based management tools, empowering Linux administrators to oversee the identification, authentication, and access control aspects of Linux users’ accounts centrally.

  • Peer scheduler: The peer scheduling feature is activated when users submit jobs to a busy or fully utilized cluster, automatically redirecting the jobs to another available cluster within the grid.

We selected nodes from the PBS Pro grid with the same specifications as the Slurm testbed to compare the results. The PBS testbed specifications of the SRTA and ASU clusters are shown in Table 6.

Table 6 SRTA and ASU nodes specifications of PBS Pro testbed

4.3.1 Benchmarking sp.A on different computational nodes

We ran sp.A again on different PBS Pro computational nodes at SRTA and ASU using the PBS Pro testbed to obtain reasonably accurate estimates of the execution times and memory requirements of each run (without PBS Pro). The estimated values are shown in Table 7. Then, we ran concurrent multiple executions of sp.A jobs under PBS Pro. Table 8 illustrates the average, maximum, and minimum execution times.
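For comparison with the Slurm sketch above, the independent sp.A tasks could be submitted to the PBS Pro testbed along the following lines, using the standard qsub chunk request select=1:ncpus=1 for a single-core job; the wrapper script name sp_task.sh is again a hypothetical placeholder.

```python
import subprocess

def submit_pbs(script: str) -> str:
    """Submit one single-core job to PBS Pro and return its job ID
    (qsub prints an identifier of the form "<number>.<server>")."""
    out = subprocess.run(
        ["qsub", "-l", "select=1:ncpus=1", script],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

# Launch the 144 independent ED tasks; sp_task.sh is a hypothetical wrapper.
job_ids = [submit_pbs("sp_task.sh") for _ in range(144)]
print(f"submitted {len(job_ids)} jobs, first ID: {job_ids[0]}")
```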

In general, the difference in execution times between Slurm and PBS Pro, as given in Tables 5 and 8, could be due to the difference between the two schedulers in handling job arrivals and completions. When the scheduler is triggered, it may interrupt the execution of the currently running jobs to find a suitable allocation for the newly received jobs. We notice that the execution times on node06-m29 and node07-m29 are very close, although these are the same servers used in the Slurm testbed. We found that the Mom service in PBS Pro detects the available memory of the two nodes as 128 GB each, whereas the slurmd service detects the available memory of the two nodes as approximately 192 GB and 128 GB, respectively.

Table 7 Execution time and memory requirements of sp.A on different computational PBS Pro nodes
Table 8 Average, minimum and maximum execution time of sp.A on different computational nodes

4.4 Performance metrics

The performance of the Slurm and PBS Pro testbeds has been assessed using an intrusive benchmark that applies stress to all the testbed’s resources (\(n \ge N\)), where n is the number of jobs and N is the number of available processors. This experimental approach aims to accurately determine values for throughput, system speedup, user speedup, and the number of tasks completed as a function of time.

4.4.1 Throughput

Throughput is a key performance metric that measures the rate at which a system can process a workload or a set of tasks over a given period. Throughput is often expressed in terms of tasks per second. It provides an indication of the system’s efficiency in handling a large number of tasks simultaneously.

$$\begin{aligned} Throughput= \frac{J}{ \Delta t} \end{aligned}$$
(1)

Where J is the number of completed jobs, and \(\Delta t\) is the total time, in seconds, including communication and schedule time overhead, of the completed jobs.
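As a minimal illustration, Eq. (1) can be computed directly from the number of completed jobs and the elapsed time; the example values reuse the 144 tasks and the 543 s grid wall time reported later in Sect. 5.1.

```python
def throughput(completed_jobs: int, total_time_s: float) -> float:
    """Eq. (1): completed jobs J divided by the elapsed time Delta t in
    seconds (including communication and scheduling overhead)."""
    return completed_jobs / total_time_s

# 144 ED tasks completing in 543 s on the grid gives about 0.27 jobs/s.
print(throughput(144, 543.0))
```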

4.4.2 System/workflow speed up

System/workflow speed-up \(S_{site}\) is used to evaluate the grid in terms of the execution times of the system or workflow job [72]. A workflow consists of multiple jobs that are usually submitted in bulk. Therefore, the speedup that a user might anticipate when executing a specific class of applications on the Grid can be characterized as

$$\begin{aligned} S_{site}= \frac{T_{site}}{T_{Grid}} \end{aligned}$$
(2)

Where \(T_{Grid}\) is the workflow wall time using all Grid resources, and \(T_{site}\) is the optimum execution time using only the resources available in a given site.

4.4.3 User speed up

User speed-up, \(U_{site}\), is used to evaluate the grid from the user’s perspective (i.e., the speed-up for running a single job submitted by a user using grid resources compared to site resources, instead of the speed-up of running all the jobs submitted to the grid, as defined in Sect. 4.4.2). Therefore, we determine the mean duration between the start of a job and its completion using only the resources available in a given site (\({\hat{T}}_{site}\)) divided by the mean duration using all Grid resources (\({\hat{T}}_{Grid}\)) as follows:

$$\begin{aligned} U_{site}= \frac{{\hat{T}}_{site}}{{\hat{T}}_{Grid}} \end{aligned}$$
(3)

Where:

$$\begin{aligned}{} & {} {\hat{T}}_{site}= \frac{1}{n} \sum _{i\in S} {T_i} \quad \quad ,i={0,1,2,...,n-1} \end{aligned}$$
(4)
$$\begin{aligned}{} & {} {\hat{T}}_{Grid}= \frac{1}{n} \sum _{i\in G} {T_i} \quad \quad ,i={0,1,2,...,n-1} \end{aligned}$$
(5)

Where n is the number of jobs, and \(T_i\) is the total time, in seconds, including communication and schedule time overhead, of job i.
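A small sketch of Eqs. (2)-(5) is given below, assuming the per-job durations are available (e.g., extracted from the schedulers’ accounting records); the example reuses the ASU and grid workflow wall times reported later in Sect. 5.1.

```python
from statistics import mean

def system_speedup(t_site: float, t_grid: float) -> float:
    """Eq. (2): workflow wall time on a single site / wall time on the grid."""
    return t_site / t_grid

def user_speedup(site_job_times: list[float], grid_job_times: list[float]) -> float:
    """Eqs. (3)-(5): ratio of the mean per-job durations (site vs. grid)."""
    return mean(site_job_times) / mean(grid_job_times)

# ASU workflow wall times from Sect. 5.1: 1254 s on the site alone vs. 543 s
# on the grid, i.e. a system/workflow speedup of roughly 2.3 for ASU.
print(system_speedup(1254.0, 543.0))
```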

4.4.4 Number of tasks completed as a function of time

The number of tasks completed as a function of time is given by:

$$\begin{aligned} n(t)= \sum _{i\in G} N_i E_i \end{aligned}$$
(6)

Where \(N_i\) denotes the number of processors used for job i in the grid (G). \(E_i\) takes the value 1 if job i has completed and 0 if it is still running or in a queued state. \(E_i\) is calculated by:

$$\begin{aligned} E_i= {\left\{ \begin{array}{ll} 1, &{} \text {for } \lfloor {\frac{t}{\Delta T_i}}\rfloor \geqslant 1 \\ 0,&{} otherwise.\\ \end{array}\right. } \end{aligned}$$
(7)

Where \(\Delta T_i\) represents the total time, in seconds, from the beginning of the experiment to the completion of job i, encompassing communication and schedule time overhead. We drew inspiration from the n(t) equation presented in [72] and subsequently modified it to align with our experiment design. In our adaptation, \(E_i\) equals 1 only when job i has been completed. This adjustment ensures consistency, addressing the fact that the equation from [72] can occasionally generate values of 2 or 3, which can occur when jobs experience prolonged waiting times before execution. In addition, in our case, \(N_i\) consistently equals 1, signifying the use of one processor per job.
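The following sketch implements Eqs. (6)-(7) under our setting of \(N_i = 1\): n(t) simply counts the jobs whose completion time, measured from the start of the experiment, does not exceed t. The sample completion times are hypothetical.

```python
def tasks_completed(t: float, completion_times: list[float]) -> int:
    """Eqs. (6)-(7) with N_i = 1 processor per job: n(t) counts the jobs
    whose total time Delta T_i, measured from the start of the experiment
    (including overheads), is at most t."""
    return sum(1 for dt in completion_times if dt > 0 and t >= dt)

# Hypothetical completion times (seconds) for four jobs.
times = [46.0, 71.0, 248.0, 312.0]
print([tasks_completed(t, times) for t in (60, 120, 240, 360)])  # [1, 2, 2, 4]
```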

5 Results

We evaluated the testbed performance in terms of the metrics mentioned in Sect. 4.4 by using the serial Embarrassingly Distributed (ED) kernel of the NAS Grid benchmark with Class E (144 tasks) [67].

Section 5.1 shows the experimental results of the grid using Slurm in multi-cluster mode. Then, Sect. 5.2 illustrates the results of a comparative study of Slurm in multi-cluster mode versus PBS Pro in peer-scheduling mode. As PBS Pro is already installed on the EN-HPCG, this study is performed to decide whether to migrate to Slurm as an open-source scheduler, provided it proves effective for both users and the system.

5.1 Slurm multi-cluster results

To assess the testbed’s performance in the presence of an intrusive benchmark in which the number of executed jobs exceeds the number of available processors, we executed 144 jobs of the ED class E NAS Grid benchmark on our testbed, where the SRTA cluster consists of 2 nodes with 16 cores each and the ASU cluster consists of 4 nodes with 8 cores each, as shown in Table 3.

There are two cases for task submission. The first case is to submit jobs to the ASU cluster (ASU-SRTA), and the second case is to submit jobs to the SRTA cluster (SRTA-ASU). In both cases, tasks can be re-submitted from one cluster to the other if Slurm expects an earlier start time for the job on the other cluster.

Figure 4 shows the grid performance, which is the actual grid throughput measured in jobs per second, when running six distinct executions of the benchmark on various days of the week to average out network bandwidth variations. Because of Slurm scheduling and offloading tasks between the two clusters (i.e., Slurm assigns each job to the cluster with the earliest expected start time), 89 jobs were executed on SRTA and 55 jobs on ASU in the SRTA-ASU case. In the ASU-SRTA case, 54 jobs were executed on ASU and 90 on SRTA. Figure 4 shows that the grid obtains the same performance in the two different job submission cases. This means that Slurm’s behavior does not depend on where the jobs are submitted: it makes the offloading decision at submission time and does not favor the local cluster. Throughput was high at 90 and 120 completed jobs and then decreased after 120 jobs.

Fig. 4
figure 4

Grid throughput using Slurm multi-cluster

Figure 5 illustrates the performance of the grid together with the performance of the ASU and SRTA clusters without Slurm multi-cluster scheduling. The performance was calculated by Eq. 1. The performance of the grid, represented by the solid line, is better than the performance of the ASU cluster, shown as the dotted red line, because the ASU cluster has limited assets and outdated hardware and takes more time to execute jobs than SRTA. The execution time for ASU jobs is 248 s on average, as shown in Table 5. This means that it is beneficial for ASU to join the grid.

On the other hand, the performance of the SRTA cluster (the dotted blue line) is better than the performance of the grid and the ASU cluster (the solid and dotted red lines, respectively). The average job execution times in SRTA are 46 s and 71 s, respectively, on the two SRTA cluster nodes. Average job execution times on SRTA are less than the average job execution times on ASU (248 s), as shown in Table 5. Additionally, the average job execution times on SRTA are less than those on the grid, as the latter offloads some jobs to the ASU cluster which is slower than the SRTA cluster.

Fig. 5
figure 5

Grid throughput vs. throughput of each cluster without Slurm multi-cluster scheduling

Figure 6 shows the number of executed jobs within a time interval using the grid (solid lines) compared to the number of executed jobs within a time interval using the ASU and SRTA clusters alone (dotted red and dotted blue lines, respectively). The number of completed jobs within a time interval is calculated by Eq. 6. Figure 6 illustrates that executing 144 jobs takes 543 s using grid resources, 312 s using SRTA, and 1254 s using ASU, showing that it is very beneficial for ASU to join the grid.

Fig. 6
figure 6

Number of completed tasks as a function of time using Slurm

Figure 7 illustrates the speed-up for the SRTA and ASU clusters and the benefit each cluster gains from joining the grid. The speed-up for each site was calculated by Eq. 2. As seen in Fig. 7, in this scenario, it is more beneficial for the ASU site to join the grid than for the SRTA site. The grid is particularly helpful for sites with limited assets or outdated hardware.

Fig. 7
figure 7

Grid speedup for system/workflow using Slurm multi-cluster

Figure 8 shows the speed up for SRTA and ASU clusters from the user’s perspective. The user speed-up was calculated by Eq. 3.

Fig. 8
figure 8

Grid speedup for user using Slurm multi-cluster

5.2 Comparative study results of Slurm and PBS Pro in grid mode

This section compares the performance of Slurm in multi-cluster mode (Sects. 3.1 and 4.2) against the performance of PBS Pro, which is already installed on the Egyptian National High-Performance Computing Grid (EN-HPCG) [23].

Figures 9 and 10 illustrate the experimental performance of the grid throughput (with PBS Pro peer scheduling, and Slurm multi-cluster mode) compared to the throughput of the SRTA and ASU sites (without PBS Pro peer scheduling, or Slurm multi-cluster mode), respectively. The throughput of the grid is represented by the solid lines, whereas the throughput of each site is plotted as dotted lines. Red lines show the throughput of PBS Pro, whereas blue lines show the throughput of Slurm.

Fig. 9
figure 9

Grid throughput vs. SRTA cluster throughput without PBS Pro peer scheduling, or Slurm multi-cluster

Under PBS Pro peer scheduling (i.e., the grid using PBS Pro) and initial job submission to SRTA, 112 jobs were executed on SRTA, whereas ASU received 32 jobs. Throughput generally improves, especially after 96 completed jobs, as illustrated in Fig. 9.

PBS peer scheduling is automatically activated to transfer tasks from a fully utilized site to another available site. The SRTA-ASU scenario involves submitting the entire benchmark workload, consisting of 144 tasks, to the SRTA site. Any excess tasks beyond the capacity of SRTA are then offloaded to the ASU site. Given that each task requires one core, congestion at the SRTA site occurs after 32 tasks, leading to offloading the next 32 tasks to the ASU site. After processing 32 tasks, ASU also becomes busy, and the remaining tasks are queued within the grid until resources become available at either site. Although the ASU cluster is considered a slow cluster with low resources, the improvement occurred because 32 tasks were offloaded to the ASU cluster.

The two solid lines in Fig. 9 illustrate the performance of the grid using PBS Pro peer scheduling (red line) in comparison with the Slurm multi-cluster scheduler (blue line). The grid performance of the Slurm scheduler is superior until 84 completed tasks. Afterward, the grid performance of both schedulers becomes similar. Subsequently, the grid performance of PBS Pro improves, surpassing that of Slurm, after 104 completed tasks. This is due to the difference in decision-making between peer scheduling in PBS Pro and multi-cluster scheduling in Slurm. PBS peer scheduling decides to offload a task to another cluster only when sufficient resources are available there. In other words, the job migrates to the cluster and starts running immediately, in an online decision fashion. In contrast, Slurm multi-cluster decides to offload a job to the cluster with the earliest expected start time, considering the cluster’s queue of running and pending jobs. Once Slurm decides to offload a job to a specific cluster, the job cannot be migrated to any other cluster, even if resources become available at the other clusters (e.g., some running jobs at the other clusters finish before their scheduled end times) [68]. Thus, the Slurm multi-cluster makes an offline decision about where each job should be allocated. Accordingly, Slurm decided to offload 55 jobs to the ASU cluster at their submission time. This decision resulted in poor throughput after the completion of about 112 jobs due to the long running time of jobs at ASU.

Fig. 10
figure 10

Grid throughput vs. ASU cluster throughput without PBS Pro peer scheduling, or Slurm multi-cluster

When jobs are initially submitted to the ASU site using the PBS Pro peer scheduler, there is a significant improvement in throughput when utilizing the grid for all workflow tasks. The throughput was broadly consistent with the performance observed with the Slurm multi-cluster scheduler, as illustrated in Fig. 10.

As shown by the two solid lines in Fig. 10, the grid throughput of the Slurm multi-cluster scheduler outperforms the throughput of the PBS Pro peer scheduler until 92 completed tasks. After that, the grid throughputs of both schedulers become close. Subsequently, the grid throughput of the PBS Pro peer scheduler improves after 122 completed tasks. This difference is attributed to the online decision-making of PBS peer scheduling and the offline decision-making of the Slurm multi-cluster. In contrast to the PBS Pro peer scheduling results shown in Fig. 9, the throughput of the PBS Pro peer scheduler in Fig. 10 declines after 130 completed jobs. The decline in throughput happens because ASU receives more jobs (46 jobs) when jobs are initially submitted to ASU, as shown in Fig. 10, than when jobs are initially submitted to SRTA, as in Fig. 9 (32 jobs). The difference in the number of jobs allocated to ASU between Fig. 10 and Fig. 9 is attributed to the online decision-making of PBS Pro peer scheduling. However, the throughput decline in the case of the Slurm multi-cluster happens earlier than with the PBS Pro peer scheduler.

Another observation is that the Slurm multi-cluster mode does not account for the submission host. This might be perceived as unfair for local users if there is a need to have a higher priority for running jobs in the local cluster. In contrast, PBS Pro initially allocates jobs in the local cluster. If there are no available resources, the job enters the queue state, and finally, it is offloaded only if there are available resources in another cluster. Considering the HPC grid as a dynamic environment, it is better to make the offloading decision just before accessing the resources, as is done in PBS Pro.

Figures 11 and 12 illustrate the number of performed tasks within a specified time interval, as calculated by Eq. (6), using the grid resources compared to the number of completed tasks within the same time interval on the SRTA and ASU sites, respectively. The number of executed jobs as a function of time using the grid is represented by a solid line, whereas the number of executed jobs as a function of time using a single site is plotted as a dotted line. The number of executed jobs as a function of time using the PBS Pro and Slurm schedulers is represented by the red and blue lines, respectively.

Fig. 11
figure 11

Number of completed tasks as a function of time in SRTA

Fig. 12
figure 12

Number of completed tasks as a function of time in ASU

The effectiveness of the grid is apparent in all cases where jobs are initially submitted to ASU, whether using PBS Pro or Slurm. In addition, when employing the grid with the SRTA site, there is an increase in the number of executed tasks over time using PBS Pro compared to Slurm. In contrast, the SRTA site does not gain any benefit when integrated into the grid using the Slurm multi-cluster. The workflow jobs (144 tasks) took 534 s to complete in the SRTA-ASU grid scenario under Slurm multi-cluster scheduling, although they required only 312 s to finish under PBS Pro peer scheduling.

Figure 13 illustrates the speedup of each site, evaluating the benefits gained by the site when integrated into the grid in terms of workflow tasks. Each site’s speedup is calculated by Eq. (2).

Fig. 13
figure 13

Grid speedup for system/workflow

Figure 14 illustrates the speedup of each site, evaluating the benefits gained by the site when integrated into the grid in terms of Quality of Service (QoS) for users. Each site’s speedup is calculated by Eq. (3).

Fig. 14
figure 14

Grid speedup for user

As depicted in Figs. 13 and 14, the advantages of joining the grid are more pronounced for the ASU site than for the SRTA site, whether utilizing Slurm or PBS Pro. The grid proves particularly beneficial for sites with constrained resources or aging hardware. In general, the PBS Pro scheduler achieves a higher speedup for both SRTA and ASU in terms of workflow tasks. However, the Slurm scheduler achieves a higher speedup for ASU in terms of individual tasks.

Based on the previous experimental results, it is not advisable to integrate a cluster with high-performance hardware with a cluster possessing aging or outdated hardware when using the Slurm scheduler, because the only site that would benefit from such integration is the one with outdated hardware. Slurm could be more efficient when integrating identical or comparable clusters. However, if the scheduling algorithm in the Slurm multi-cluster mode could be changed to behave like the PBS Pro peer scheduler (i.e., jobs are allocated to a cluster when the required resources are free at that cluster, instead of early submission to a cluster based on the expected earliest start time), then Slurm multi-cluster performance could be similar to that of the PBS Pro peer scheduler. We postpone the design of new scheduling algorithms to future work as it is beyond the scope of this paper.

6 Conclusion and future work

This study highlights the advantages of integrating clusters into a unified grid under specific workload conditions. Furthermore, it evaluates the effectiveness of well-known workload management systems in combining multiple clusters to form the required grid: (i) Slurm (open source) and (ii) PBS Pro (licensed). The multi-cluster and peer scheduling features of Slurm and PBS Pro, respectively, enable the integration of multiple clusters into a grid environment without using a specific middleware (e.g., Globus [27]). However, the underlying clusters must use the same workload manager to benefit from the Slurm multi-cluster or PBS Pro peer scheduling features. Based on this study, we are inclined to transition to the Slurm workload management system within our EN-HPCG grid to reduce the costs associated with EN-HPCG sustainability. The experimental results have shown that Slurm is an efficient scheduler within an individual cluster or a grid based on identical hardware. In contrast, PBS Pro exhibits a more rapid speedup for both high-performance and outdated hardware when considering workflow tasks within the grid. Additionally, the PBS Pro scheduler benefits from online decision-making in a dynamic environment using a unified grid.

Therefore, in future work, there is an intention to investigate the Slurm Scheduler to enhance its performance. One avenue for improvement involves allowing each cluster to have its own database instead of using one common database for all clusters. Thus, if the network between the clusters fails, each cluster could still serve its users. Additionally, exploring the use of separate Munge keys for user authentication, as well as inter and intra-cluster(s) communication, is under consideration. Moreover, the design of AI-based and nature-inspired optimization scheduling algorithms for the multi-cluster system will be explored. As a different investigation, an extended comparison between Slurm and other open-source tools such as HTCondor and Globus will be conducted to reduce the costs associated with EN-HPCG sustainability.

In the exploration of prospective directions for future research, researchers may encounter various significant challenges. These challenges include effectively managing multi-schedulers, adapting to new architectures, integrating heterogeneous processors equipped with artificial intelligence chips and quantum processors, and seamlessly merging HPC grids with cloud servers. Furthermore, researchers could explore challenges associated with non-functional requirements, addressing issues such as energy consumption, scalability, and resilience [73].