1 Introduction

Gene co-expression analysis examines coordinated patterns of expression between genes across multiple samples, allowing hundreds of genes to be identified, clustered, and explored simultaneously under varied conditions and providing a systematic method for assessing their functional status [1]. Gene co-expression networks (GCNs) represent these associations as a graph in which nodes denote genes and edges denote the relationships between them. GCN models have proven helpful in the development of hypotheses that are subsequently validated through empirical methods, and their reliability is substantiated by the fact that numerous predicted interactions have later been confirmed experimentally [2]. Consequently, algorithms and computational techniques for GCN reconstruction have gained prominence within the Bioinformatics community [3].

In this context, GCNs establish relationships between genes exhibiting similar expression patterns. The degree of relationship between each gene pair is measured, and a relationship is established only if this degree exceeds a certain threshold. The threshold value represents the minimum level of similarity required between two expression patterns for the relationship to be considered significant. Common measures used to evaluate the co-expression degree between two genes include correlation coefficients such as Pearson’s, Spearman’s, or Kendall’s [4]. Mutual information and other measures have also been widely employed for GCN reconstruction [5].

However, GCNs typically exhibit two primary limitations: (a) the measures mentioned above have certain constraints [6], such as their inability to identify nonlinear associations or their assumptions about data distribution, as observed with Spearman and Pearson coefficients, respectively [7]; and (b) the inferred networks are often hyper-connected, hindering comprehensive analyses, whereas real gene networks are recognized to be sparse [1]. Therefore, attempts have been made to create new algorithms that integrate various co-expression metrics to infer networks robustly, as is the case with EnGNet (Ensemble Greedy Networks) [7], developed to overcome the limitations of using a single correlation measure while leveraging an optimized network topology. It also includes a study of the hubs within the networks to improve the final network topology, as proposed in [8].

Another crucial issue to be addressed when developing gene co-expression network algorithms is computational performance. Typically, GCN methods take as input expression datasets obtained from technologies such as microarrays or RNA-Seq. As the costs of high-throughput sequencing have decreased significantly over the years, vast amounts of genetic data are now available from which gene-to-gene associations can be computed [7, 9]. As a consequence, the generated datasets can include information on tens of thousands of genes across hundreds of samples. The total number of possible gene pairs to be analyzed is therefore so large that good computational performance is critical to obtaining results in a reasonable time. This scenario is described by the large \(p\), small \(n\) problem, where \(p\) (the number of variables, in this case genes) is much larger than \(n\) (the number of observations or samples), leading to computational and analytical challenges in gene network analysis [10, 11].

In response to these challenges, there have been ongoing improvements to the effectiveness and scalability of GCN algorithms. These new developments are frequently framed within high-performance computing (HPC) environments. General-purpose computing on graphics processing units (GPGPU), big-data programming paradigms, and traditional parallel or distributed computing models are the most frequently used HPC technologies [12, 13]. Traditional parallel/distributed computing models accelerate computation by using all available CPU cores or clusters [14]; this performance enhancement is the result of dividing a program into sub-tasks that can be executed in parallel. On the other hand, programming paradigms such as Apache Spark and Apache Hadoop use robust and efficient CPU and memory-processing architectures to accelerate computation and facilitate the analysis of massive datasets [15]. In gene co-expression analysis, graphics processing units (GPUs) are frequently used to enhance computational performance due to the massively parallel hardware these devices offer in comparison to CPU processors [16].

Multi-GPU computing is emerging as an effective solution to the computational challenges in the generation of gene co-expression networks. GPU devices, with their ability to perform large-scale parallel operations, are ideally suited to process the extensive genomic data in databases such as The Cancer Genome Atlas (TCGA), which stores genomic data from thousands of patients with various types of cancer [17]. Implementing GCN algorithms on multi-GPU platforms allows us to process these huge datasets in reasonable time, significantly improving our ability to perform statistical modelling [18].

In addition, multi-GPU computing not only improves processing speed, but also optimises memory efficiency, a crucial aspect when working with high-dimensional datasets. By distributing data and computation across multiple GPUs, it is possible to handle datasets that would otherwise exceed the memory capacity of a single GPU [19]. This facilitates the generation of GCNs from previously unmanageable high-dimensional datasets, opening up new opportunities for the extraction of useful biological knowledge.

In this paper, pyEnGNet, a multi-GPU Python version of the original EnGNet algorithm, is presented. This new implementation can generate GCNs from high-dimensional datasets, allowing statistical connections between bio-molecules to be computed across a large number of samples. In addition, a new Python library has been made available to the scientific community, providing a development framework for multi-GPU implementations that use correlation coefficients (https://pypi.org/project/pyengnet/). The algorithm code is released at https://github.com/aureliolfdez/pyengnet and a user manual is available at https://pyengnet.readthedocs.io/en/latest. PyEnGNet has been tested with a variety of dataset sizes, and the results show that it outperforms the original version of EnGNet.

The main contributions of this work can be summarized as follows:

  • A new multi-GPU Python implementation of EnGNet is presented.

  • This new version, pyEnGNet, is able to process high-dimensional input datasets and outperforms the original EnGNet in terms of computational performance.

  • A new Python library for GCN reconstruction was developed. This package provides an efficient implementation of different algorithms to infer GCNs.

The rest of the paper is organized as follows: the main related works are presented in Sect. 2. The datasets and the description of pyEnGNet are detailed in Sect. 3. The experiments carried out and the discussion are described in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Related works

As stated above, GCNs have emerged as a powerful tool for extracting biological knowledge. However, their computational intensity cannot be overlooked. There is therefore a need to develop more efficient algorithms and computational tools to manage the expanding quantity of data and the complexity of gene co-expression networks [20]. For instance, the Weighted Gene Co-expression Network Analysis (WGCNA) method [21] is known to have a high computational cost, as it relies on permutation tests to determine significance [22]. When dealing with large-scale datasets, the computational cost of GCN models increases even further. For instance, the Pearson correlation coefficient, a measure commonly used for constructing gene co-expression networks, requires a considerable amount of time and memory for large datasets [23].

Several GCN methods use traditional parallel and distributed computing models, such as the KINC algorithm [24]. KINC uses a Gaussian Mixture Model (GMM) [25] approach to capture multi-modal relationships between genes. Because this approach is computationally expensive, the authors adopted an HPC model that allows the algorithm to run on a single CPU or a cluster of CPUs. However, they found that, when building networks for large datasets, this model did not deliver results because jobs exceeded the 72-h limit imposed by their cluster. Therefore, they developed a multi-GPU version of their algorithm that achieves speedups of up to 500 times over the version based on traditional parallel programming models. Another work is presented in [26], where the authors introduce a MATLAB library, called gpuZoo, for improving the performance of GPU-based computation of gene networks. The library offers new implementations of four previously published algorithms for generating gene networks: Panda, Spider, Puma and Lioness. The gpuZoo implementation in MATLAB and Python runs up to 61 times faster and is up to 28 times less expensive than the multi-core CPU implementation of the same methods. gpuZoo is available in MATLAB through the netZooM package.

In [27], the authors introduced an efficient methodology for parallelizing evolutionary algorithms using GPU computing to infer mechanistic gene networks. These networks are capable of evolving a specific pattern of gene expression within a multi-cellular tissue area or in cell cultures. The proposed method is based on multiple CPU threads, which execute lightweight crossover, mutation and selection operators while launching GPU kernels asynchronously. By taking advantage of the power of one or more GPUs, these kernels run in parallel, and each kernel simulates and evaluates the error of a model using the parallelism of the GPU threads. The effectiveness of the methodology was evaluated in the spatio-temporal inference of mechanistic gene regulatory networks, encompassing both topology and parameters, to evolve a given 2D gene expression pattern. The results showed a remarkable 700-fold speedup compared to the single-CPU implementation.

Finally, another recent work is presented in [28]. This work proposes an intelligent parallel swarm algorithm, called PGRNIG, to optimise the S-system parameters. In this algorithm, a clone selection strategy is employed to improve the whale optimization algorithm (CWOA). To improve the time efficiency of CWOA optimization, a parallel CWOA (PCWOA) based on the CUDA computing platform is used. The algorithm employs a decomposition strategy and L1 regularization to reduce the search space and the complexity of GRN inference. The authors applied the PGRNIG algorithm to three synthetic datasets and two real time-series expression datasets of Escherichia coli and Saccharomyces cerevisiae. The obtained results showed that PGRNIG can infer gene regulatory networks more accurately than other state-of-the-art methods, with a compelling computational speed-up.

As can be seen, traditional GCN algorithms involve computationally intensive tasks. This necessitates the use of HPC techniques to construct new algorithms with improved computational performance. GPU devices significantly enhance the efficiency of parallel computing models. However, using these devices does not in itself guarantee that large datasets can be processed or that memory saturation issues are avoided [29]. In this work, pyEnGNet accounts for factors such as distributing processing tasks evenly across multiple GPU devices. In addition, this new implementation efficiently manages the various types of memory on these devices, preventing them from becoming overloaded when presented with large datasets. To maximise the computational efficiency of GPU devices, a resource planning scheme is employed. This scheme enables pyEnGNet to control the number of threads required in each CUDA block to execute a given number of tasks. All of these efforts are concentrated on maximising the utilization of available resources and processing large datasets.

3 Materials and methods

The primary goal of this work is to develop a new Python version of EnGNet that uses the resources of GPU devices and a multi-process CPU architecture to accelerate the construction of gene co-expression networks. In this section, the key features of the original EnGNet are summarized. Next, the parallelized version for HPC environments, pyEnGNet, is explained in detail. Finally, the methodology and datasets used in the experiments are described.

3.1 The EnGNet algorithm description

As mentioned above, this work is based on an improved implementation of the EnGNet algorithm [7]. EnGNet is an algorithm for generating large gene co-expression networks based on two steps (see Fig. 1): (a) generation of an ensemble network and (b) topological optimization of the network. In the first step, the input dataset is processed to obtain all possible gene pairs to be evaluated as potential relationships. To do this, three correlation measures (Spearman [30], Kendall [31] and Normalized Mutual Information [32]) are combined in a majority voting ensemble strategy. For each measure, a threshold is set; a relationship is considered valid and included in the ensemble network if its correlation scores exceed the corresponding thresholds for two or more measures. The weight of each valid edge is then calculated as the average of the correlation scores from the measures that validated the relationship. As a result of this step, an ensemble network composed of more trustworthy interactions between genes is generated, encapsulating a comprehensive and robust representation of gene co-expression relationships (detailed in step 1 in Fig. 1).
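To make the voting rule concrete, the following sketch reproduces the idea for a single gene pair using off-the-shelf SciPy and scikit-learn routines. The thresholds, the binning used to discretize expression values for the NMI estimate, and the square-root normalization of the mutual information are illustrative assumptions rather than the exact EnGNet implementation.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import mutual_info_score

def ensemble_edge(x, y, kendall_th=0.7, spearman_th=0.7, nmi_th=0.7, bins=10):
    """Return the ensemble weight for a gene pair, or None when fewer than
    two of the three measures exceed their thresholds (majority voting)."""
    k, _ = kendalltau(x, y)
    s, _ = spearmanr(x, y)
    # NMI needs discrete symbols, so the expression values are binned first.
    cx = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    cy = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    mi = mutual_info_score(cx, cy)
    hx, hy = mutual_info_score(cx, cx), mutual_info_score(cy, cy)  # entropies
    nmi = mi / max(np.sqrt(hx * hy), 1e-12)
    scores = [(abs(k), kendall_th), (abs(s), spearman_th), (nmi, nmi_th)]
    votes = [value for value, th in scores if value >= th]
    return float(np.mean(votes)) if len(votes) >= 2 else None
```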

Fig. 1
figure 1

A graphical description of EnGNet. As can be observed, the method is based on two steps: the ensemble generation of the co-expression network (Step 1) and the topological optimization (Step 2)

The previously generated network is used as input in the second step and processed to improve its topology. For this purpose, Kruskal's minimum spanning tree (MST) algorithm [8] is first applied to the network, which yields its most significant relationships. However, not all of the eliminated relationships are necessarily irrelevant. For this reason, the hub (central) nodes of the network are identified and, for each hub, the eliminated relationships whose weight exceeds an established threshold are added back. Once these relationships have been incorporated, the final network of the algorithm is obtained.

To determine which steps of this algorithm are suitable for parallelization, a study was conducted to compare the algorithm's computational performance (in seconds) between the ensemble network generation step (step 1) and the topology optimization step (step 2). Ten synthetic datasets with a fixed number of columns (25) and a number of rows ranging from 1000 to 10,000 in increments of 1000 were used for this purpose. The results showed that 98.63% of the original algorithm's average execution time is attributable to the ensemble network construction step, whereas the optimization step accounts for only 1.37%. It was therefore determined that the ensemble network generation requires a more comprehensive analysis in order to be parallelized (see Sect. 3.2), whereas the optimization step is executed sequentially on the CPU due to its low workload.

3.2 The pyEnGNet parallel algorithm

PyEnGNet is a Python parallel version of the EnGNet algorithm based on three steps, as shown in Fig. 2: the input data and the workload are distributed and balanced during the first step, a gene co-expression ensemble network is generated in the second step, and the network is optimised in the last step. The second step can be executed in parallel using GPU devices (GPU mode) or using a multiprocessing architecture on parallel CPUs (CPU mode). This version takes as input a gene expression dataset, a set of thresholds for the classifiers (NMI, Spearman, and Kendall), the network optimisation threshold Th\({\beta }\), and the number of GPU devices to use.

Fig. 2
figure 2

The general PyEnGNet scheme is shown. The input data consist of the gene expression dataset, a set of thresholds, and the number of Graphics Processing Unit (GPU) devices to use. Then, the pyEnGNet algorithm consists of three steps. The second one is executed in parallel by GPU devices or multi-process architecture on CPU processors. Eventually, an ensemble-based gene co-expression network will be the outcome

The choice of Python for the development of pyEnGNet is based on multiple factors, such as readability, simplicity, scalability, portability, the large number of libraries and specialized tools, and its strong support for data science, among others [33]. Interest in Python within the data science community and, specifically, in Bioinformatics continues to grow considerably [34, 35]. As a result, Python offers a wide range of scientific and data analysis libraries for Bioinformatics, such as BioPython [36], Scikit-Bio [37] and pybedtools [38]. It is therefore worthwhile to develop new libraries such as pyEnGNet that can be combined with other libraries to provide a comprehensive solution to a Bioinformatics problem.

In addition, Python is highly compatible with HPC environments, allowing for the efficient processing of large datasets, a requirement that is becoming increasingly prevalent in bioinformatics [39]. In HPC environments, the development of versions for architectures with multiple CPU processors or GPU devices is common practice [40]. By supplying versions for both architectures, this algorithm is executable on a variety of hardware platforms, allowing pyEnGNet to achieve reasonable computational performance in a variety of computing environments.

The subsequent subsections will detail the characteristics of each pyEnGNet phase.

3.2.1 Step 1: Large-scale distribution and balancing of data for HPC environments

Adapting a Python-developed algorithm to an HPC environment that supports large quantities of data presents several challenges. One of these challenges is to maximise the use of the computational resources available, guaranteeing that each resource contributes proportionally to the processing of the workload. In order to achieve this objective, pyEnGNet implements dynamic workload balancing, which enables automatic adjustment of the workload in response to changing conditions.

To achieve dynamic workload balancing, all gene–gene interactions in a dataset are partitioned into smaller chunks. To determine the number and size of the chunks, memory and processing capabilities are taken into account. For memory management, the amount of RAM or global GPU memory available at the time of algorithm execution is calculated; to prevent the system from becoming overwhelmed, only 75% of this available memory is used, and this budget determines the size of each chunk. In terms of processing, the total number of CPU cores or GPU devices available during execution is determined, which establishes how many chunks can be executed simultaneously within the same iteration. Dynamic load balancing therefore enables the optimal number and size of chunks required to process all gene–gene interactions efficiently to be calculated.
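As a rough illustration of this planning step, the sketch below derives a chunk size from 75% of the currently available RAM and the number of workers. The per-interaction memory model, the use of the psutil package to query memory, and the function name plan_chunks are assumptions made for illustration, not pyEnGNet's actual code.

```python
import math
import multiprocessing as mp
import psutil  # assumed here as the way to query available RAM

def plan_chunks(n_genes, n_samples, bytes_per_value=8, n_workers=None):
    """Estimate how many gene-gene interactions fit in one chunk and how many
    chunks are needed, using 75% of the currently available memory."""
    n_workers = n_workers or mp.cpu_count()
    budget = 0.75 * psutil.virtual_memory().available
    # Rough per-interaction footprint: two expression profiles plus bookkeeping.
    bytes_per_interaction = 2 * n_samples * bytes_per_value + 64
    total_interactions = n_genes * (n_genes - 1) // 2
    per_chunk = max(1, int(budget // (n_workers * bytes_per_interaction)))
    n_chunks = math.ceil(total_interactions / per_chunk)
    return per_chunk, n_chunks, n_workers
```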

If pyEnGNet is run in an HPC environment with CPU processors (CPU mode), other challenges, such as the overhead of input/output (I/O) operations and Python's GIL mechanism, need to be addressed. Algorithms implemented in Python that support large datasets often require numerous I/O operations, which can significantly affect execution time [41]. To mitigate this problem, I/O accesses are taken into account when distributing chunks across the available CPU processors, thus reducing the need to share storage resources [42, 43]. Consequently, as shown in Fig. 3, each chunk is assigned to a process, and each process operates on a dedicated CPU processor. Each CPU process is responsible for the parallel processing of the tasks that make up step 2 (Sect. 3.2.2), namely the construction of the ensemble network.

Fig. 3
figure 3

The distribution of the input gene expression dataset is shown. This is divided into multiple chunks to be assigned to independent processes. Each process is executed on a specific CPU

The decision to use processes as the parallelism mechanism when pyEnGNet is executed in CPU mode stems from Python's Global Interpreter Lock (GIL) [44]. This mechanism only permits a single thread to control the Python interpreter, which can negatively impact the computational performance of thread-based parallel algorithms [45]. PyEnGNet therefore addresses this limitation by using process-based parallel computing with the multiprocessing library, one of the most popular Python libraries for concurrent programming, which enables the management of multiple processes on the same machine.
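A minimal sketch of this process-based scheme is shown below. The chunk layout (gene index pairs plus their expression profiles) and the reuse of the ensemble_edge routine sketched earlier are assumptions made for illustration.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Evaluate every gene-gene pair in a chunk and return its valid edges."""
    edges = []
    for i, j, x, y in chunk:          # gene indices and expression profiles
        weight = ensemble_edge(x, y)  # majority-voting routine sketched above
        if weight is not None:
            edges.append((i, j, weight))
    return edges

def build_ensemble_network(chunks, n_workers):
    # One process per CPU core sidesteps the GIL for CPU-bound work.
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return [edge for chunk_edges in results for edge in chunk_edges]
```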

When pyEnGNet runs in a multi-GPU environment (GPU mode), several challenges must be taken into account: the uniform distribution of data across multiple GPU devices, the maximization of computational resources (CUDA blocks and threads), data transfers between RAM and GPU device memories, and the efficient management of GPU device memories according to their latency and capacity. To optimise the algorithm's use of the GPU devices' available resources, a resource scheduling scheme (see Fig. 4) has been developed. This scheme controls the number of threads required for each CUDA block based on the total number of gene–gene interactions involved. It also facilitates the uniform distribution of processing across multiple GPU devices, concentrating on maximizing resource utilization and processing large input datasets.

Fig. 4
figure 4

The planning scheme for pyEnGNet (GPU mode) resources is displayed. This piece of code is used to determine the number of CUDA threads and blocks required for each chunk of gene–gene interactions
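As a simplified, hedged illustration of the arithmetic behind such a scheme, the snippet below derives a launch configuration under the assumption of one gene–gene interaction per CUDA thread; the device limits shown are typical values rather than figures queried from the hardware.

```python
import math

def plan_cuda_launch(n_interactions, max_threads_per_block=1024,
                     max_blocks_per_grid=65535):
    """Compute a (launches, blocks, threads) configuration for one chunk of
    gene-gene interactions, mapping one interaction to one CUDA thread."""
    threads = min(max_threads_per_block, n_interactions)
    blocks = math.ceil(n_interactions / threads)
    # If the grid limit is exceeded, the chunk is split across several launches.
    launches = math.ceil(blocks / max_blocks_per_grid)
    return launches, min(blocks, max_blocks_per_grid), threads
```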

The efficient management of GPU device memory and the transfer of data between RAM and GPU memories are additional crucial factors. In a GPU device, shared memory (lower latency, lower storage capacity) and global memory (higher latency, higher storage capacity) are the most important and frequently used memories. This methodology employs global memory to facilitate the processing of massive datasets. This decision minimizes the number of data transfers required between the GPU and RAM, and reflects the fact that computing the association measures for each gene–gene interaction requires the entire column dimension of the matrix. Because of its low storage capacity, and because the column dimension cannot be split into chunks, GPU shared memory is not used, as it is susceptible to memory overflows.

Another factor concerning memory management must be taken into account in both pyEnGNet execution modes (CPU and GPU). Memory management is automated in Python during program execution; however, this can cause memory fragmentation and a decrease in computational performance. Furthermore, Python frequently has a larger memory footprint than other programming languages, which can restrict the number of concurrent operations that can be performed on HPC systems [46]. To address this, memory usage is minimized by techniques such as deleting gene–gene interaction objects and chunks once they are no longer needed.

3.2.2 Step 2: Generating a gene co-expression ensemble network

As pointed out in Sect. 3.1, in order to create an ensemble gene network, each gene–gene interaction from the input dataset must be processed. Four tasks take place during this processing (see Fig. 5): the three correlation measures are applied to each interaction, and a majority voting technique is then employed to determine whether that interaction will be included in the final network. To implement this methodology under data parallelism, several factors were considered: the dependence of the fourth task on the results of the previous tasks, the uniform allocation of computational resources to maximise occupancy, minimising read and write operations in global and shared memory, achieving maximum coalescence in memory management, and the ability to support large datasets and workloads.

Due to the dependence of the fourth task on the results of the previous tasks, and given that pyEnGNet aims to support large datasets and workloads, data-level parallelism is implemented. This means that the four tasks are executed sequentially, and what is parallelized is the total number of gene–gene interactions per task. To accomplish this step, two parallelization options have been introduced in pyEnGNet: CPU mode and GPU mode.

Fig. 5
figure 5

General scheme to generate the ensemble network. For each gene–gene interaction present in a particular chunk, four tasks are performed: Kendall, Spearman, NMI, and majority voting. These gene–gene interactions are executed in parallel in a CPU or multi-GPU environment, based on the user's preference

If the CPU mode is selected, each processor is responsible for processing the gene–gene interaction pairs included in each one of the chunks created in the previous step (see Sect. 3.2.1). The Kendall and Spearman correlation coefficients are calculated using the NumPy and SciPy Python libraries. NumPy [47] is a fundamental scientific library in Python that provides a multidimensional matrix object and a wide range of matrix operations. Internally, this library uses its own data structures implemented in native code, so operations with arrays and vectors are significantly faster [48]. SciPy [49] is a NumPy-based library with additional mathematical and statistical functions, such as Kendall and Spearman. For each coefficient calculated, its validity is determined based on the thresholds specified by the user. As the NMI correlation measure is not included in an optimized Python library, a new parallel implementation is provided.

The calculation of the NMI value is illustrated in Algorithm 1. Each CPU processor's aim is to determine the NMI value for a chunk of gene–gene interactions. To obtain the NMI value for each gene–gene interaction pair GGI(ij), three values must be calculated: the mutual information between genes GGI[i] and GGI[j] and the entropy of each of these two genes. In line 3, the mutual information between GGI[i] and GGI[j] is derived from their joint and individual probability distributions. A high mutual information value indicates that the two genes are closely related, whereas a low value indicates the opposite. Next, the algorithm calculates the entropy for genes GGI[i] and GGI[j] separately (lines 4 and 5), using the probability distribution of the values present in each gene. If the entropy is low, the gene follows a deterministic pattern and its values are typically predictable; if the entropy is high, the gene exhibits a greater degree of randomness, resulting in greater variation in the observed values. Finally, the NMI value is calculated in line 6 and, if it is valid, that is, if it is greater than or equal to nmiThreshold, it is stored in the output array D (lines 7–9).

Algorithm 1
figure a

Validation of the NMI measurement for a chunk associated with a specific CPU processor
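A CPU reference of the per-pair computation described in Algorithm 1 might look as follows. The histogram-based probability estimates, the number of bins, and the geometric-mean normalization are assumptions, since those details are not fixed in the description above.

```python
import numpy as np

def nmi_pair(x, y, bins=10, eps=1e-12):
    """Histogram-based NMI between two expression profiles (a sketch of the
    per-pair logic of Algorithm 1, not the exact implementation)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    # Mutual information from the joint and marginal distributions (line 3).
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (np.outer(px, py)[nz] + eps)))
    # Entropy of each gene from its marginal distribution (lines 4-5).
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / max(np.sqrt(hx * hy), eps)  # final NMI value (line 6)

def validate_chunk(chunk, nmi_threshold):
    """Keep the NMI of every pair in the chunk that passes the threshold
    (the role of the output array D in lines 7-9)."""
    D = {}
    for idx, (x, y) in enumerate(chunk):
        value = nmi_pair(x, y)
        if value >= nmi_threshold:
            D[idx] = value
    return D
```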

Once the correlation measures have been calculated, gene–gene interactions are validated in parallel through the majority voting process. A gene–gene interaction is deemed valid if at least two correlation measures were validated in the previous tasks. The ensemble gene co-expression network is built from these valid gene–gene interactions, where the nodes correspond to the genes involved and the weight of each edge is the average of the validated correlation measures.

If the GPU mode is selected, each GPU thread is in charge of calculating the correlation values for each measure and then applying majority voting to validate each gene–gene interaction. The GPU resource scheduling scheme described in Sect. 3.2.1 is used to determine which gene–gene interaction corresponds to a specific GPU thread. To take full advantage of the GPU devices' computational capacity, all correlation measures and the majority voting system were implemented without using any external libraries. The computation of the Kendall measure is carried out by Algorithm 2. First, a gene–gene interaction GGI(ij) from the input dataset is assigned to a CUDA thread based on the thread id within its CUDA block and the block's offset (line 1). To calculate Kendall's value, two factors are needed: the number of concordant and discordant pairs, and the number of ties involved in the gene–gene interaction. The number of concordant and discordant pairs helps to determine the relationship between the two genes in the interaction (lines 2–13); a concordant pair is one in which two gene expression values maintain the same relative order in both genes, whereas a discordant pair is one in which this relative order differs. The tie counts, in turn, group pairs of gene expression values that share the same value within each gene (lines 14–23). All these values, stored in the variables concordant, discordant, tiersGene1 and tiersGene2, are used to calculate the Kendall value, which is stored in the output array D at the position corresponding to the CUDA thread index (line 24).

Algorithm 2
figure b

Calculation of the Kendall coefficient for a specific gene–gene interaction in a multi-GPU environment
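The kernel body can be read as the following CPU reference, where one function call corresponds to the work of one CUDA thread. The tie handling and the tau-b denominator shown here are the standard definitions; the exact formula used in Algorithm 2 is assumed to match.

```python
def kendall_pair(x, y):
    """Kendall's tau-b from explicit concordant/discordant/tie counts,
    mirroring the per-thread logic of Algorithm 2 on the CPU."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for a in range(n - 1):
        for b in range(a + 1, n):
            dx, dy = x[a] - x[b], y[a] - y[b]
            if dx == 0 and dy == 0:
                continue                      # tied in both genes: ignored
            elif dx == 0:
                ties_x += 1                   # tie only in the first gene
            elif dy == 0:
                ties_y += 1                   # tie only in the second gene
            elif dx * dy > 0:
                concordant += 1               # same relative order
            else:
                discordant += 1               # opposite relative order
    denom = ((concordant + discordant + ties_x) *
             (concordant + discordant + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0
```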

The Spearman coefficient is calculated in CUDA using Algorithm 3. As in the previous algorithm, each CUDA thread is assigned a gene–gene interaction GGI(ij) (line 1). To determine the Spearman coefficient for an interaction, three parallel tasks are performed: calculation of the ranks, the covariance of the interaction, and the standard deviation of each gene involved in the interaction. The results of these three calculations are stored in the array stats and used to calculate the Spearman coefficient.

Algorithm 3
figure c

Calculation of the Spearman coefficient for a specific gene–gene interaction in a CUDA thread

As for the first parallel task, the rank calculation essentially consists of ordering the gene expression values of the two genes involved in the interaction. However, explicit sorting incurs a significant computational cost. Therefore, vectors that store the rank of each value are constructed without modifying or pivoting the original gene values (lines 2–8), saving a substantial amount of computational time. Each GPU thread is then responsible for calculating the covariance of its assigned gene–gene interaction (line 9). After all covariances have been calculated and stored in the stats array, a GPU thread computes the standard deviation of each gene involved in the interaction (line 10). Finally, the Spearman value is calculated from the contents of the stats array (line 11). This way of calculating Spearman's coefficient avoids explicit vector sorting and improves computational performance by dividing the work into three independent, parallel operations.
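Conceptually, the three sub-tasks compute Spearman's rho as the Pearson correlation of the two rank vectors. The short CPU reference below uses scipy.stats.rankdata for the rank step, which the CUDA kernel replaces with its own order vectors; it is a sketch of the computation, not the kernel itself.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_pair(x, y, eps=1e-12):
    """Spearman's rho from the three sub-tasks described above: rank
    computation, covariance of the ranks, and standard deviations."""
    rx, ry = rankdata(x), rankdata(y)              # average ranks handle ties
    cov = np.mean((rx - rx.mean()) * (ry - ry.mean()))
    std_x, std_y = rx.std(), ry.std()
    return cov / max(std_x * std_y, eps)
```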

The NMI association measure is calculated in a multi-GPU environment using Algorithm 4. As with the other correlation measures, each CUDA thread is assigned a gene–gene interaction GGI(ij) (line 1). To obtain the NMI value for a specific interaction, the following quantities are computed independently and in parallel: the mutual information of the interaction and the entropy of each gene involved in it.

Algorithm 4
figure d

Calculation of the NMI coefficient for a gene–gene interaction executed in a CUDA thread

Mutual information measures the mutual dependence between the two genes in an interaction, i.e., how much information one gene provides about the other. To obtain this value, several probabilities must be calculated: the joint probability distribution of the two genes, its normalizing sum, and the marginal probabilities of each of the genes involved in the interaction (lines 2–8). The values of these probabilities are stored in an array called stats and are then used to calculate the mutual information between the genes involved in the interaction (lines 9–12).

Once the mutual information has been calculated, the entropy of each gene in the interaction is calculated in parallel, and the values are stored in the array stats (lines 14–15). Finally, the mutual information and entropy values calculated above are used to compute the NMI coefficient (line 16), whose value is stored in an output array D that will be used in the pyEnGNet majority voting task (line 17). The calculation of the NMI value and the storage of its result are also performed in parallel in CUDA.

After all correlation measures have been calculated, the majority voting task is carried out. For each gene–gene interaction, the voting strategy determines the number of correlation values that are greater than or equal to the user-supplied thresholds. Finally, the gene–gene interactions considered valid are stored for subsequent use.

3.2.3 Step 3: Optimization

Once the ensemble network has been generated, this step optimizes the network's topology. To implement this sequential step, the Python NetworkX library [50] has been used to identify the most significant genes and relationships by applying Kruskal's minimum spanning tree (MST) algorithm. The hub genes of this highly significant network are then identified through a topological analysis. For each hub, the removed relationships are re-evaluated against a user-supplied threshold that determines their level of biological relevance.
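A sketch of this optimization step using NetworkX is shown below. The degree-based hub definition, the threshold values, and the use of a maximum spanning tree to retain the highest-weighted edges (the paper applies Kruskal's algorithm to the weighted ensemble network) are illustrative assumptions.

```python
import networkx as nx

def optimize_topology(edges, hub_degree=10, adding_threshold=0.7):
    """Keep a Kruskal spanning tree of the ensemble network, then re-add
    pruned edges incident to hub nodes whose weight is high enough."""
    G = nx.Graph()
    G.add_weighted_edges_from(edges)                 # (gene_i, gene_j, weight)
    # Spanning tree over the strongest relationships (Kruskal's algorithm).
    T = nx.maximum_spanning_tree(G, algorithm="kruskal")
    hubs = {node for node, degree in G.degree() if degree >= hub_degree}
    for u, v, data in G.edges(data=True):
        if T.has_edge(u, v):
            continue
        if (u in hubs or v in hubs) and data["weight"] >= adding_threshold:
            T.add_edge(u, v, weight=data["weight"])  # re-added hub relationship
    return T
```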

As mentioned previously, pyEnGNet uses the GPU's global memory rather than shared memory. Even though the former has greater latency, its larger storage capacity and the reduced number of data transfers more than make up for it; shared memory, on the other hand, may cause memory overflow issues when dealing with datasets containing a large number of columns. In this work, a resource management scheme has been developed that optimises the use of the available GPU and CPU resources. Consequently, pyEnGNet achieves a balanced workload distribution, whether using a single CPU or GPU or multiple GPU devices. As demonstrated in Sect. 4, this enables pyEnGNet to attain high processing speeds and the capacity to process large volumes of data.

3.3 Datasets description

This section describes the datasets used in the experimental part. Specifically, twelve synthetic datasets and two real gene expression datasets have been used to evaluate the performance and scalability of pyEnGNet. In addition, two other real datasets specific to human organisms have been used to demonstrate the biological utility of the algorithm.

3.3.1 Synthetic datasets

Using a modified version of the algorithm proposed in [51], twelve synthetic gene co-expression datasets have been artificially generated. This modified version guarantees that these datasets contain differentially co-expressed genes, which allows the efficiency and scalability of pyEnGNet to be evaluated in an intensive setting by increasing the likelihood of obtaining a larger number of valid interactions than if the datasets were generated with random values. To generate a synthetic dataset \(D_{i,j}\), where i represents the number of genes and j the number of samples, the generator uses an adjacency matrix A as the structure of the gene–gene correlations and a noise level that determines the intensity of the correlation. Then, for each position within the adjacency matrix, correlation patterns between genes are generated: if the correlation is positive, the expression values of gene i are replicated in gene j; otherwise, the co-expression values are inverted. In both cases, co-expression values follow a Gaussian distribution with a mean and standard deviation of 1.
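A simplified sketch of such a generator is given below; the noise model, the base distribution and the way correlated profiles are copied or inverted follow the description above only loosely, and the function name and signature are illustrative.

```python
import numpy as np

def synthetic_dataset(adjacency, n_samples, noise=0.1, seed=0):
    """Generate co-expressed profiles from an adjacency matrix A: a positive
    entry replicates gene i's profile in gene j, a negative entry inverts it."""
    rng = np.random.default_rng(seed)
    n_genes = adjacency.shape[0]
    # Base expression values: Gaussian with mean 1 and standard deviation 1.
    data = rng.normal(loc=1.0, scale=1.0, size=(n_genes, n_samples))
    for i in range(n_genes):
        for j in range(n_genes):
            if adjacency[i, j] > 0:
                data[j] = data[i] + rng.normal(0.0, noise, n_samples)
            elif adjacency[i, j] < 0:
                data[j] = -data[i] + rng.normal(0.0, noise, n_samples)
    return data
```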

As shown in the results of this work, the computational cost of the correlation measures increases with the number of gene–gene interactions. Therefore, twelve synthetic datasets with variable row sizes and a fixed column size have been generated. These sizes range from \(1000 \times 25\) to \(10,000 \times 25\) in increments of 1000 rows, and from \(10,000 \times 25\) to \(30,000 \times 25\) in increments of 10,000 rows. To approximate the total number of genes in the human genome, the maximum number of rows for each experiment has been set to 30,000.

3.3.2 Real datasets to evaluate performance and scalability

A real gene expression dataset was used with pyEnGNet to construct large gene co-expression networks and demonstrate that this new version can support and process large workloads and datasets. The objective is to collect data from patients with glioblastoma multiforme (GBM) and low-grade gliomas (LGG) using clinical information provided by The Cancer Genome Atlas (TCGA) [52]. GBM [53] is regarded as the most fatal form of heterogeneous brain cancer, impacting a large number of patients each year, with survival times of 8–15 months. In contrast, LGG [54] are a heterogeneous group of primary brain lesions with a generally higher long-term survival rate than GBM.

The LinkedOmics database [55] was used to collect RNA-Seq expression data for a total of 669 patients screened using the Illumina HiSeq platform. For a total of 20,118 genes, gene-level counts normalized as reads per kilobase million (RPKM) were obtained for each patient. The generated human GBM-LGG dataset therefore consists of a \(20,118 \times 669\) numerical matrix containing 669 samples for six distinct glioma subtypes. In addition to the RNA-Seq expression data, a separate clinical data file was obtained from the database, encompassing valuable information about the patients included in the study. The RNA-Seq dataset was filtered based on two clinical criteria: first, only patients diagnosed with astrocytoma were included, which allowed the study to focus on a specific form of glioma; second, only patients who had undergone radiotherapy treatment were selected. By applying these criteria, we aimed to establish a more homogeneous patient cohort.

To assess the survival outcomes of the patient cohort, a quartile analysis was performed on the survival values. The first quartile (Q1) and third quartile (Q3) were determined, representing the points below which 25% and 75% of the patients' survival times fall, respectively. These quartiles provided cutoffs for classifying patients as \(short\_survival\) or \(long\_survival\) based on their survival values. Patients with survival times below Q1 were categorized as \(short\_survival\), indicating shorter survival than the majority of the cohort; conversely, patients with survival times above Q3 were classified as \(long\_survival\), indicating longer survival. The dimensions of these two datasets are 20,118 genes and 33 patients. These data were then normalized to conform to the specifications of pyEnGNet input matrices.
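The quartile split can be reproduced with a few lines of NumPy, assuming a vector of survival times; the strict inequalities follow the "below Q1" and "above Q3" wording above, and the function name is illustrative.

```python
import numpy as np

def survival_groups(survival_days):
    """Return the sample indices of the short_survival (< Q1) and
    long_survival (> Q3) groups from a vector of survival times."""
    days = np.asarray(survival_days, dtype=float)
    q1, q3 = np.quantile(days, [0.25, 0.75])
    short_idx = np.flatnonzero(days < q1)   # shorter survival than most
    long_idx = np.flatnonzero(days > q3)    # longer survival than most
    return short_idx, long_idx
```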

3.3.3 Real dataset from invasive aspergillosis study

Invasive aspergillosis [56] is a fungal infection caused by organisms of the Aspergillus genus, primarily affecting patients with acute myeloid leukaemia and those undergoing allogeneic hematopoietic stem cell transplantation (allo-HSCT), but also observed in other haematological malignancies [57, 58]. Despite advancements in prophylaxis and treatment strategies for invasive aspergillosis, its incidence and mortality remain high [59].

RNA-Seq data have been obtained from a longitudinal pilot study including case–control cohorts of two distinct patient groups: patients undergoing allogeneic stem cell transplantation (allo-SCT) and patients diagnosed with invasive aspergillosis. We refer to the first group of patients as control and the second group as invasive. The dataset, which consists of 2588 genes and 66 samples, can be obtained through the Gene Expression Omnibus (GEO) database with accession number GSE174825 [60]. The 66 blood samples come from 3 probable invasive pulmonary aspergillosis (IPA) cases (two male, one female) and three matched controls without Aspergillus infection, all of whom had previously undergone allogeneic stem cell transplantation. Samples were gathered biweekly where possible, and a technical replicate was obtained for each sample. cDNA libraries generated from the samples were sequenced using Illumina NovaSeq 500. Further details on sample preparation can be found in the original article by [60].

Pre-processing has been performed using the edgeR package [61]. Genes with very low counts in all libraries provide little evidence of differential expression and may interfere with certain statistical approaches. In addition, they compound the burden of multiple testing when estimating false discovery rates, thus reducing the power to detect differentially expressed genes [62]. Therefore, preprocessing was performed to retain only those genes expressed above 0.5 counts per million (CPM) [63]. Multidimensional scaling (MDS) plots were used to further segregate samples according to their attributes. Finally, between-sample similarities were evaluated via hierarchical clustering. These analyses collectively allowed a comprehensive exploration of the underlying patterns and relationships within the dataset. As shown in Fig. 6, the MDS analysis revealed dissimilarity between control and invasive aspergillosis samples.

Fig. 6
figure 6

Multidimensional Scaling (MDS) plots showing main differences between individual samples according to control and invasive groups

Differential gene expression (DEG) analysis has been performed to identify key regulatory elements that contribute to the reconstruction of the gene networks. This approach allowed modelling of the genetic relationships considered relevant in the comparison between samples from the control patient group and the invasive patient group. To perform the DEG analysis, the voom method [64] has been used to transform the read counts into logCPM (log counts per million), taking into account the mean-variance relationship of the data. After this transformation, a linear model can be applied to the transformed data to test for differentially expressed genes. The total number of differentially expressed genes was 2588.

To investigate possible genetic disparities between patients without Aspergillus infection and those affected by aspergillosis, the dataset of differentially expressed genes was divided into two subsets based on the two patient groups (control and invasive). The control dataset consists of 28 columns and 2588 genes, and the invasive dataset of 38 columns and 2588 genes. These two subsets were stored in two separate files to be subsequently processed with pyEnGNet. The results generated by the algorithm for these two subsets are shown in the supplementary information.

4 Results

In this section, the main results of the experiments carried out are presented, comparing the efficiency and scalability of the multicore-CPU and multi-GPU versions of pyEnGNet with the original EnGNet algorithm. The results are presented in Sect. 4.1: the twelve synthetic datasets are created and analyzed first (Sect. 4.1.1), and the two real datasets are computationally evaluated in Sect. 4.1.2. In addition, the supplementary information shows how real datasets are used to demonstrate the algorithm's utility in a biological study involving a group of patients with invasive aspergillosis following allogeneic stem cell transplantation. It is important to mention that a comparison of EnGNet against other methods in the literature is not carried out in this section, because the original algorithm has already been tested exhaustively. For more information in this regard, the reader is referred to the original EnGNet publication [7].

As stated in the methodology (see Sect. 3.2.3), the ensemble network generation phase is the most complex and computationally intensive. Due to this, the execution times presented in the results account only for this phase of the methodology; the time required for data loading, storage of results, and the optimization phase is not taken into consideration. Note that, for all datasets, all implementations receive the same input parameters: a dataset, a threshold for each correlation coefficient (NMI, Kendall, and Spearman), a threshold that determines when a node is considered a hub (hubThreshold), and a threshold that determines whether an edge returns to the network after pruning (addingThreshold). All of these thresholds have been set to 0 in order to be as unrestrictive as possible and thus force the algorithm to use the maximum amount of computational resources. These values are only modified in the study of the algorithm's biological utility.

All experiments were conducted on an Intel Xeon E5-2686 v4 (18 cores at 2.30 GHz) with 32 GB of RAM and eight NVIDIA K80 12 GB graphics cards, each with 2496 CUDA cores. The original EnGNet algorithm is implemented in Java, while pyEnGNet is implemented in Python+CUDA C/C++. Finally, single-GPU and multi-GPU versions of pyEnGNet have been applied to every dataset.

4.1 Benchmarking the performance and scalability

4.1.1 Benchmarking on synthetic datasets

As stated previously, various experiments have been conducted with synthetic datasets containing 1000 to 30,000 genes and a fixed column size of 25 (see Sect. 3.3.1). The primary objective is to evaluate the performance and scalability of the various EnGNet implementations, namely the original algorithm, the parallel multicore CPU version, and the GPU version (with a single GPU and up to eight GPUs in parallel). The outcomes are depicted in Figs. 7 and 8, with execution times in seconds on a \({\text {log}}_{10}\) scale. The first study compares the efficiency of the various algorithm versions; the same input datasets are systematically used for this purpose. The second study assesses the scalability of the CUDA version of pyEnGNet; in this case, the algorithm is executed using high-dimensional datasets and multiple GPUs operating in parallel. It can thus be seen that the addition of multiple GPUs substantially enhances performance with respect to the original algorithm. Moreover, based on these figures, we can deduce that the algorithm's complexity follows a quadratic pattern, denoted as \(O(N^2 \cdot K)\), where N represents the number of genes and K the number of columns (samples). Nevertheless, a discernible trend emerges indicating that this cost can be dramatically reduced on multi-GPU architectures, since the execution time does not rise substantially as the dataset size grows, provided that an adequate number of GPU devices is employed.

4.1.1.1 Performance evaluation between versions
Fig. 7
figure 7

Comparison of the performance of the original EnGNet, pyEnGNet (CPU parallel), and its CUDA implementation. The original version shows an exponential trend, while the new implementations show a better trend as the dataset volume increases

In this study, different experiments have been carried out to test the performance of the different implementations of EnGNet. Specifically, Fig. 7 provides a detailed comparison of the performance of the original algorithm (EnGNet), pyEnGNet (parallel CPU), and its CUDA version (single GPU). The figure illustrates the progression of the execution time in seconds for each implementation. The original algorithm exhibits an exponential trend, whereas the parallel CPU and GPU versions of pyEnGNet exhibit a nearly linear trend. It can thus be confirmed that using the original algorithm is not feasible when faced with large datasets and/or a high workload. Regarding the datasets used, the number of genes ranges from 1000 to 8000. This graph was originally intended to display datasets containing up to 10,000 genes; however, the original algorithm requires more than 10 days to complete a run on datasets with more than 9000 genes, making it impractical to display these results graphically (see Tables 1 and 2 for details).

Figure 7 shows only one exception, for datasets containing fewer than 2000 genes, where the original algorithm outperforms the CPU parallel version due to the cost of initializing multiple processes in the latter. The GPU version does not incur this initialization cost, so it improves execution times significantly in all cases. Consequently, the results depicted in these figures demonstrate the substantial improvement brought about by the new EnGNet implementations, especially for large datasets. This advantage is significant, as emphasized in the introduction of this work, due to the growing abundance of information in gene expression datasets resulting from advances in technology. Consequently, the development of efficient network generation algorithms becomes crucial in order to harness the full potential of such data.

Also, the same figure illustrates that when comparing the CPU parallel version to the CUDA version, the CUDA version (using a single GPU) achieves nearly linear performance, as it does not exceed 300 s for datasets containing fewer than 4000 genes (see Table 2 for details). The parallel CPU version’s performance is also nearly linear, although with longer time values than the GPU version. Even though the input dataset is not particularly large, these results indicate that the GPU version outperforms the CPU version. It has been demonstrated that GPU-based algorithms do not always achieve better performance for small datasets due to factors including data transfer between RAM and GPU global memory, improper memory usage that can lead to memory overflows, thread synchronization, initialization, and improper use of the GPU device architecture, among others [65, 66].

Fig. 8
figure 8

Results of the experiment with the CUDA version testing its scalability. The graph shows the results of the algorithm when running on a varying number of GPUs. As expected, the best results are obtained when using 8 GPUs in parallel

4.1.1.2 Scalability performance test for CUDA-pyEnGNet

As previously mentioned, a scalability test was carried out for the CUDA version of pyEnGNet with different numbers of GPUs running in parallel. Specifically, the experiment was performed for 1, 2, 4 and 8 GPUs working in parallel with different input dataset sizes (from 1000 to 30,000 genes).

Regardless of the amount of input data, all datasets involved in this experiment were processed with a high level of efficiency, as depicted in Fig. 8. The figure also illustrates that the execution time decreases dramatically as the number of parallel GPUs increases (for details, see Table 4). The results for eight GPU devices are the most remarkable: the execution time is practically constant throughout the entire experiment, regardless of the amount of input data.

These results support the contention that the use of multiple GPUs can be a significant advancement for biological studies involving high-dimensional datasets, as opposed to other versions that may make it impossible to conduct a series of experiments in reasonable run times (see the run times for the 30,000-gene dataset in Tables 1 and 4). The enhanced processing capacity also makes it possible to skip preprocessing steps such as data filtering for dimensionality reduction, thus avoiding information loss in the generated biological models and resulting in more accurate and comprehensive depictions of biological reality.

4.1.2 Benchmarking on real datasets

The aim of this experiment is to generate gene co-expression networks with the highest possible volume on two real gene expression datasets to demonstrate that the pyEnGNet algorithm is able to support and process large datasets and intensive workloads.

4.1.2.1 Performance and scalability evaluation
Fig. 9
figure 9

Comparison of the computational efficiency, in seconds, of the parallel CPU and GPU implementations of pyEnGNet utilizing a single GPU device for the two real datasets. As can be seen, GPU devices are recommended when executing this algorithm on large workloads or large datasets

Figure 9 depicts the evolution of the execution times for both real datasets. This figure does not include the execution times of the original EnGNet algorithm because they exceed 10 days. With the parallel CPU implementation, the algorithm generated the gene co-expression network for the \(long\_survival\) dataset in 137,568.66 s and for the \(short\_survival\) dataset in 133,055.16 s. Despite the significant advance of this version over the original algorithm, the GPU-based implementation yields even more encouraging results: when the algorithm is executed with one GPU, the networks for the \(long\_survival\) and \(short\_survival\) datasets are generated in 11,298.37 s and 10,898.62 s, respectively. These data show that, for these datasets, the algorithm running on a single GPU is able to generate results up to 100 times faster than the original algorithm (see Tables 5 and 6 for more details).

Fig. 10
figure 10

This figure depicts the scalability test of the multi-GPU variant of pyEnGNet. This test determines the effectiveness of memory management and workload distribution across multiple GPU devices. Evidently, the use of eight GPU devices significantly enhances computational performance compared to versions with fewer devices

The scalability of the CUDA version of pyEnGNet operating on parallel architectures with multiple GPU devices has also been evaluated. In particular, this test was conducted with 1, 2, 4, and 8 GPUs for the two real datasets used in this experiment. As depicted in Fig. 10, the computational efficiency improves as the number of GPU devices increases, despite the high computational burden posed by these datasets. With 8 GPU devices, the algorithm was able to generate the gene co-expression network in 86.11 s for the \(long\_survival\) dataset and 85.51 s for the \(short\_survival\) dataset (see Tables 7 and 8 for more information). This indicates that, for these datasets, the algorithm with eight parallel GPU devices can produce results approximately 13,000 times faster than the original EnGNet algorithm.

The efficiency and scalability test demonstrates that the correct use of multi-CPU and multi-GPU parallel architectures, as well as correct memory management, data transfers, and efficient use of the CUDA processing environment, enable the system to rapidly process these types of real datasets. Therefore, these implementations provide a solution to a real problem when the scientific community encounters studies that necessitate the use of computationally intensive and data-intensive algorithms such as EnGNet.

5 Conclusions

Parallel and distributed system architectures with multiple CPU processors and GPU devices can be used in studies involving the generation of gene networks from large gene co-expression datasets. In particular, the high computational demands and the volume of these datasets make the number of gene pairs to be analyzed so great that traditional GCN techniques cannot obtain results in a reasonable time. It is therefore essential to develop massively parallel or distributed techniques so that the scientific community can conduct its experiments in the shortest feasible time, regardless of the volume of data.

To address these problems, we have presented pyEnGNet, a novel parallel version of the EnGNet algorithm in Python and CUDA that is capable of running on multiple CPU processors and multi-GPU systems in data-intensive and computationally intensive environments. PyEnGNet can process large datasets at very high speed and without memory issues. In order to achieve efficient load balancing and prevent memory overflow, a low-level resource planning scheme has been developed that maximises the use of all available computational resources and determines how the CPU and GPU architectures and their associated memory types should be managed. Thanks to this resource planning scheme, pyEnGNet is able to enhance computational performance and distribute the workload efficiently as the amount of data and the number of GPU/CPU devices increase. In addition, a Python library, accessible via the PyPI package manager, has been developed to facilitate the scientific community's use of this technique.

Experimental results on synthetic datasets have demonstrated that the pyEnGNet algorithm generates results in a reasonable time in environments with large datasets, whereas the original algorithm required more than 10 days. Moreover, pyEnGNet's efficiency in multi-GPU environments improves as the number of GPU devices and the data volume increase. Experiments with real datasets demonstrated that pyEnGNet is suitable for use under the conditions of increased data volume and complexity posed by actual bioinformatics research challenges. Finally, pyEnGNet effectively identified discernible alterations in immune system-related parameters between samples from the control group and Aspergillus-infected cases, providing valuable insights into the organism's defense mechanisms against invasive aspergillosis.