1 Introduction

In general, society is unaware of the enormous impact of Information and Communications Technology (ICT) on the greenhouse gas emissions caused by its energy consumption. The constant increase in this consumption is due to the significant proliferation of electronic devices and applications frequently used for routine tasks. In addition, the Internet of Things (IoT) paradigm has caused the appearance of new devices that, despite their low individual consumption, have a significant global impact given their enormous quantity. According to the predictive models on electricity use by ICT developed by Andrae and Edler [1], the consumption of these technologies will increase from 13% of global electricity use in 2022 to 21% in 2030, and could reach more than half (51%) of the world's total demand in the worst-case scenario. This would mean that ICT could contribute up to 23% of global greenhouse gas emissions in 2030 [2]. Therefore, it is necessary to analyze different options, as in this work, to reduce the contribution of ICT to environmental impact. Fortunately, the most pessimistic forecasts for ICT are not being fulfilled, since actual consumption is lower than expected. This is because the industry around ICT is aware of the problem and is investing resources in developing policies to reduce consumption.

Another way to reduce the carbon footprint produced by ICT is to exploit electricity tariffs with hourly discrimination [3]. This consists of executing applications during the hours when the tariff is lower and wind and/or solar electricity production is higher. As a result, the economic cost of the energy needed to execute the programs can decrease, and the use of renewable energy is also encouraged. Some countries digitally report the energy price in real time, making it possible to combine this information with scheduled executions of the applications. This is one of the aspects dealt with in this work. Specifically, the contributions of this paper are the following:

Fig. 1 Architecture of HPC clusters and the corresponding hybrid MPI-OpenMP model implemented in this work to take advantage of their computing nodes

  • Provide an energy-efficient, parallel, and distributed exploration approach based on mRMR+KNN that exploits heterogeneous clusters with Non-Uniform Memory Access (NUMA) nodes.

  • Apply mRMR for feature selection and KNN for subsequent classification to solve an EEG classification problem that involves a dataset from the University of Essex. The mRMR+KNN combination had not previously been applied to that dataset.

  • Analyze the proposed application in terms of three fundamental metrics: energy consumption, execution time, and accuracy of the results. The study aims to show how greater energy efficiency can be achieved by adequately exploiting the hardware architecture on which the application runs. This is especially useful to save time and energy in future research based on the mRMR+KNN combination.

  • Provide an energy policy to save money or energy, depending on the user's preferences. The policy is intended to be easily adaptable to other HPC applications.

The approach developed in this work implements a hybrid MPI-OpenMP model to achieve higher performance by increasing parallelism and minimizing communications and overhead. Message-Passing Interface (MPI) is mainly used for inter-node communications, and OpenMP for multi-threaded parallelization within each node. Figure 1 shows how the implementation of the proposed approach (Fig. 1b) maps to the physical architecture of the clusters for which the application has been designed (Fig. 1a). In this sense, each MPI process is in charge of (i) managing the execution of a NUMA node; (ii) exchanging information between nodes through the network; and (iii) processing its part of the work by distributing the computational load among the different CPU cores with OpenMP threads.
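To make the two-level model concrete, the following minimal skeleton sketches how one MPI process per NUMA node spawns OpenMP threads across its CPU cores. It is an illustrative, hypothetical sketch rather than the application's actual source code:

```cpp
// Hybrid MPI-OpenMP skeleton: one MPI process per NUMA node,
// OpenMP threads across the node's CPU cores (illustrative sketch).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    // FUNNELED: only the thread that called MPI_Init_thread makes MPI calls
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // (i) one process manages one node
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // (ii) inter-node messages use MPI

    // (iii) the node's share of the work is split among OpenMP threads
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("Process %d of %d runs %d threads on its node\n",
                    rank, size, omp_get_num_threads());
        // ... per-thread computation over the local share of instances ...
    }

    MPI_Finalize();
    return 0;
}
```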

After this introduction, the rest of the document is structured as follows: Sect. 2 reviews different works in the literature related to the topic addressed in this paper. Section 3 details the different parallel implementations of the proposed approach and its energy policy for money-saving. Then, Sect. 4 analyzes the experimental results and discusses the importance of energy awareness in HPC systems. Finally, Sect. 5 provides the paper’s conclusions.

2 Related work and background

Given the importance of a good balance between performance and energy efficiency in computing, different methods and solutions have been proposed to address this problem. The work in [4] compiles an extensive list of works on HPC and categorizes them based on compute device type, optimization metrics, and energy-saving methods. The compatibility of employing techniques for HPC systems, such as workload balancing or Dynamic Voltage and Frequency Scaling (DVFS), has also been discussed in several studies [5, 6]. The environmental impact produced by ICT is also being combated in different ways. Among those actions are [7]:

  • Technological improvements in electronic components: The increase in the number of ICT devices makes their energy efficiency more relevant. Changing the internal architecture of microchips has been one outstanding line of action in this field. For example, special-purpose processors such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) have been developed, which have turned out to be very efficient in specific applications. Neuromorphic computing [8] and the integration of cooling directly into the chip using microfluidic systems [9] are also booming. The change from Hard Disk Drive (HDD) to Solid State Drive (SSD) technology has significantly reduced the energy consumption of mass data storage.

  • Scheduling and resource management: The objective is to use the different resources available in the system so as to reduce energy consumption. For example, implementing power management features to dynamically switch between different power states [10], depending on the workload, is a suitable option to make data centers more energy efficient [11]. Power can be saved in different ways: by using standby modes on resources that are not currently needed [12], or by setting a hardware energy consumption cap [13]. The use of parallelism makes it possible to speed up applications and reduce their energy consumption by avoiding idle processing cores.

  • Scale changes: The aim is to migrate small specialized systems (calculators, alarm clocks, etc.) to more energy-efficient equipment such as smartphones. In distributed systems, the energy consumption of consumer devices is decreasing because application execution is offloaded to networks and data centers [14].

Regarding bioinformatics, this field has experienced exponential growth in recent years. As a result, biological datasets have grown considerably in size, as in the case of Electroencephalography (EEG). This discipline deals with EEG signals, which represent the electrical activity of different parts of the brain. For example, EEG signals are used to aid in the diagnosis of disorders such as schizophrenia [15], dyslexia [16], depression [17], autism [18], sleep problems [19], epilepsy [20], and seizure manifestations in general [21,22,23,24]. They are also used to classify motor functions, such as movement of limbs or eyes [25], and to classify human emotions [26]. Some emerging areas of Artificial Intelligence, such as Machine Learning, aim to recognize patterns in these signals for their subsequent classification [27]. These tasks can be addressed through various Machine Learning methods using non-brain-inspired or brain-inspired techniques. The K-Nearest Neighbors (KNN) algorithm, used in this work, belongs to the former. In this context, this paper considers a case that falls within the scope of Motor-Imagery (MI) classification due to its social interest in Brain-Computer Interface (BCI) tasks [28]. The problem consists of identifying different imagined motor movements from EEG signals without their actual execution [29]. However, the main difficulty of working with EEG signals is their high dimensionality, which hinders correct prediction since most features do not contain relevant information. Therefore, it is important to apply feature selection techniques, such as mRMR, to obtain the most relevant ones.

Among the different existing datasets in the literature, this work focuses on a dataset [30] that belongs to the BCI laboratory of the University of Essex. It corresponds to a human subject coded as 104 and includes 178 EEG signals for training and another 178 for testing, each with 3,600 features. As the signals can belong to three different motor-imagery movements (left hand, right hand, and feet), the proposed approach deals with a 3-class classification problem.

Several works have dealt with this dataset. The first one [30] analyzed the performance of the Linear Discriminant Analysis (LDA) classifier depending on the MultiResolution Analysis (MRA) approach used to preprocess the data. In addition, it proposed a new MRA method called Graph Lifting Scheme (GLS), which was compared with others such as the Linear Lifting Scheme (LLS), Db5, and Haar. Subsequently, the works [31, 32] analyzed three alternatives, OPT1, OPT2, and OPT3, which carry out the classification process by varying the number of LDA classifiers used in the majority voting. In [33, 34], two filter methods that use the Non-dominated Sorting Genetic Algorithm II (NSGA-II) for multi-objective feature selection were proposed. In this case, the filter method is based on a set of label-aided utility functions that do not require the accuracy or the generalization of the classifier. The procedure defines a function for each label in the classification problem that is used as the objective (or fitness) function by NSGA-II. The paper [35] analyzes the dataset using Sparse Representation (SR) in combination with LDA and a Support Vector Classifier (SVC), while [36] presents LeOCCEA, a wrapper procedure that hybridizes concepts of Cooperative Co-Evolutionary Algorithms (CCEAs) and lexicographic optimization to enable the simultaneous optimization of two interdependent problems: finding the best hyperparameter values for the classifier applied within the wrapper method while minimizing the number of features that best describe the dataset. The wrapper is compared with other classifiers such as Support Vector Machine (SVM), Naive Bayesian Classifier (NBC), and KNN, and is also used as a feature selection method prior to applying these classifiers.

The application of more modern classification methods to these datasets, such as neural networks, has been studied in [37,38,39], where [38, 39] also provide measures of energy consumption and execution time. On the one hand, [37] analyzes the performance of a Deep Belief Network (DBN) when combined with LDA to reduce the dimensionality of the datasets. On the other hand, in [38], several Convolutional Neural Networks (CNNs), Feed-Forward Neural Networks (FFNNs), and Recurrent Neural Networks (RNNs) are analyzed when their hyperparameters are optimized (or not) by means of a Genetic Algorithm (GA). It is also studied whether previously applying feature selection to the dataset through logistic regression affects the quality of the results. Finally, in [39], a framework is proposed to automatically optimize the hyperparameters of the classifiers through NSGA-II. The results exceed those obtained by the EEGNet [40] and DeepConvNet [41] neural networks without optimization. A complete comparison of the results of all the methods mentioned above can be seen in Table 1.

Table 1 Kappa values from previous works dealing with the University of Essex dataset 104

As can be seen, none of the previous works applied the mRMR+KNN combination to the University of Essex dataset, although the KNN algorithm has been studied in [36]. The KNN classifier has been chosen in this work due to its effectiveness in motor-imagery applications, as shown in [42]. However, KNN is very sensitive to the curse of dimensionality, and its performance worsens with high-dimensional datasets. Thus, as mRMR is able to reduce the dimensionality of the patterns and KNN has proven to be efficient with a small number of input features [36], the mRMR+KNN combination could achieve good results. In addition, mRMR has shown great effectiveness in other fields of biomedicine, such as those dealing with microarray gene expression data [43].

3 The proposed approach

EEG classification has been approached using the KNN algorithm together with the mRMR technique [44]. KNN generally tries to classify the instances (patterns) by assigning them to the predominant class among their K nearest neighbors. The steps required to classify each new instance are:

  1. Calculate the distance between the instance to classify and the training ones. In this work, the Euclidean distance has been used.

  2. Sort the distances in increasing order.

  3. Identify the predominant class among the K nearest distances.

  4. Assign the new instance to the predominant class.
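As an illustration, a minimal, self-contained sketch of these four steps for a single test instance is shown below; it is not the paper's actual code, which additionally optimizes the evaluation of all K values (see Algorithm 1):

```cpp
// Basic KNN classification of one instance (illustrative sketch).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

int knnClassify(const std::vector<std::vector<float>>& train,
                const std::vector<int>& labels,
                const std::vector<float>& instance, int K) {
    // Step 1: Euclidean distance to every training instance
    std::vector<std::pair<float, int>> dist;  // (distance, label)
    for (std::size_t i = 0; i < train.size(); ++i) {
        float d = 0.0f;
        for (std::size_t f = 0; f < instance.size(); ++f) {
            const float diff = train[i][f] - instance[f];
            d += diff * diff;
        }
        dist.emplace_back(std::sqrt(d), labels[i]);
    }
    // Step 2: sort the distances in increasing order
    std::sort(dist.begin(), dist.end());
    // Steps 3 and 4: majority vote among the K nearest neighbors
    std::map<int, int> votes;
    for (int k = 0; k < K && k < static_cast<int>(dist.size()); ++k)
        ++votes[dist[k].second];
    int best = -1, bestCount = 0;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;  // predominant class among the K nearest neighbors
}
```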

Although interpretability is the main advantage of this method, the computing time depends linearly on the dataset size since each new instance must be compared with all the training ones. In this regard, multiple improvements over the basic procedure have been proposed [45]. However, one of the most useful ways to reduce execution times without decreasing accuracy is to use parallelism techniques. Parallelism is especially useful in clusters that include multiprocessors and accelerators, such as GPUs, since it allows us to take advantage of these devices' potential. Regarding mRMR, this method tends to select the subset of features that has the highest correlation with the class (output) and the lowest correlation among themselves. It ranks the \(N_F\) features of the dataset according to the minimum-redundancy-maximum-relevance criterion; its implementation in this work is based on the one proposed in [43] and uses the F-test Correlation Quotient (FCQ) to select the next feature. mRMR helps to deal with the dimensionality problem [46] and to reduce computation time by avoiding the evaluation of all \(2^{N_F}\) possible subsets of features. Instead, the proposed approach only evaluates \(N_F\) of them, each obtained by adding the next feature from the ranked list provided by mRMR. For each subset, all possible values of the K parameter are evaluated to get the best classification accuracy.
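A sketch of FCQ-style mRMR ranking is given below under common assumptions: relevance is taken as the F-test statistic of a feature against the class, and redundancy as the mean absolute Pearson correlation with the already-selected features. fTest and pearson are assumed helpers (not shown), and this is not the actual implementation of [43]:

```cpp
// mRMR ranking with the FCQ (quotient) criterion (illustrative sketch).
#include <cmath>
#include <cstddef>
#include <vector>

double fTest(const std::vector<float>& feature, const std::vector<int>& cls);
double pearson(const std::vector<float>& a, const std::vector<float>& b);

std::vector<int> mrmrRank(const std::vector<std::vector<float>>& features,
                          const std::vector<int>& cls) {
    const std::size_t nF = features.size();
    std::vector<int> ranked;
    std::vector<bool> used(nF, false);
    for (std::size_t step = 0; step < nF; ++step) {
        int best = -1;
        double bestScore = -1.0;
        for (std::size_t f = 0; f < nF; ++f) {
            if (used[f]) continue;
            const double relevance = fTest(features[f], cls);  // F-statistic
            double redundancy = 0.0;
            for (int s : ranked)
                redundancy += std::fabs(pearson(features[f], features[s]));
            // FCQ: relevance divided by mean redundancy (plain relevance first)
            const double score = ranked.empty()
                ? relevance
                : relevance / (redundancy / ranked.size());
            if (score > bestScore) { bestScore = score; best = static_cast<int>(f); }
        }
        used[best] = true;
        ranked.push_back(best);  // next feature in the mRMR order
    }
    return ranked;
}
```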

Algorithm 1

Since finding the best K and feature subset is computationally complex, the algorithm has been parallelized to reduce execution time and, in turn, energy consumption. In fact, parallelization occurs at two levels through a hybrid approach with MPI and OpenMP libraries: distributing subsets among worker nodes with MPI and distributing the test instances to classify among CPU threads with OpenMP. The latter can be seen in the #pragma omp parallel for directive of Algorithm 1 (Line 4). The evaluation of all values of K for a test instance has been optimized by calculating its distance from all training instances (Line 6). In this way, the array D is reused in the loop of Line 7 to obtain the predominant class according to the value of K (Line 8). If the prediction is correct, the value of the k-th position of the prediction vector, \(C_P\), is incremented by one. Once all instances have been classified, the prediction vector is sorted, and the first position is used to calculate the classification accuracy of the best K (Lines 14 and 15).
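The essence of this optimization can be sketched as follows (hypothetical names; the distance array D is assumed to be already built and sorted per test instance, so each value of K only adds the next nearest neighbor to the vote count):

```cpp
// Evaluating all K values with one sorted distance array per test
// instance (illustrative sketch of the optimization in Algorithm 1).
#include <cstddef>
#include <utility>
#include <vector>

void evaluateAllK(const std::vector<std::vector<std::pair<float, int>>>& D,
                  const std::vector<int>& testLabels, int kMax, int nClasses,
                  std::vector<int>& Cp) {  // Cp[k] = correct predictions with k
    Cp.assign(kMax + 1, 0);
    #pragma omp parallel for
    for (int t = 0; t < static_cast<int>(D.size()); ++t) {
        // D[t]: (distance, label) pairs sorted in increasing order
        std::vector<int> votes(nClasses, 0);
        int winner = -1;
        for (int k = 1; k <= kMax; ++k) {
            const int label = D[t][k - 1].second;  // k-th nearest neighbor
            ++votes[label];
            if (winner < 0 || votes[label] > votes[winner]) winner = label;
            if (winner == testLabels[t]) {
                #pragma omp atomic
                ++Cp[k];  // hit: k neighbors predict the right class
            }
        }
    }
    // The best K is the index of the maximum of Cp (Lines 14 and 15).
}
```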

The algorithmic complexity of a standard KNN classifying all instances of the test dataset is \(\mathcal {O}(K \cdot Tr \cdot Te \cdot N_F)\), where K is the number of neighbors of the KNN algorithm, Tr the number of instances in the training dataset, Te the number of instances in the test dataset, and \(N_F\) the data dimensionality. However, the proposed approach evaluates all K possible values for each subset of features, and the subsets have different computational loads since each successive subset contains one more feature. As the computational load also grows when increasing the value of K (more neighbors to compare), the complexity of the procedure when executed on a single-processor machine is \(\mathcal {O}(K \cdot \log _2(K) \cdot Tr \cdot Te \cdot N_F \cdot \log _2(N_F))\). Taking into account that the program takes advantage of all processors and is designed to be executed on many multi-core machines, the final time complexity of the proposed approach is \(\frac{\mathcal {O}(K \cdot \log _2(K) \cdot Tr \cdot Te \cdot N_F \cdot \log _2(N_F))}{P \cdot N_{Wk}}\), where P is the number of processors (CPU threads) and \(N_{Wk}\) the number of worker nodes.

3.1 A distributed master-worker scheme for node-level parallelization

Table 2 MPI tags used during communications between master and workers
Algorithm 2

The proposed application, whose pseudocode can be found in Algorithm 2, follows a master-worker scheme where the master tells each worker which feature subsets it must use to evaluate the KNN. MPI process #0 acts as the master and the rest as workers. The algorithm receives the input parameters: the datasets, the number of workers, and the maximum chunk size to send to the workers. The execution ends when there is no more work to process, and the function returns the best accuracy found (Line 41). The operation is as follows: the master waits in Line 8 for some worker to request its first job or to return the result of one, which is also implicitly associated with the assignment of a new job. The message type is identified by the MPI tags described in Table 2. When the master receives a result, it checks whether the accuracy of that job (feature subset) is better than the current one. If so, it updates its value (Lines 9 to 11) and sends a new chunk with the JOB_DATA tag (Line 14). Before sending work to a worker, the master checks for unprocessed chunks. If none are available, the worker receives the STOP tag and ends its execution since there is no more work to do (Line 16). As all workers must receive this tag to finish, the master must track how many workers have received it (Line 17).
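A minimal sketch of this master loop is shown below; the tag values and helper logic are hypothetical, and only the message flow described above is reproduced:

```cpp
// Master side of the master-worker scheme (illustrative sketch).
#include <mpi.h>

enum Tag { FIRST_JOB = 0, RESULT = 1, JOB_DATA = 2, STOP = 3 };

double master(int nWorkers, int nChunks, int chunkSize) {
    double bestAcc = 0.0;
    int nextChunk = 0, stopped = 0;
    while (stopped < nWorkers) {
        double acc;
        MPI_Status st;
        // Wait for a first-job request (FIRST_JOB) or a finished job (RESULT)
        MPI_Recv(&acc, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == RESULT && acc > bestAcc) bestAcc = acc;  // Lines 9-11
        if (nextChunk < nChunks) {
            int first = nextChunk * chunkSize;  // first feature index of chunk
            MPI_Send(&first, 1, MPI_INT, st.MPI_SOURCE, JOB_DATA,
                     MPI_COMM_WORLD);           // Line 14
            ++nextChunk;
        } else {
            int dummy = 0;  // no chunks left: this worker must stop (Line 16)
            MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, STOP, MPI_COMM_WORLD);
            ++stopped;      // track how many workers were stopped (Line 17)
        }
    }
    return bestAcc;  // best accuracy found (Line 41)
}
```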

Regarding the workers (Lines 20 to 40), they apply the mRMR algorithm to the training dataset to obtain the ranked list of features. As the feature indices change, the datasets are reordered to speed up computation by means of coalescing [47] (Line 22). This technique allows multiple memory accesses to be combined into a single transaction. At this point, a worker is ready to request jobs by sending a message with the FIRST_JOB tag (Line 23). For each chunk of features received (Lines 26 to 36), if the energy price is cheap, the worker obtains the accuracy of the corresponding feature subset by calling the evaluateFeatureSubset function of Algorithm 1 in Line 32. If the accuracy of the processed feature subset is greater than the existing one, it is updated. Once all the possible subsets of the received chunk have been processed, the worker returns the best accuracy obtained to the master by sending a message with the RESULT tag and waits for the assignment of a new job (Lines 37 and 38). This process is repeated until the STOP tag is received, indicating that the worker can end its execution.
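The matching worker side can be sketched as follows, reusing the Tag enum of the master sketch above; evaluateFeatureSubset (Algorithm 1) and the energy-policy check of Sect. 3.2 are assumed helpers:

```cpp
// Worker side of the master-worker scheme (illustrative sketch).
#include <mpi.h>

double evaluateFeatureSubset(int nFeatures);  // Algorithm 1 (assumed)
void waitWhileEnergyIsExpensive();            // energy policy, Sect. 3.2

void worker(int chunkSize, int nFeatures) {
    // mRMR ranking and dataset reordering would happen here (Line 22)
    double bestAcc = 0.0;
    MPI_Send(&bestAcc, 1, MPI_DOUBLE, 0, FIRST_JOB, MPI_COMM_WORLD);  // Line 23
    while (true) {
        int first;
        MPI_Status st;
        MPI_Recv(&first, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == STOP) break;  // no more work: end execution
        for (int i = first; i < first + chunkSize && i < nFeatures; ++i) {
            waitWhileEnergyIsExpensive();  // pause while the price is high
            const double acc = evaluateFeatureSubset(i + 1);  // first i+1 features
            if (acc > bestAcc) bestAcc = acc;  // keep the best accuracy so far
        }
        // Return the best accuracy and implicitly request a new job
        MPI_Send(&bestAcc, 1, MPI_DOUBLE, 0, RESULT, MPI_COMM_WORLD);
    }
}
```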

3.2 Implementation of the energy policy for money-saving

The proposed master-worker algorithm also incorporates an energy policy that automatically pauses the algorithm during the hours when the energy price is higher and resumes execution when the price is lower. Consequently, this policy allows data centers to save money, but at the cost of extending the execution time and the energy consumed, since the computing devices remain on even while the algorithm is paused. The policy is implemented as a client that opens a secure communication socket with the energy Application Programming Interface (API) [48]. The client sends HyperText Transfer Protocol (HTTP) requests to port 443 to obtain the price of energy according to the regulated tariff of the Spanish energy market, called the Voluntary Price for Small Consumers (VPSC). The API returns a data structure that contains attributes associated with the energy market. Among them, those used by the policy are:

  • The price of energy, expressed in €/MW·h.

  • A parameter indicating whether the price of energy is low.

  • A parameter indicating whether the energy price is below the daily average.
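A minimal sketch of such a client using libcurl is shown below; the endpoint URL and the JSON attribute names are placeholders for illustration, not the actual API of [48]:

```cpp
// HTTPS query of the current energy price (illustrative sketch with libcurl).
#include <curl/curl.h>
#include <string>

static size_t collect(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

// Returns the raw JSON answer with the price attributes described above
std::string fetchEnergyPrice() {
    std::string body;
    CURL* curl = curl_easy_init();
    // Placeholder endpoint; HTTPS implies port 443 as described in the text
    curl_easy_setopt(curl, CURLOPT_URL, "https://energy-api.example/price-now");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_perform(curl);  // error handling omitted in this sketch
    curl_easy_cleanup(curl);
    // e.g. {"price": 171.48, "is-cheap": true, "is-under-avg": true}
    return body;
}
```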

The user can disable the policy by unchecking the money-saving option in the program’s configuration file. Otherwise, the MPI process that controls each worker node will be in charge of checking the price of energy in two possible circumstances (see Algorithm 2):

  • When the worker receives a work chunk from the master. If the energy price at that moment is expensive, the execution is paused, and the MPI process sleeps until the next hour. Through the program's configuration file, the user can indicate that values below the daily average are also considered cheap, or define a custom price threshold. The latter is risky, however: if the specified value is below the range of upcoming prices, the algorithm will never start.

  • When the MPI process is asleep and the next hour is reached. In this case, the process wakes up, checks the price, and resumes execution if the price is acceptable or goes back to sleep until the next hour (when the tariff changes).
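Both circumstances reduce to the same pause/resume loop, sketched below with hypothetical helper names (priceIsAcceptable is assumed to query the API and apply the configured threshold; the sleep is aligned to the hourly tariff change):

```cpp
// Pause/resume logic of the energy policy (illustrative sketch).
#include <chrono>
#include <thread>

bool priceIsAcceptable();  // assumed: queries the API and applies the threshold

void waitWhileEnergyIsExpensive() {
    using namespace std::chrono;
    while (!priceIsAcceptable()) {
        // Sleep until the next full hour, when the tariff changes
        const auto now = duration_cast<seconds>(
            system_clock::now().time_since_epoch());
        std::this_thread::sleep_for(seconds(3600 - now.count() % 3600));
    }
}
```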

3.3 Ways of distributing the workload

As previously seen in Sect. 3.1, feature chunks are sent to workers via the message-passing interface provided by the MPI library. This allows the application to distribute the workload among the different nodes of the cluster [49]. However, in the algorithm proposed here, the workload of each feature subset is asymmetric since the number of features in each one varies. For example, suppose a dataset with ten features, two worker nodes, and a chunk size of 2. In this scenario, the first chunk that the master sends contains the indices 1 and 2, corresponding to the subsets \(\{1\}\) and \(\{1,2\}\). The second worker will receive indices 3 and 4 to compute the subsets \(\{1,2,3\}\) and \(\{1,2,3,4\}\). In other words, a higher index implies computing more features within the KNN and, consequently, a longer execution time. To deal with workload imbalance, by default, the procedure distributes chunks dynamically according to the specified chunk size. Although this has the disadvantage of increasing communications, it is essential in heterogeneous systems to avoid performance drops. If the user prefers, the master can instead give each worker its chunks of features at the start of the algorithm by dividing the total number of chunks by the number of workers. This can be done in two ways: contiguous or strided blocks (see Fig. 2). The strided assignment could reduce the workload imbalance [50] present in the contiguous-blocks alternative since each node would compute similar subsets. The impact of the different workload distributions and the chunk size on performance is discussed in Sect. 4.
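For illustration, the two static assignments of Fig. 2 can be sketched as follows (hypothetical helper; chunks are identified by consecutive indices):

```cpp
// Contiguous vs. strided static assignment of chunks (illustrative sketch).
#include <algorithm>
#include <vector>

std::vector<int> staticChunks(int worker, int nWorkers, int nChunks,
                              bool strided) {
    std::vector<int> mine;
    if (strided) {
        // Strided: worker 0 gets chunks 0, W, 2W, ...; worker 1 gets 1, W+1, ...
        for (int c = worker; c < nChunks; c += nWorkers) mine.push_back(c);
    } else {
        // Contiguous: worker 0 gets the first block, worker 1 the next, ...
        const int block = (nChunks + nWorkers - 1) / nWorkers;
        const int end = std::min(nChunks, (worker + 1) * block);
        for (int c = worker * block; c < end; ++c) mine.push_back(c);
    }
    return mine;  // chunk indices assigned to this worker
}
```

With strided blocks, consecutive (and thus similarly sized) subsets are spread across workers, which is why this assignment could reduce the imbalance of the contiguous one.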

4 Experimental work

This section analyzes the classification results obtained by the proposed approach and compares the energy-time performance of each workload distribution when using multiple nodes. In addition, a benchmark is performed to demonstrate the benefits of using the energy policy when computing over long periods. All experiments are repeated ten times to obtain more reliable measurements of the application’s behavior.

Fig. 2 The two different static workload distributions used by the master node

Table 3 Characteristics of the cluster used in the experiments

4.1 Experimental setup

The application has been executed on an HPC cluster composed of eight heterogeneous NUMA nodes interconnected via Gigabit Ethernet, whose CPU devices are detailed in Table 3. The cluster runs the Rocky Linux distribution (v8.5) and schedules the jobs using the Slurm task manager (v20.11.7) [51]. The C++ source code has been compiled with the GCC compiler (v8.5.0), the OpenMPI library (v4.0.5) with support for the MPI API (v3.1.0), and the optimization flags -O2 -funroll-loops.

The energy measurements of each node have been obtained from a custom wattmeter called Vampire, based on the ESP32 microcontroller [52] and capable of capturing real-time information on instantaneous power (W) and accumulated energy consumption (W\(\cdot\)h). This meter can monitor multiple computers in a synchronized and independent way, accurately measuring the energy consumption of a program running in a distributed manner on a multi-node cluster. Since the meter reads the voltage (V) and current (A) every second, the total energy consumed by the application over a given time can be determined. The data obtained are transmitted to a remote server via WiFi or Bluetooth and stored in an InfluxDB database for further processing, and they can also be viewed in real time using the Grafana user interface.
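The conversion from per-second samples to accumulated energy can be sketched as follows (an assumed sampling model for illustration, not the actual Vampire firmware):

```cpp
// Accumulated energy from per-second voltage/current samples
// (illustrative sketch).
#include <cstddef>
#include <vector>

double accumulatedEnergyWh(const std::vector<double>& voltage,
                           const std::vector<double>& current) {
    double energyWs = 0.0;                    // watt-seconds (joules)
    for (std::size_t i = 0; i < voltage.size(); ++i)
        energyWs += voltage[i] * current[i];  // P = V * A over a 1 s interval
    return energyWs / 3600.0;                 // 1 W·h = 3600 W·s
}
```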

4.2 Classification analysis

Table 4 Comparison of Kappa values between the proposed approach and the top 3 of Table 1
Fig. 3 Classification results of the proposed approach when using mRMR and \(K=18\)

The proposed algorithm achieves a Kappa index of 0.83 using the first 62 features of mRMR and \(K=18\). This solution widely outperforms a run without mRMR (0.34) and other approaches in the literature that use the same dataset (see Table 4). The result has been validated by replicating it with KNN implementations in Python and Matlab using the same input parameters. Figure 3a shows the corresponding confusion matrix, which reveals an overall accuracy rate of 88.8%. The evolution of the accuracy rate and the Kappa index depending on the number of selected features can be observed in Fig. 3b. The general trend is that both metrics increase as new features are added until reaching the peak (62), and then progressively fall: selecting many features, most of them irrelevant, penalizes the algorithm's performance. It is also observed that the accuracy and Kappa values drift apart at the extremes of the graph.
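For reference, the Kappa index used throughout these comparisons is the standard Cohen's Kappa, computed from the confusion matrix as

```latex
% p_o: observed agreement (overall accuracy, 0.888 here);
% p_e: agreement expected by chance from the class marginals.
\kappa = \frac{p_o - p_e}{1 - p_e}
```

so that 0 corresponds to chance-level agreement and 1 to perfect classification.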

The statistical validation of the results has been carried out using 10,000 iterations of a permutation test. This ensures that the results obtained are not achieved by chance when testing the null hypothesis. In our case, the null hypothesis consists in obtaining higher performance values (Kappa index values) in the permuted dataset than in the non-permuted one. In the test, a p-value \(< 1 \times 10^{-7}\) has been obtained, showing that the null hypothesis can be rejected since its value is less than 0.01. Despite the good results, there are limitations inherent to the use of the KNN classifier in the global optimization method proposed in this work. For example, it is well known that the appropriate value of K can vary greatly depending on the dataset used, especially if it is characterized by the curse of dimensionality, which can deteriorate the generalizability of the KNN classifier when few samples are available. However, this is a common limitation for most classifiers and should not be a particular drawback compared to other classical alternatives, although classifiers such as SVM have proven to be efficient with high-dimensional datasets. An example of this can be seen in [53], where an adaptive SVM is used in conjunction with neural networks to address the BCI Competition IV 2a dataset. It could be interesting to study whether this approach could be applied to the dataset addressed in this work, since both datasets share similarities, although it is also worth noting that KNN is less prone to overfitting than classifiers based on neural networks. There are other types of limitations as well. On the one hand, dealing with unbalanced datasets: KNN assumes that instances within the same class are grouped together. However, in some cases, the decision boundaries may be complex or non-linear, and KNN may have difficulty capturing them accurately. In other words, unbalanced data can lead to biased predictions. In the dataset discussed here, the classes are balanced, without any appreciable bias (33%, 37%, and 30%). On the other hand, for very large datasets, the computational load and memory consumption of classification could be a limitation, since each test instance must be compared with all the training samples.
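One simple variant of such a test is sketched below for illustration: the labels are permuted against the fixed predictions and the Kappa is recomputed; the paper's exact protocol may differ, and kappaFor is an assumed helper:

```cpp
// Permutation test on the Kappa index (illustrative sketch).
#include <algorithm>
#include <random>
#include <vector>

double kappaFor(const std::vector<int>& trueLabels,
                const std::vector<int>& predLabels);  // assumed helper

double permutationPValue(std::vector<int> trueLabels,
                         const std::vector<int>& predLabels,
                         int iterations = 10000) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    const double observed = kappaFor(trueLabels, predLabels);
    int asGoodOrBetter = 0;
    for (int i = 0; i < iterations; ++i) {
        std::shuffle(trueLabels.begin(), trueLabels.end(), rng);
        if (kappaFor(trueLabels, predLabels) >= observed) ++asGoodOrBetter;
    }
    // Fraction of permutations matching or beating the observed Kappa
    return static_cast<double>(asGoodOrBetter) / iterations;
}
```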

4.3 Energy-time performance

Figure 4 shows the application's performance after running on Node 3 without using the energy policy. The goal is to depict the speedup scalability of the first parallelism level, which occurs within each computing node when the number of logical CPU cores is increased. From Fig. 4a, it can be seen that the maximum speedup of 12.67 is obtained using the 48 threads available in the node. The behavior is approximately linear up to four threads and logarithmic for higher values. The main reason is that the motherboard supports quad-channel memory: increasing the number of threads above four causes competition for memory accesses, since not all of them can proceed simultaneously. To a lesser extent, it is also because the workload of each thread decreases and the cost of managing threads becomes significant. This means that the speed gain could increase with larger datasets that allow threads to compute for longer periods. It has also been found that distributing the instances to be classified among the threads statically provides the best performance (Line 4 of Algorithm 1). This is expected, as indicated in [54], because the computational workload is the same for each thread, so a dynamic distribution has been discarded. Regarding energy consumption, the 48-thread configuration also provides the lowest total energy consumption. This may seem contradictory, since the use of more resources is associated with a higher instantaneous power (see Fig. 4b). However, energy consumption also depends linearly on execution time, and since speedup grows at a greater rate than power, the total energy consumed decreases. This behavior has been widely demonstrated in the literature for a wide variety of parallel and distributed applications.

Fig. 4 Performance obtained by the proposed approach in a single-node configuration when varying the number of OpenMP threads

Fig. 5 Performance obtained by the dynamic workload distribution when using all nodes

The performance of the hybrid MPI-OpenMP approach, which corresponds to the second level of parallelism, is shown in Figs. 5 and 6. Again, the energy policy has not been activated in order to measure the performance scalability of the application. On the one hand, Fig. 5 shows the behavior of the application when all nodes are used and the workload distribution is dynamic. Figure 5a reveals that a very large chunk size leads to worse execution time and energy consumption, mainly due to workload imbalance. It is also noteworthy that a chunk size of 1 provides good results, which suggests that the cost of communications in this application is very low. The instantaneous power of each node for a chunk size of 4 is plotted in Fig. 5b. Although the optimal size ranges from 1 to 64, the value 4 has been set as definitive since it works well with few nodes and should also do so with more than 7. What is observed in the figure, and in the other 9 repetitions of this experiment, is that most nodes finish almost simultaneously, as expected with dynamic workload distributions.

Fig. 6 Comparison of performance between the different workload distributions

On the other hand, Fig. 6 compares the different workload distributions. The numbering of the computation nodes in Fig. 6a does not correspond to the order shown in Table 3. Instead, the nodes in the graph are ordered as homogeneous first and heterogeneous later: that is, first Nodes 3 to 7, and then Nodes 1 and 2. In this way, the scalability of the program can be analyzed according to the type of node added. As expected, for all distributions, the observed speedup grows linearly as more nodes are used, but only up to 5 nodes and with different magnitudes. From this point on, only the dynamic distribution continues to scale its performance, although to a lesser extent since heterogeneous nodes begin to be used. In fact, it can be seen that the increase in speedup is in line with the added heterogeneous node: adding Node 2 boosts speedup more than adding Node 1 (the slowest one), until reaching a maximum speedup of 5.88. With respect to a sequential execution (1 thread), the application achieves a speedup of 74.53 while consuming only 13.38% of the energy. The use of heterogeneous nodes also negatively affects the static and strided distributions, but in different ways. In the strided case, speedup plummets for 6 nodes and improves slightly after adding the last one. Not so for the static distribution, which worsens its performance with each node added. The instantaneous power in Fig. 6c reveals that workload imbalance is responsible: the nodes finish computing in a staggered manner, with a long interval between the first (\(t = 30\)) and the last (\(t = 125\)). In the strided case (Fig. 6b), only the homogeneous nodes finish at the same time, but before the heterogeneous ones, causing a bottleneck. Based on the results, it can be affirmed that the dynamic distribution provides the best results in speedup, energy consumption, and scalability, since the speed gain is very close to the number of nodes used to compute. Extrapolating the data, if all the nodes were homogeneous, a speedup of approximately 6.4 could be achieved.

It should also be noted that the speedup has been calculated with respect to the time depicted in Fig. 4b, where the master-worker scheme does not exist and therefore Node 3 does all the work. The objective is that the data shown in the figure take into account the overhead caused by the existence of the master and its communications with the worker nodes involved. As a consequence, all speedup values are somewhat less than those calculated based on the single-node time of the dynamic distribution. For this reason, and although it is difficult to see in the figure, the speedup of the 3 distributions is slightly below 1 when one computing node is used. Furthermore, it is the only case in which the static and strided distributions are slightly better than the dynamic one. This makes sense because if only one node computes, there is no need to continuously distribute chunks of work.

4.4 Energy policy benchmark

As discussed in Sect. 3.2, the energy policy checks the price of energy every hour and decides whether to pause, continue, or resume the execution of the application. However, the application's execution time, as observed in previous figures, sometimes does not even reach one hour. In order to measure the energy and economic impact of the policy, the application has been run in a loop for a total of 24 h using the dynamic distribution and 7 nodes, each computing with the maximum number of OpenMP threads available. Figure 7 shows the price of energy and the sum of the instantaneous power of all the nodes during the execution. In the slots where the price is cheaper (shown in green), the instantaneous power increases, meaning that the nodes are computing. This can be seen in the slots [00:00–03:00, 04:00–07:00, 08:00–09:00, 12:00–19:00, 21:00–00:00]. In the rest, the nodes pause their execution, and therefore the power drops. Now suppose a task needs 6 h of execution without the use of the policy. Depending on the time window in which it runs, using the policy could save or lose money for the user, as the total number of hours needed could be much higher. This is analyzed in Table 5.

Fig. 7 Activity of the nodes and energy price for a 24-hour period. The cheap time slots, colored green, are those where the price is below the daily average (171.48 €/MW·h)

Table 5 Energy consumption and money-saving for a 6-hour run when the policy is activated

The results obtained are very different. On the one hand, for the first seven time slots, using the policy loses money. In the first two, despite the computation taking an hour or two more, the price of energy in the expensive slots is only slightly higher than in the cheap ones, so pausing is not worth it. The same, but in a more extreme form, occurs in the following four: in all those execution windows there is, in addition to the above, a period of three consecutive expensive slots, causing the amount of energy consumed to be much higher than if the policy were not used. On the other hand, for the rest of the time slots in the table, the policy saves money or at least does not lose it, except for the slot [14:00–22:00]. The reason why the saving is 0 in the slots [12:00–18:00, 13:00–19:00] is that the number of computing hours is the same (6 h) whether the policy is used or not. Finally, it must be taken into account that the results have been obtained considering as cheap those slots whose price is below the daily average. However, as discussed in Sect. 3.2, the application allows the user to configure the threshold that defines each slot. This means that the results could be worse or better depending on the threshold value, so a study that determines the optimal threshold is essential. Also, the user could estimate the best time of day to run the program, since the energy price is known a priori and the energy consumption in both idle and active states always follows the same pattern.

5 Conclusions and future work

This work has investigated the energy efficiency of a bioengineering application capable of exploiting the qualities of distributed and heterogeneous parallel platforms. The use of mRMR for feature selection has made it possible to improve on the performance of existing approaches in the literature that use the same dataset. Another contribution of this article has been to consider energy efficiency as a fundamental parameter, contrary to other works that focus only on the accuracy of the results and on the execution time. Experiments have been carried out synchronizing the executions with time slots of different energy costs. In addition, different workload distributions for the proposed procedure have been analyzed. The results have verified that a dynamic distribution is the most appropriate option to distribute asymmetric jobs in heterogeneous systems, reaching speedups of up to 74.53 while consuming only 13.38% of the energy of a sequential execution. Even so, the next step is to improve this result by using accelerators such as GPUs and by increasing data parallelism through vectorization techniques [55].

Regarding the energy policy, it allows money or energy to be saved depending on the needs of the user, stopping or resuming the program execution according to the cost per megawatt-hour. One of the advantages of this method is that it does not depend on the implemented application, so it can be easily adapted to other applications without major changes. Another advantage is that running the algorithm in the lowest-cost time slots also contributes to reducing greenhouse gas emissions, since during these periods the generation of energy from renewable sources (wind, photovoltaic, hydraulic, etc.) is greater. However, while it has proven useful under some circumstances, the following drawbacks have been identified:

  • The price difference between expensive or cheap slots must be substantial to offset the energy consumption of overtime.

  • A period of consecutive expensive slots makes it difficult to save money since the total computing time is much higher.

  • It is not possible to save if the proportion of expensive slots is much higher than that of cheap ones. According to other benchmarks carried out, savings can be made if the proportion of cheap slots is greater than 65%, unless the difference in cost between the expensive and cheap periods is very abrupt. However, although it is common for cheap slots to predominate, finding abrupt differences with respect to expensive slots is less likely. A quick solution could be to adjust the threshold that defines which slots are considered expensive or cheap.

The common problem with these points is that the application consumes energy during the expensive slots without advancing the computation. In future work, different alternatives are proposed to improve the policy. One of them would consist of applying the DVFS technique to reduce the frequency and voltage of the CPU, minimizing energy consumption and, therefore, the monetary cost. The policy should also consider the consumption of standby devices as one more parameter and completely turn off the devices when possible. Another improvement would be not to pause the execution of the procedure, or to do so as little as possible. For example, if the system had a big.LITTLE architecture, only the so-called low-performance cores would be used during the expensive slots. This would allow computing to continue with moderate energy consumption, shortening the execution time and minimizing the probability of falling into expensive slots. Also, in cases where a company has processing centers in different countries, it would be possible to dynamically reallocate the workload to those where the cost of energy is lower [56]. In this way, the execution of the procedure would never be paused, but the drawback is that the application must be redesigned to transfer its execution status and data to other computers, which entails an extra cost in communications.