1 Introduction

Anomaly detection (AD) plays a critical role in numerous fields, including remote sensing, surveillance, and environmental monitoring [1,2,3]. In the context of multispectral image analysis, AD techniques hold significant potential for identifying and characterizing irregularities or deviations from the norm in the images, providing valuable insights into complex natural systems [4].

The processing of multispectral or hyperspectral images typically demands considerable computational resources due to the high dimensionality and complexity of the data and the need for real-time processing in many applications. This is especially true for AD tasks over very high-resolution images, as AD algorithms must process all the spatial and spectral information available in the images [4,5,6]. Different computational paradigms, ranging from high-performance computing (HPC) platforms such as clusters, grids, or clouds, to specialized accelerators such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), or even quantum computing solutions, have been exploited in this context. The choice of the most appropriate computing platform depends on the problem at hand and on the context in which it must be addressed [7]. For instance, in certain scenarios, data may be efficiently offloaded to supercomputers, while in others, it may be more practical to tackle the problem in situ using edge computing solutions or, simply, the commodity hardware available, such as a personal computer with a hardware accelerator like a GPU.

Supercomputers play a dominant role in parallel computing applied to remote sensing [8]. A supercomputer is a mixture of shared and distributed memory systems that, in many cases, includes heterogeneous nodes. Each node may consist of central processing units (CPUs), GPUs, FPGAs, or other accelerators. Many models and portable libraries have emerged as possible standards for supercomputer programming, OpenMP [9] and MPI [10] being the most widely used. In fact, MPI is the current de facto standard for parallel programming.

Accelerators have also played a dominant role in parallel computing and, especially, in image processing, including multidimensional images from the remote sensing field [7]. GPUs, renowned for their ability to perform thousands of computations in parallel, excel in tasks involving repetitive mathematical operations. Their highly parallel nature, based on multi-threaded, many-core processors, and their very high memory bandwidth make them particularly well suited for processing large-scale datasets and computationally intensive tasks. The development of NVIDIA’s Compute Unified Device Architecture (CUDA) [11] has simplified the programming model for GPUs, and many libraries have been developed to provide the most common operations implemented in CUDA. OpenCL is also a relevant standard for parallel programming across CPUs, GPUs, and other processors [12].

Presently, GPU-based computers integrate the attributes of general-purpose computing, a high degree of parallelism, and high memory bandwidth at a lower cost than other alternatives. This makes them an attractive option compared to a massively parallel system made up of only CPUs [13]. A heterogeneous computing approach that combines the strengths of both CPUs and GPUs can lead to even greater performance gains. By distributing the computational load between CPUs and GPUs based on their respective strengths, a more balanced and efficient workflow can be achieved, thereby maximizing overall processing speed and resource utilization. This kind of architecture has previously been used to tackle other tasks in multispectral images, such as registration [14] or domain adaptation [15], among others.

Recently, edge computing architectures have arisen as a promising hardware alternative for remote sensing applications [7]. Edge computing is effective in reducing the delay between the acquisition and the processing of data [16]. Most edge computing devices are heterogeneous computing platforms, such as the NVIDIA Jetson [17] used in this paper, and they make onboard remote sensing computation possible. Edge computing architectures are also very effective in reducing power consumption, which is nowadays a relevant requirement from both the sustainability and the economic perspectives. This can be especially important in real-time monitoring scenarios where transferring the raw data to HPC centers is not an acceptable option [18].

To address the challenge of reducing AD computation times, this paper presents an efficient heterogeneous parallel implementation to perform AD in high-resolution multispectral images. The objective is the detection of anomalies corresponding to human constructions within natural fluvial ecosystems. The resulting algorithm is executed on both a supercomputer and an edge computing device. Our methodology automatically identifies and characterizes anomalies corresponding to human-made constructions by combining the use of extinction profiles (EPs) for extracting spatial information from the images with the well-known Reed–Xiaoli (RX) AD algorithm. The proposed algorithm can contribute to the development of effective conservation and management strategies for fluvial ecosystems, promoting sustainable development while safeguarding the ecological integrity of the ecosystem.

The remainder of this paper is structured as follows: Sect. 2 reviews the background related to efficient parallel implementations of AD for multispectral and hyperspectral images. In Sect. 3, the AD algorithm proposed in this work is introduced, with a particular emphasis on the parallel features of the algorithm. Section 4 describes the experimental setup and presents the results obtained on a real-world dataset of multispectral images. Finally, in Sect. 5, the concluding remarks are presented.

2 Related work

Processing multispectral images of fluvial ecosystems presents several challenges due to the inherent complexity and variability of natural landscapes [19]. The similarity between the spectral signatures of the various materials in the scene, such as geological features and human-made structures, coupled with the prevalence of vegetation in the landscapes under study, poses difficulties in distinguishing human constructions, which seamlessly blend into the surrounding environment [20]. Additionally, multispectral images are vulnerable to atmospheric conditions, sensor noise, and fluctuating illumination, adding complexity to the accurate identification of anomalies in these dynamic environments [21].

Various approaches have been proposed to tackle the challenging problem of AD in multispectral images [19]. These methods can be broadly categorized into unsupervised, supervised, and semi-supervised techniques. Unsupervised methods, such as statistical modeling and clustering algorithms, do not require labeled training data and can automatically identify anomalies based on deviations from the normal data distribution [1, 22, 23]. Supervised methods, on the other hand, rely on labeled training data to learn the characteristics of normal and anomalous samples, enabling them to classify new instances accordingly [2, 24,25,26]. Semi-supervised methods aim to strike a balance between the two by utilizing a limited amount of labeled data and a larger pool of unlabeled data during the training process.

One notable algorithm that has garnered attention in AD is the RX algorithm [27]. The RX algorithm is an unsupervised technique based on the concept of the Mahalanobis distance. It characterizes anomalies by measuring the spectral deviations from the local mean of the surrounding pixels. The effectiveness of the RX algorithm lies in its ability to handle high-dimensional data and identify subtle anomalies, making it a promising candidate for detecting human constructions within fluvial ecosystems in multispectral images. Several variations of the RX algorithm have been proposed to improve the detection accuracy of the original algorithm [28,29,30,31].
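For reference, and using our own notation rather than that of [27], the global form of the RX statistic assigns to each pixel, represented by its spectral vector \(\mathbf{x}\), the score

\(\delta_{\mathrm{RX}}(\mathbf{x}) = (\mathbf{x}-\boldsymbol{\mu})^{T}\,\mathbf{K}^{-1}\,(\mathbf{x}-\boldsymbol{\mu}),\)

where \(\boldsymbol{\mu}\) and \(\mathbf{K}\) denote the mean vector and covariance matrix estimated from the background, and a pixel is flagged as anomalous when this Mahalanobis distance exceeds a threshold. Local variants replace \(\boldsymbol{\mu}\) and \(\mathbf{K}\) with statistics computed over a sliding window around the pixel under test.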

In the context of AD in multispectral images within fluvial ecosystems, incorporating spatial information alongside spectral data becomes a crucial aspect of achieving accurate and reliable results, as shown in [32] for AD in hyperspectral images. This has also been explored in previous classification-oriented works [23, 33,34,35]. Spectral information alone may not provide sufficient context to differentiate between natural variations and genuine anomalies in the complex and heterogeneous landscapes of fluvial ecosystems. By considering the spatial relationships among neighboring pixels, valuable contextual cues can be extracted, enhancing the discrimination of anomalies from the background. In this work, we incorporate the spatial information together with the spectral one by introducing the filtering technique called EP [36], which is an alternative to the widely recognized attribute profile (AP) [37, 38]. The EP is based on the concept of extinction filters (EFs), which are extrema-oriented connected filters that, unlike the AP, preserve the original height of the extrema kept in the image. Parameter tuning is also easier for EPs than for APs because they are independent of the kind of attribute being used and depend only on the number of extrema to be kept at each level of the EP [36].

Several works have explored the use of parallelization techniques to efficiently tackle the real-time AD problem. A GPU implementation of the RX algorithm for AD using CUDA is presented in [39]. It highlights the relevance of processing sub-images of the image independently on different hardware to exploit parallelism while minimizing communications, as these are a common bottleneck in multi-GPU implementations. The paper also analyzes the power consumption of the algorithms, which is especially relevant for real Earth observation missions using onboard computation.

The projection of AD to a widely used edge computing device, an NVIDIA Jetson GPU, is explored in [40] over a hyperspectral urban image acquired by the AVIRIS sensor, concluding that it is a promising solution for hyperspectral image processing in low-power-consumption scenarios. In turn, [41] proposes the use of FPGAs to perform recursive RX-based AD in hyperspectral images from both urban and natural scenes, also acquired by the AVIRIS sensor. The authors show how different variants of the traditional RX algorithm can be projected efficiently to FPGA architectures, along with GPU and cloud computing alternatives, concluding that the main limiting factor for FPGAs and GPUs is memory capacity, which can be avoided by scaling the number of nodes used. Another FPGA AD algorithm focused on minimizing power consumption is presented in [42], demonstrating that hyperspectral images with thousands of pixels and hundreds of spectral bands can be processed with a power budget of only 1.3 W. Embedded devices such as FPGAs can also be used to perform AD through deep learning techniques. For instance, a deep convolutional neural network is used in [43] for natural anomaly detection in multispectral images. Another possibility to make the algorithms better suited to edge computing devices is to align the data processing with the data acquisition process. For this purpose, a line-by-line AD technique is presented in [44], which aims to process hyperspectral data in a manner consistent with its collection by push-broom scanners.

When addressing a task such as AD, it is also important to take into account that different architectures may be better suited for each specific processing step. This aspect is explored in [45], where a combination of CPU and GPU is employed to achieve a time-efficient AD technique. The proposed technique is based on the use of multivariate normal mixture models applied to a simulated search and rescue scenario in a real hyperspectral image captured by a HySpex visual and near-infrared (VNIR) hyperspectral camera.

3 Heterogeneous parallel EP-based AD scheme

In this section, the proposed parallel algorithm for detecting anomalies in high-resolution multispectral images of fluvial ecosystems is presented. A hybrid MPI, OpenMP, and CUDA implementation is used to exploit the different levels of parallelism offered by a heterogeneous architecture that, as will be explained later, includes nodes with a multi-core CPU and a GPU.

The outline of the algorithm is shown in Fig. 1. It consists of three main stages. First, the spatial information is extracted from the input image by computing an EP over each band of the image separately. The five bands available in the images under study are represented in the figure. As a result, an extended EP is produced by accumulating the results of the individual EPs constructed for each band of the image. The second stage consists of applying the RX anomaly detector over the resulting extended EP. This stage is more efficiently calculated on a single node of the computing platform. A two-dimensional gray-scale intensity image is obtained as a result. Finally, the processing requires calculating and applying a threshold to produce the output AD map. This is the objective of the third stage, which applies Otsu’s algorithm, producing a binary map of anomalies as output.

More details of the implementation presented in Fig. 1 are offered in Algorithm 1, where all the computational steps for each stage of the algorithm and the platforms where they are computed are annotated on the right side.

Fig. 1
figure 1

Parallel RX+EP AD algorithm for heterogeneous computing

In more detail, the first stage of the algorithm, presented in Fig. 1 and detailed in lines 1–18 of Algorithm 1, can be considered as spatial processing where the structures of interest in the image are highlighted through the computation of the EPs. As the anomalies that need to be detected correspond to sets of uniform pixels rather than isolated pixels, this spatial processing stage helps in identifying these structures at different levels. This is a very common approach when changes or anomalies need to be detected in multi- or hyperspectral images. In this case, the method for extracting spatial information is the use of EPs. The processing of the EP for each image band is independent, making it a good candidate to be computed in parallel. After the tasks corresponding to the processing of each band are distributed through the use of MPI, a hybrid CPU-GPU approach is used to tackle this operation. As each EP calculation includes steps in which individual operations are performed over each pixel of the image, those steps are a good fit for GPU computation using CUDA. Other steps, performed over the node-array representation of the image tree [36], offer fewer opportunities for parallelization and are more efficiently computed by the CPU. This is the case for the stages devoted to obtaining the parents of each node by means of the union-find algorithm and to computing the node array (steps 4–5 of the pseudocode in Algorithm 1).
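As an illustration of this band-level distribution, the following minimal sketch (not the authors' code) assumes one MPI rank per spectral band: each rank builds the EP of its assigned band locally and rank 0 gathers the per-band profiles to assemble the extended EP used in the second stage. The function compute_extinction_profile() is a placeholder for the per-band processing of lines 1–18 of Algorithm 1.

#include <mpi.h>
#include <cstring>
#include <vector>

// Placeholder for the per-band EP construction (lines 1-18 of Algorithm 1).
// Here it simply replicates the band into every EP level so the sketch
// compiles and runs; the real code builds the opening and closing profiles.
static void compute_extinction_profile(const float* band, int n_pixels,
                                       int n_levels, float* ep_out) {
    for (int l = 0; l < 2 * n_levels + 1; ++l)
        std::memcpy(ep_out + static_cast<size_t>(l) * n_pixels, band,
                    sizeof(float) * n_pixels);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, n_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);   // one rank per band (five here)

    const int n_pixels = 1024 * 1024;          // example spatial size
    const int n_levels = 4;                    // EP levels per band
    std::vector<float> band(n_pixels, 0.0f);
    // ... load band number `rank` of the multispectral image into `band` ...

    // Opening and closing profiles plus the original band: 2*n_levels+1 images.
    std::vector<float> ep((2 * n_levels + 1) * static_cast<size_t>(n_pixels));
    compute_extinction_profile(band.data(), n_pixels, n_levels, ep.data());

    // Rank 0 gathers the per-band EPs into the extended EP (input of stage 2).
    std::vector<float> extended_ep;
    if (rank == 0) extended_ep.resize(static_cast<size_t>(n_ranks) * ep.size());
    MPI_Gather(ep.data(), static_cast<int>(ep.size()), MPI_FLOAT,
               extended_ep.data(), static_cast<int>(ep.size()), MPI_FLOAT,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}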

The calculation of the EP of a one-band gray-scale image consists of the application of several opening and closing operations at different granularity levels over the original image [36]. Thus, inside each node, different levels of the EP need to be computed. To perform this operation in the most efficient way, the node-array representation of the image is computed first (lines 2–5 in the pseudocode). This representation can be seen as a max-tree [36] computation where the parent of each node is obtained in a union-find process and stored in a structure together with the attribute of interest for each node, that is, the area in this case. Once this information is gathered, it is possible to compute the extinction values of each node as proposed in [36] (line 6). The steps corresponding to the application of the extinction filter for each selected number of extrema (lines 7–9 and 16–18 in Algorithm 1) can be computed in parallel through the use of OpenMP once the extinction values of the image have been obtained in the previous step, as they are independent computations for each level of the EP. Figure 2 illustrates the EP for one band of a multispectral image. The original band is in the middle, and the different components of the profile, corresponding to the result of applying a three-level EP, are represented on both sides.
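The per-level parallelism just described can be sketched as follows (an illustrative fragment, not the authors' code): once the extinction values are available, each level of the opening profile can be filtered independently, so the loop over levels is parallelized with OpenMP. The function apply_extinction_filter() is a placeholder for the filtering step of line 8 in Algorithm 1.

#include <algorithm>
#include <vector>

// Placeholder for the extinction filter of Algorithm 1: keeps the n_extrema
// maxima with the largest extinction values. Here it only copies the input
// band so that the sketch is self-contained.
static void apply_extinction_filter(const std::vector<float>& band,
                                    const std::vector<float>& extinction_values,
                                    int n_extrema, float* level_out) {
    (void)extinction_values; (void)n_extrema;
    std::copy(band.begin(), band.end(), level_out);
}

// Builds the opening profile of one band: one filtered image per EP level.
void build_opening_profile(const std::vector<float>& band,
                           const std::vector<float>& extinction_values,
                           const std::vector<int>& n_extrema_per_level,
                           std::vector<float>& profile) {
    const int n_levels = static_cast<int>(n_extrema_per_level.size());
    const size_t n_pixels = band.size();
    profile.resize(n_levels * n_pixels);
    // Each EP level is an independent filtering of the same band.
    #pragma omp parallel for schedule(dynamic)
    for (int level = 0; level < n_levels; ++level)
        apply_extinction_filter(band, extinction_values,
                                n_extrema_per_level[level],
                                profile.data() + level * n_pixels);
}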

It is worth noting that, for the computation of the closing profiles (lines 10–18), the same code is used but applied to the negated input image. This is a usual approach to this task that has been used, for instance, in [38]. The EPs for the different bands of the input multispectral image are concurrently computed in different nodes of the computing platform through the use of MPI, thus reducing the computation time. The opening and closing profiles individually computed for each band of the image are stored in memory, composing an extended EP. This EP will be the input of the second stage of the algorithm.
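For completeness, and using our own notation rather than the paper's, this duality can be expressed, for an image \(I\) with gray levels in \([0, L]\), as

\(\phi_{n}(I) = L - \gamma_{n}(L - I),\)

where \(\gamma_{n}\) denotes the extinction opening that preserves the \(n\) maxima with the largest extinction values and \(\phi_{n}\) the corresponding closing; in other words, the closing profile is obtained by running the opening code on the negated image and negating the result.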

Fig. 2
figure 2

EP computation for one band of a multispectral image (first row) and zoom over a small region (second row)

The second stage of the algorithm consists of the application of the anomaly detector, the RX algorithm, over the extended EP. As this stage (lines 19–22) presents a low computational load, it is efficiently computed in only one node, so the extended EP is assembled by gathering the individual EPs, as shown in Fig. 1. This part of the algorithm is performed individually over each pixel and, therefore, computed entirely on the GPU to achieve low execution times. The algorithm includes the steps for the parallel calculation of the RX anomaly detector [27] to obtain a one-band gray-scale image in which the higher the intensity of a pixel, the higher the probability of it being an anomaly. It is based on the calculation of the Mahalanobis distance between each pixel of the stacked EP and the average pixel value of the same stacked EP. All the steps are computed on the GPU using CUDA.
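A minimal CUDA sketch of the per-pixel distance computation is shown below (not the authors' kernel): each thread handles one pixel of the stacked EP, and the mean vector and the inverse of the covariance matrix are assumed to have already been computed in the preceding steps and copied to device memory.

// Each thread computes the Mahalanobis distance of one pixel of the stacked
// EP to the mean vector, using the precomputed inverse covariance matrix.
__global__ void rx_mahalanobis(const float* ep,      // n_pixels x n_features
                               const float* mean,    // n_features
                               const float* inv_cov, // n_features x n_features
                               float* rx_out,        // n_pixels
                               int n_pixels, int n_features) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pixels) return;

    float dist = 0.0f;
    for (int i = 0; i < n_features; ++i) {
        float di = ep[p * n_features + i] - mean[i];
        float acc = 0.0f;
        for (int j = 0; j < n_features; ++j) {
            float dj = ep[p * n_features + j] - mean[j];
            acc += inv_cov[i * n_features + j] * dj;
        }
        dist += di * acc;
    }
    rx_out[p] = dist;   // higher values indicate more anomalous pixels
}

// Example launch: one thread per pixel.
// rx_mahalanobis<<<(n_pixels + 255) / 256, 256>>>(d_ep, d_mean, d_inv_cov,
//                                                 d_rx, n_pixels, n_features);

Since the number of features (bands times EP levels) is small, each thread keeps its loops over the features and only the number of pixels drives the degree of parallelism.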

The third and final stage of the algorithm applies a thresholding technique to obtain a binary AD map (lines 23–26 in the algorithm), identifying each pixel as an anomaly or a non-anomaly. This is tackled with Otsu’s threshold algorithm, as it is an automatic thresholding technique that has been shown to produce the best discrimination with a low computational cost [46, 47]. This algorithm calculates the threshold based on the histogram of the gray levels of the image. This stage of the algorithm is also executed on the GPU, as both the histogram calculation step and the final binary decision over each pixel of the image can benefit from the highly parallel GPU architecture, thus achieving lower execution times.
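The threshold search itself can be illustrated with the following CPU sketch (not the authors' code; the paper computes the histogram and the final per-pixel decision on the GPU, and the 256-bin quantization used here is an assumption for illustration). Otsu's method selects the threshold that maximizes the between-class variance of the two resulting classes.

#include <cstdint>
#include <vector>

// Returns the gray level that maximizes the between-class variance
// w_b * w_f * (mu_b - mu_f)^2 computed from a 256-bin histogram.
int otsu_threshold(const std::vector<uint64_t>& hist /* 256 bins */) {
    uint64_t total = 0;
    double sum_all = 0.0;
    for (int i = 0; i < 256; ++i) {
        total += hist[i];
        sum_all += i * static_cast<double>(hist[i]);
    }

    double sum_b = 0.0, w_b = 0.0, best_var = -1.0;
    int best_t = 0;
    for (int t = 0; t < 256; ++t) {
        w_b += static_cast<double>(hist[t]);            // background weight
        if (w_b == 0.0) continue;
        double w_f = static_cast<double>(total) - w_b;  // foreground weight
        if (w_f == 0.0) break;
        sum_b += t * static_cast<double>(hist[t]);
        double mu_b = sum_b / w_b;                      // background mean level
        double mu_f = (sum_all - sum_b) / w_f;          // foreground mean level
        double var_between = w_b * w_f * (mu_b - mu_f) * (mu_b - mu_f);
        if (var_between > best_var) { best_var = var_between; best_t = t; }
    }
    return best_t;   // pixels above best_t are marked as anomalies
}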

Algorithm 1
figure a

Parallel RX+EP AD algorithm.

3.1 Comparison with other parallel AD implementations

The algorithm proposed in this paper aims to exploit the capabilities of the available hardware on a heterogeneous computing platform at different levels. This has been the methodology used by other parallel AD implementations in the literature, such as [45], where a dual quad-core Intel Xeon CPU and an NVIDIA GeForce 8800 Ultra GPU are used to reduce the execution times of an AD method for hyperspectral images based on the use of multivariate normal mixture models. Similarly, in [41], the authors conclude that embedded, GPU, and cloud architectures should be combined to achieve efficient processing of remote sensing data.

In particular, in our paper, the first stage of the algorithm, as described in Algorithm 1, exploits a multi-node implementation for extracting parallelism. The reason is that the EP calculation performed in this stage can be processed independently for each band. The implementation combines the use of MPI, OpenMP, and CUDA to efficiently distribute the computational load of this stage among the available nodes of a supercomputer. The same approach is applied, for instance, by [41], which introduces a multi-node cloud implementation of an AD algorithm based on the use of RX.

The second stage, the RX detector, is performed entirely on a single GPU using CUDA in our algorithm. Given that the amount of data to be processed is small in this stage, it would be difficult to compensate for the communication times required by a distributed architecture. This task has been tackled similarly in other parallel AD implementations such as [39, 41]. The same remarks apply to Otsu’s thresholding needed to obtain the final AD map.

Finally, when prioritizing the minimization of power consumption, the use of embedded devices has been established in the literature as the optimal alternative [40, 42]. Different parallel AD algorithms in the literature, for instance, [41] or [43], have been projected to FPGAs. In our paper, an NVIDIA Jetson platform [17] is used. Compared to an FPGA, this platform presents the advantage that the required implementation is more similar to that of general-purpose architectures.

4 Experimental results

This section summarizes the experiments carried out to validate the algorithm presented in this work for the AD of human-made constructions in fluvial ecosystems. First, the dataset and experimental setup selected for the experimentation are presented in Sects. 4.1 and 4.2. Then, the achieved accuracy and performance results are analyzed in Sect. 4.3.

4.1 Dataset

The experiments were conducted utilizing data captured by the MicaSense RedEdge multispectral sensor, onboard a specialized unmanned aerial vehicle (UAV). This advanced sensor is capable of capturing imagery across five distinct spectral bands: blue (475 nm), green (560 nm), red (668 nm), red-edge (717 nm), and near-infrared (NIR) (840 nm). The aerial images of fluvial ecosystems were taken during the summer months of 2018 in the region of Galicia, Spain, at an altitude of 120 m, offering a very high spatial resolution of 8.2 cm per pixel [20].

Figure 3 shows an RGB color composition and the reference data of anomalies available for the dataset used for the experiments. These images depict watershed ecosystems located in densely vegetated regions. Within this context, structures such as buildings, dams, and roads are categorized as anomalies necessitating detection to trigger alarms. These alarms, in turn, will be managed by individuals responsible for overseeing the ecosystem. It is important to note that anomalies within the reference data of each image represent a small fraction of the total pixel count, which aligns with typical scenarios. Furthermore, it is of paramount importance to emphasize that detecting all the anomalies in the reference data is crucial for this application, as missed alarms could provoke damage to the ecosystem.

Fig. 3
figure 3

Oitavén river dataset consisting of two multispectral images (z1 and z2). Anomalies in white color

Table 1 summarizes the main characteristics of the two considered 5-band multispectral images. As can be seen, anomalies account for around 4% of the total number of pixels in each image. The size of the images corresponds to the case where the pixel information is stored in a 4-byte format.

Table 1 Dataset description consisting of two multispectral images

4.2 Experimental setup

In this section, we describe the hardware and software configurations utilized for conducting the experiments in this paper. Two different hardware setups have been used in order to compare different approaches to solving this problem: several nodes of a distributed memory supercomputer and an NVIDIA Jetson computing platform.

First, a multi-node supercomputer with multiple GPUs per node, the FinisTerrae III supercomputer, is used to maximize the exploitation of parallelism at different levels. FinisTerrae III is located at the Galician Supercomputing Center (CESGA) [48] and consists of 354 nodes that are interconnected as shown in Fig. 4. As shown in Table 2, each node includes two Intel Xeon Ice Lake 8352Y processors with 32 cores each and 256 GB of memory. For the experiments, five nodes were used, as each node computes the EP over one band of the 5-band multispectral images captured by the MicaSense RedEdge multispectral sensor. In each of these nodes, the CUDA codes run on one of the NVIDIA A100 GPUs available in the node. This GPU model is based on the NVIDIA Ampere architecture and is equipped with 108 multiprocessors and 64 cores per multiprocessor, resulting in 6912 cores. The CUDA compute capability is 8.0, and each card has 40 GB of DRAM memory, as shown in Table 2.

Fig. 4
figure 4

FinisTerrae III distributed memory system

The previously described computing platform is too expensive to be available in the usual remote sensing environments, where decisions for many applications need to be made with short response times and far from supercomputing centers. A more affordable computing platform that has been considered is the NVIDIA Jetson AGX Orin, also described in Table 2. This platform was selected as a representative of edge computing devices, aiming for in-place real-time computation of the proposed algorithm. It also allows us to analyze the effect that different power availability has on the computation times of the scheme, as remote sensing applications usually require computation on energy-limited platforms.

Table 2 Hardware setup of the different computing platforms used for the experiments

The NVIDIA Jetson AGX Orin Developer Kit [17] used to evaluate the performance of the proposed algorithm on a mobile embedded system provides a 12-core Arm Cortex-A78AE v8.2 64-bit CPU together with a 2048-core NVIDIA Ampere architecture GPU. It also includes 64 GB of RAM. Besides, as shown in Table 3, the kit makes it possible to configure the hardware to operate within different power budgets ranging from 15 to 60 W, allowing us to simulate real edge computing scenarios. The varying energy consumption is achieved by disabling some hardware components, such as reducing the number of online CPU cores or disabling the GPU Texture Processor Cluster, and also by limiting the frequency of both the CPU and the GPU cores [49].

Table 3 Jetson AGX Orin power mode budgets

The FinisTerrae III codes have been compiled using g++ version 10.1.0 with OpenMP 4.0 support under Linux. Regarding the GPU implementation, the CUDA codes have been compiled using nvcc from version 12.2 of the CUDA toolkit. Version 4.1.4 of the OpenMPI library was used for the multi-node experiments. The Jetson codes have been compiled using g++ version 9.4.0 with OpenMP 4.0 support under Linux and nvcc from version 11.4 of the CUDA toolkit. The Thrust library was used to accelerate sorting operations.

4.3 Results

4.3.1 Accuracy assessment

In this section, the results in terms of AD accuracy are presented. For this purpose, two main metrics are considered: the area under the curve (AUC) and the percentage of anomalies detected (i.e., the true-positive rate). The numbers of true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) pixels are shown for completeness, together with the precision–recall AUC (PR-AUC). The AUC can be considered a standard metric in the literature for analyzing the quality of AD algorithms [50,51,52]. The percentage of anomalies detected will show how the proposed method improves the ability to detect anomalies compared to a traditional RX-based algorithm. We emphasize this metric because, as mentioned earlier, the primary objective of this study is to detect as many anomalies as possible, even if it leads to a higher false-positive rate. It is worth noting that the PR-AUC achieves higher values when both false-positive and false-negative rates are low, whereas the main purpose of this work is to increase the number of anomalies detected. Tests were performed to check that the parallel versions of the algorithm yield the same accuracy values as the sequential one.
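As an illustration of how these quantities relate to the binary AD map, the following sketch (not the authors' evaluation code) counts TP, FP, FN, and TN pixels against the reference map and derives the true-positive rate, that is, the percentage of anomalies detected:

#include <cstddef>
#include <cstdint>

struct Confusion { std::size_t tp = 0, fp = 0, fn = 0, tn = 0; };

// Compares the binary detection map with the binary reference map.
Confusion confusion_counts(const std::uint8_t* detection,
                           const std::uint8_t* reference,
                           std::size_t n_pixels) {
    Confusion c;
    for (std::size_t i = 0; i < n_pixels; ++i) {
        const bool det = detection[i] != 0;
        const bool ref = reference[i] != 0;
        if (det && ref)       ++c.tp;
        else if (det && !ref) ++c.fp;
        else if (!det && ref) ++c.fn;
        else                  ++c.tn;
    }
    return c;
}

// True-positive rate (percentage of anomalies detected) = TP / (TP + FN).
double true_positive_rate(const Confusion& c) {
    return static_cast<double>(c.tp) / static_cast<double>(c.tp + c.fn);
}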

Table 4 shows the accuracy values obtained for the considered dataset with different EP configurations, whereas Fig. 6 shows the corresponding AD maps obtained for both images. The EP parameters, indicated in parentheses in the table, have been chosen empirically for each image. Each value indicates the number of extrema to be kept for each level of the EP calculated for each band of the image. As can be seen in the table, varying the EP configuration parameters improves the anomaly detection by up to 23% and 27% for the z1 and z2 images, respectively, with respect to the case without the application of EP (first row in the table for each image). This also results in an improvement of the AUC for both considered images, even though the number of FP is larger when the EPs are introduced. Nevertheless, this increase is much smaller than that of the TP, being about 5% for both the z1 and z2 images.

Table 4 Accuracy of the parallel AD algorithm for the z1 and z2 multispectral images. Results for different configurations of the EP stage. The best results for each column and image are highlighted in bold

Figure 5 shows the ROC curves obtained with the proposed method for different configurations of the EP stage of the algorithm, which produces the extended EP of the image. It can be seen that a proper parameterization of the EP is relevant to achieving the best results. Nevertheless, in general, almost every parameterization improves the accuracies obtained without the application of EPs. It is also worth noting that, as expected, increasing the ability to detect the highest possible number of anomalies (high TP values), which is our main objective for the ecosystem supervision task studied in this work, also involves an increase in FP.

Fig. 5
figure 5

ROC curves for z1 and z2 images with the parallel RX+EP algorithm for AD and for different EP configurations

The AD maps obtained for both images are shown in Fig. 6. In the images, TP pixels are colored in green, FP in blue, FN in red, and TN remain black as background. As can be seen, the areas colored in red decrease greatly when the use of EPs is combined with the RX, meaning that the capability of the algorithm to detect real anomalies is greater in these cases. It can also be seen that the blue areas that appear do not have regular shapes and have the appearance of irregular structures. This will make it easier for these areas to be disregarded later by applying some automatic post-processing technique.

Fig. 6
figure 6

AD maps for z1 and z2 images with the parallel RX+EP algorithm

4.3.2 Performance results

The execution times achieved for the different hardware setups introduced in Sect. 4.2 are summarized in this section. All data shown here correspond to average values over 10 independent runs with the EP configurations that achieve the highest accuracy values, i.e., RX+EP(64,8,4,2) for the z1 image and RX+EP(32,16,2,1) for the z2 image. Table 5 shows the time needed to perform the computation for the z1 and z2 images in the FinisTerrae III setup, using one and five nodes (one per band of the image), respectively. An alternative implementation where all the processing is performed on the CPU is included as a baseline. The speedup of the different versions with respect to the CPU baseline is also included. As can be seen, the most expensive computation steps are Get parents and Node array, which have to sequentially walk through the image to obtain the max-tree representation that is then used to obtain the extinction values. These steps are even more costly, in relative terms, in the CPU-GPU version, as they cannot be accelerated with the GPU. On the other hand, the Negate image and RX: Mahalanobis steps are the ones that benefit the most from the use of GPUs, achieving speedups of up to 469\(\times\) and 169\(\times\), respectively.

Therefore, given the few parallelization opportunities available inside the EP computation for each band of the image, it becomes crucial to exploit the parallelism that can be achieved with the use of a multi-node hardware platform such as the FinisTerrae III. As shown in Table 5, the EP computation stage of the z1 image can be accelerated from 8.3 s to just 1.4 s when all the bands are computed in parallel in different nodes. Similarly, the EP computation stage of the z2 image takes 2.1 s in the single-node CPU version and only 0.36 s in the multi-node CPU-GPU one. This allows the total speedup of the RX+EP AD algorithm, as compared with the CPU version, to increase from 6.9\(\times\) in the single-node configuration to 23\(\times\) in the multi-node configuration for the z1 image and from 4.3\(\times\) to 8.1\(\times\), in the same configurations, for the z2 image.

Table 5 Execution times, in seconds, and speedups for z1 and z2 images for the RX+EP(64,8,4,2) and RX+EP(32,16,2,1) configurations, respectively, on FinisTerrae III. Times include CPU processing, network communication costs, and GPU computations

In order to illustrate the performance of the algorithm for different image sizes, additional experiments have been carried out using the multi-node CPU-GPU version of the RX+EP algorithm executed on the FinisTerrae III supercomputer. Figure 7 shows the time needed to execute the algorithm for three different sizes: the original z1 image, an image with the same number of spectral bands but 2\(\times\) larger in the spatial dimension, and an image 4\(\times\) larger in the spatial dimension. The time is shown separately for the EP stage and for the RX and Otsu stages. As can be seen, the execution time of the algorithm scales nearly linearly with the image size. Nevertheless, for the 4\(\times\) image, it starts to become noticeable that the spatial part of the processing, the EP, loses some performance, as it needs 4.2\(\times\) the time required for the original image. On the other hand, the RX detector achieves better performance (it only needs 3.69\(\times\) the time of the original image). This can be explained because the spectral part is more suitable for pixel-level GPU parallelism, and the larger the number of pixels to process, the greater the opportunities for exploiting the high number of processing cores available in the GPU.

Fig. 7
figure 7

Execution time scaling when the size of the z1 image is increased. The times are expressed as multiples of the execution time for the original z1 image. Times include CPU processing, network communication costs, and GPU computations

Regarding the edge computing Jetson platform introduced in Sect. 4.2, experiments have been carried out with three different power budgets (15 W, 30 W, and MAXN (60 W)). The findings previously discussed for the FinisTerrae III platform remain valid for this platform: the Get parents and Node array steps of the EP computation are still the most time-consuming, and the speedups achieved when the available GPU is exploited also remain similar. Nevertheless, as expected, the more limited the selected power budget, the longer the execution times for the same computations.

The most parallelizable steps are those that benefit the most from the higher power budgets, as they can exploit both the larger number of available cores and the increased frequency of the CPU and GPU. In this way, the RX: mean and RX: Mahalanobis steps are the ones with the largest increase in speedup when the CPU version is compared with the CPU-GPU one. The speedup increase on these steps ranges from 4\(\times\)–10\(\times\) for the 15 W power budget up to 9.7\(\times\)–17.7\(\times\) for the 60 W power budget.

As shown in Table 6, the reduction in the available power and the disabling of some hardware components, which reduce the parallel computing opportunities for the algorithm, greatly increase the time needed for the computation, by up to 7.1\(\times\). This makes it clear that the power requirements must be carefully chosen in an edge computing environment to achieve the right balance between autonomy and performance.

Table 6 Summary of execution times, in seconds, and speedups, with respect to the 15 W version, for different versions of the algorithm on the Jetson AGX Orin. Times include CPU and GPU processing

5 Conclusions

This paper introduces a computationally efficient parallel algorithm for AD. The algorithm is specifically designed to run on heterogeneous computing platforms, comprising nodes with multi-core CPUs and GPUs. AD is accomplished through a combination of an extended extinction profile for spatial information extraction and a detector known as the RX algorithm.

The resulting parallel hybrid MPI+OpenMP+CUDA algorithm outperforms a traditional RX approach, detecting up to 27% more anomalies in the images presented in this paper. Experiments were conducted using the FinisTerrae III multi-node supercomputer, analyzing high-resolution multispectral images of fluvial ecosystems. Speedups of up to 23 \(\times\) were achieved.

Furthermore, the same algorithm was executed on a mobile embedded system, specifically an NVIDIA Jetson. This aimed to assess the feasibility of running the algorithm under various power consumption limitations. Experiments have shown an increase of up to 7.1 \(\times\) in execution time when the power consumption is limited to 15 W, compared to the situation with a limit of 60 W.

A challenge for the future would be addressing workload imbalances when processing images, such as hyperspectral ones, with a higher number of spectral bands. In such cases, it may not be practical to allocate as many nodes as there are bands in the image. Additionally, adapting the implementation to work with series of images covering more extensive spatial areas would allow the application of the algorithm to a wide variety of remote sensing applications.