Introduction

Modern contributions in hyperspectral image (HSI) processing, including recent Remote Sensing (RS) image acquisition expeditions, have promoted the exploitation of the information contained in such data [1]. HSI data offers promising insights to characterize the Earth's surface through spectrometer sampling. Hence, the composition of ground materials is determined over hundreds of spectral bands by the representation of distinct features such as spectrum, shape, or texture [2]. Thus, HSI scenes exhibit, pixel by pixel, the properties of the materials over the data cube \(\textbf{X} \in \mathbb {R}^{H\times W\times N}\), where \(H,\,W\), and N are the height, width, and channel dimensions, respectively. The data can also be represented in matrix form as \(\textbf{X} \in \mathbb {R}^{M\times N}\), where M is the number of pixel vectors. Accordingly, each pixel has a spectral signature represented as an N-dimensional array, which depends on the number of bands. Generally, a larger number of bands represents the pixel materials more accurately. In this context, remotely sensed images provide rich spectral information that can be applied to diverse applications such as agriculture [3] or urban planning [4], using different classification approaches [5], e.g., land cover classification [6].

The overwhelming detail within the spectral dimension entails a heavy computational demand for its processing. In this regard, classifiers based on Deep Neural Networks (DNNs) take advantage of this information to determine pixel classes. Nevertheless, a large spectral dimension complicates class prediction, increasing classifier complexity. Classifiers must address a remarkable number of features to select and extract the best representation of each class. This feature selection step is determined through a training procedure by matching network predictions with real labeled data. Hence, noteworthy memory space is required to handle storage requirements. To address these needs, many efforts have been invested in the development of efficient training techniques. For instance, Wu et al. [7] proposed parallel techniques to optimize classification on multi-core devices. These techniques have been brought to distributed High Performance Computing (HPC) environments to efficiently tackle the high computational demands of RS image analysis [8]. Also, dedicated grid computing platforms [9] have explored similar satellite image processing applications. Alternatively, Cloud Computing (CC) environments provide a powerful centralized infrastructure with high performance capabilities. In this context, CC brings a service-oriented environment over a cluster with multiple machines and massive storage. Machines are launched through virtualization protocols, offering ease of use and effective distributed solutions. An additional advantage of CC is the potential to increase the computing resources offered by providers such as Amazon Web Services (AWS) [10] or Microsoft Azure [11]. This facilitates the adjustment of the resources needed to meet the computing requirements.

Workload parallelization approaches aim to reduce the training time of classification algorithms. In this regard, determining the classifier that is best suited for research based on its behavior and performance is highly advantageous. For instance, the Multinomial Logistic Regression (MLR) classifier predicts the probability of category membership according to independent observations. Indeed, MLR obtains noteworthy accuracy with a low number of training samples. However, processing large HSI data cubes degrades its performance. Instead, the Support Vector Machine (SVM) has been proven to outperform other classifiers in the literature across several applications. It relies on pattern recognition techniques to exploit optimal decision boundaries [12], showing robust behaviour and impressive accuracy when handling large HSI data [13]. Nevertheless, SVM poses high computational requirements, and therefore faces significant runtimes for large-scale data processing. The work [14] proposed multiple distributed solutions to speed up computations. Likewise, diverse techniques have been proposed in the literature to reduce both data complexity and processing time. In this sense, the work [15] studies decomposition algorithms to split the problem into iterative sub-problems. Similarly, the work [16] analyzes the well-known Quadratic Programming (QP) problem by splitting the algorithm into computationally lightweight QP sub-problems. Furthermore, workload balancing techniques effectively manage large computational demands by distributing the calculations across resources. In this context, the work [17] used a static workload balance to perform data-parallel training of DNNs by considering resource speeds, whilst the model-parallelism approach in [18] proposed a heterogeneous partitioning over the filters of the network layers. As a consequence of the heterogeneous partitioning, gradient calculations are studied and resolved in [19, 20].

Given the parallelizable nature of workload distribution approaches, the CC infrastructure is suitable for this purpose. Cloud environments are composed of fast computational devices such as Graphics Processing Units (GPUs) and last-generation Central Processing Units (CPUs) to handle high workloads. The open-source distributed framework Apache Hadoop [21] provides reliability and scalability for computationally expensive calculations, such as aggregations or queries over SQL databases. Additionally, Apache Spark [22] is a multi-pass engine designed to process large-scale data in a fault-tolerant environment. Spark implements Resilient Distributed Datasets (RDDs) for parallel operations whilst enabling data collections to be shared between resources. An essential feature of Spark is its design for data science and data abstraction. In this regard, Spark provides fast MapReduce [7] operations, since it performs the processing in the main memory of the worker nodes, thus preventing unnecessary disk operations. Meanwhile, Hadoop MapReduce writes to disk after each mapping or reduction operation. In addition, the Spark engine uses a fast Directed Acyclic Graph (DAG) that defines the scheme of operations to be performed. Both Spark and Hadoop are deployed on top of an OpenStack environment, i.e., a modular cloud system that manages collections of computing, storage, and network resources. These resources are managed through user-friendly APIs to provide integration and security. The OpenStack infrastructure provides IaaS functionality through three main shared services: (i) a networking service to avoid network bottlenecks and provide self-service network configurations with more control and flexibility for the users; (ii) an object and block storage service for cost-effective and scalable distributed storage, and (iii) a computing service to provide on-demand computing resources by deploying virtual machines. The compute architecture is accessible via web interfaces and designed to support scaling. In summary, the capabilities of OpenStack leverage the management of the physical hardware and create a solid foundation for the deployment of virtual machines. The proposed workflow design is shown in Fig. 1.

Fig. 1 Cloud-based integration of OpenStack at the hardware level and Apache Spark as the cloud software layer

Furthermore, the progression of the CC paradigm is intricately intertwined with advances in machine learning techniques and the increasing complexity of data. This connection highlights the synergy between the evolving capabilities of CC and the intricate processing demands posed by hyperspectral imaging data. Consequently, cloud environments emerge as an ideal platform for parallelizing the computation required by advanced classification techniques.

Recent Advances in Remote Sensing Classification

Convolutional neural networks (CNN) have emerged as a cornerstone for visual data processing. Such advances have led to the creation of novel techniques and architectures that enhance feature extraction capabilities [23, 24].

The work [25] delves into the intricacies of 3D-CNNs, introducing the innovative concept of a Hybrid Spectral CNN (HybridSN). This model addresses the inherent complexity of 3D-CNNs by streamlining the operation through spatial-spectral feature representation. It offers a unique solution for handling volumetric HSI data without compromising the discrimination of spectral features. Additionally, recent breakthroughs in HSI classification [26] have leveraged the power of vision transformer (ViT) models. The work [27] introduces a morphological ViT approach that combines spectral and spatial information through feature fusion. ViT models showcase remarkable classification capabilities but introduce the challenge of heightened network complexity. This prompts the pursuit of a balance between optimizing performance and maintaining model simplicity. In this context, the research [28] aims to diminish the complexity of traditional HSI methodologies without increasing the number of network parameters. The method excels in detecting both geometric and spectral changes across the data. This integration of spectral, spatial, and morphological information within CNN and ViT frameworks has opened exciting avenues for efficient and accurate methodologies.

Additionally, research on synthetic aperture radar (SAR) data has presented noteworthy contributions in recent years. The work [29] proposes a scattering vector model with a roll-invariant method to effectively capture both coherent and partially coherent target scattering, leveraging unique target characteristics to enhance wetland classification. The study [30] introduces a three-component decomposition approach using SAR data acquired in Oberpfaffenhofen, Germany. This approach effectively distinguishes between adjacent urban and forested areas by examining differences in the scattering mechanisms. Additionally, a recent study [31] explores scattering decomposition for polarimetric SAR data in agricultural landscapes within a region of Canada. The core idea revolves around the utilization of temporal changes in scattering mechanisms to facilitate urban change detection, adding a dynamic element to SAR-based classification techniques. Alternatively, this data can be input into multimodal methods to integrate various data types [32].

Motivation and Challenges

Leveraging hyperspectral imaging within a cloud-based framework provides a valuable asset for disaster response initiatives. As previously introduced, the increasing volume of HSI data, collected through innumerable missions using satellite or airborne platforms, presents a significant convergence point for rapid and efficient processing within cloud-based environments. As a consequence, the computational capabilities of CC play a pivotal role in this context, enabling fast decision-making during critical real-world situations. This facilitates the timely extraction of valuable insights, expediting the assessment of disaster impact and the initiation of responsive measures. As an example, the Chamoli disaster in the Indian Himalaya [33] vividly illustrates the far-reaching implications of a major natural catastrophe, particularly for the rapid expansion of hydropower infrastructure into increasingly precarious areas. In this scenario, the integration of hyperspectral data and machine learning models is of utmost importance. Also, in the post-disaster phase, the integration of HSI and CC serves multiple purposes, such as damage assessment, resource allocation, and recovery efforts. Another example is the time-effective detection of submerged kelp in the subtidal zones of Helgoland, Germany [34]. From these outcomes, the conclusion is straightforward: the synergy between these concepts is instrumental in risk assessment and early warning systems, as it empowers the detection of surface anomalies or change detection [35].

However, the surge in data volume is accompanied by the rich spectral information contained within the datasets and by continuous improvements in data acquisition devices. Consequently, the implementation of distributed algorithms on scalable platforms becomes imperative to address such increasing complexity. In addition to these findings, evident benefits arise from the utilization of CC. Three key advantages stand out: firstly, the inherent scalability of cloud environments; secondly, the accessibility, which allows for rapid system deployment; and lastly, the cost-effectiveness of these platforms, which operate on a pay-as-you-go model. This approach eliminates the need for organizations to manage and upgrade their own data centers, thereby reducing economic expenses.

This work is motivated by the aforementioned challenges and the absence of scalable cloud solutions capable of handling large datasets on a time-constrained basis. Also, the rich information included in HSI data underscores the promising potential of employing SVMs for data processing. SVM properties, such as dimensionality reduction while preserving critical information, robustness to noise, and the ability to identify non-linear relationships arising from the intricate interplay between spectral bands, establish it as a highly suitable option. Additionally, the SVM's track record of high performance in environmental monitoring and agricultural applications emphasizes its relevance for disaster monitoring [36].

Contributions of this Work

This study explores the potential of harnessing CC architectures to establish a distributed framework for the processing of extensive hyperspectral imagery. For this purpose, a novel and adaptable SVM implementation using Apache Spark is proposed. The proposal seeks to accelerate data processing while maintaining performance levels comparable to standard implementations. Also, the proposed framework efficiently handles an expanding number of cloud workers, ensuring scalability.

To substantiate the effectiveness of the proposal, an in-depth analysis is conducted using well-known HSI datasets from existing literature. This analysis encompasses multiple node distributions to showcase the versatility of the methodology in real-world scenarios. The experimentation with these datasets, characterized by diverse properties and sizes, yields invaluable insights. The adaptability and scalability of the approach across different datasets emphasize its potential to address the distinct requirements and challenges posed by a range of disaster scenarios.

The remainder of the paper follows this structure. The "Related Work" section discusses related work on parallelization methodologies. The "Background of the SVM Approach" section delves into the functionality of the SVM algorithm. The "Experimentation Analysis" section presents the research findings in terms of scalability and classification performance. Lastly, the "Conclusions" section formally outlines the conclusions drawn from this study.

Related Work

In the literature, various methods have been investigated to handle and parallelize the vast amount of data. For instance, distributed image processing has been explored to identify the challenges inherent in Hadoop and assess the potential of this approach in future remote-sensing cloud computing systems [37]. Other studies have concentrated on spectral clustering for mining large datasets [38] and parallelization techniques for complex neural networks [39]. As such, the current research emphasizes the processing of hyperspectral imagery in cloud environments using machine learning techniques. The high computational demands of processing high-dimensional HSI data have been addressed in recent works using parallel cloud designs. Next, an overview of noteworthy distributed algorithm implementations from the literature is provided. Three main algorithms are studied: (i) DNNs based on Auto-Encoders (AE), (ii) Multinomial Logistic Regression (MLR), and (iii) Principal Component Analysis (PCA).

Distributed Auto-Encoder In [40], a distributed DNN is proposed by exploiting the Apache Spark computation engine. The proposal implements a stacked Auto-Encoder (AE) to conduct HSI dimensionality reduction. Specifically, the work performs an optimization for a fully connected architecture based on the Multilayer Perceptron (MLP) network. The computation process for the i-th worker is defined layer by layer as \(\left[ \textbf{x}^{(l)}_k = \delta \left( \textbf{x}^{(l-1)}_k \textbf{W}^{(l)} + b^{(l)} \right) \right] ^{(i)},\) where \(\textbf{x}^{(l)}_k\) is the output data representation of the k-th vector sample \(\textbf{x}_k\in \mathbb {R}^N\) in the space defined by the l-th layer, i.e., the transformation function applied to the input data \(\textbf{x}^{(l-1)}_k\) through the current set of weights \(\textbf{W}^{(l)}\) that interconnects neurons from the previous \(l-1\) and current l layers, \(\forall l \in [1, L]\). Finally, the bias \(b^{(l)}\) is added and a non-linear activation function \(\delta\) is applied. In this context, the weights \({\textbf {W}}^{(l)}\) determine the pixel output responses based on the input samples. Then, the training error is obtained in each worker using the MSE loss function by comparing the prediction \(\textbf{x}^{(L)}_k\) with the real pixel observation \(\textbf{x}_k\). Hence, the global error \(\varepsilon\) is obtained for a specific training iteration t as the average of the disjoint errors: \(\varepsilon _t = \frac{1}{I} \sum _{i=1}^{I} \left[ \frac{1}{M} \sum _{k=1}^{M} \mid \mid \textbf{x}_k^{(L)} - \textbf{x}_k\mid \mid ^2 \right] ^{(i)}\). Then, local gradients \({\textbf {g}}^{(i)}_t\) are calculated in each worker and reduced to obtain the global gradient \({\textbf {G}}_t = \frac{1}{I} \sum _{i=1}^I {\textbf {g}}_t^{(i)}\). As a result, the computation performance of the DNN is boosted for the processing of larger data sets.
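To make the reduction explicit, a minimal Spark sketch of the gradient averaging \({\textbf {G}}_t = \frac{1}{I} \sum _{i=1}^I {\textbf {g}}_t^{(i)}\) is shown below; the localGradient callback stands in for the stacked-AE backpropagation of [40], which is not reproduced here, and all identifiers are illustrative. Averaging over samples is equivalent to averaging per-worker means when partitions are equally sized.

```scala
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Hedged sketch: accumulate local gradients per partition and average them
// on the driver. `localGradient` is a placeholder for the AE backward pass.
def averageGradients(samples: RDD[DenseVector[Double]],
                     localGradient: DenseVector[Double] => DenseVector[Double],
                     dim: Int): DenseVector[Double] = {
  val (sum, count) = samples.treeAggregate((DenseVector.zeros[Double](dim), 0L))(
    { case ((acc, n), x) => (acc + localGradient(x), n + 1) },   // worker-local
    { case ((a1, n1), (a2, n2)) => (a1 + a2, n1 + n2) }          // reduction
  )
  sum / count.toDouble // global gradient G_t
}
```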

Distributed Multinomial Logistic Regression A distributed approach to the Multinomial Logistic Regression (MLR) is implemented in [41] using Spark. The method calculates the fitting probability of a data sample \(\textbf{x}_k\) for a specific class d using a linear prediction function \(\left[ f(\textbf{x}_{k},y_k=d)\right] ^{(i)}\). Therefore, a probability score is calculated using a vector of logistic regressors corresponding to each class \(d \in [1,D]\), denoted as \(\mathbf {\omega }(d) = [\omega _0, \omega _1, \dots , \omega _M]\), where M is the number of linear or non-linear functions defining the features of the input sample. Hence, \(\mathbf {\Omega } = [\omega (1), \omega (2), \dots , \omega (D)]\) collects the regressors for all classes. The objective is to estimate the regressors from the input training set \(\mathcal {D}=\{\textbf{X}, \textbf{Y}\}\), where \({\textbf {X}}\) and \({\textbf {Y}}\) are the training samples and their respective class labels. Therefore, the MLR probability score in a distributed environment is considered as a set of independent binary regressions for each worker. These individual calculations are performed by choosing a pivot label and regressing the rest against the pivot. Hence, the probability of obtaining the pivot class in the i-th worker is calculated as: \(\left[ p(\textbf{x}_k,d) = \left( 1+\sum _{d=1}^{D}\exp \left( \mathbf {\omega }(d) \cdot \textbf{x}_k \right) \right) ^{-1} \right] ^{(i)}\). Lastly, errors and gradients are determined individually for each worker. In this regard, the master node calculates the overall values of the losses and gradients to obtain the optimal \(\mathbf {\Omega }^{*}\).
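For illustration, the pivot-class probability above can be computed as follows; the regressor matrix omega (one row per class) and all identifiers are ours, not taken from [41].

```scala
import breeze.linalg.{sum, DenseMatrix, DenseVector}
import breeze.numerics.exp

// Illustrative only: p = (1 + sum_d exp(omega(d) . x_k))^{-1}, the pivot-class
// probability under the pivot-based binary-regression formulation.
def pivotProbability(omega: DenseMatrix[Double], x: DenseVector[Double]): Double = {
  val scores = omega * x // one dot product omega(d) . x per class d
  1.0 / (1.0 + sum(exp(scores)))
}
```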

Distributed Principal Component Analysis A distributed implementation based on CC platforms is presented in [7]. The algorithm transforms the data into linearly uncorrelated variables, i.e., principal components. The number of components is equal to or less than the number of original image features. In this regard, the component values are defined by their importance in the processed data. In the context of HSI analysis, this feature representation removes spectral variances. However, the computation requirements increase substantially due to the expensive calculations involved. The distributed proposal divides the required matrix multiplications by rows between the workers, ensuring no correlation among the obtained pixel vectors. As an approximation of the algorithm performance, these calculations for an example data matrix \(\textbf{U}^{(i)}\) are determined as \(\textbf{U}^{(i)}= \left[ \sum _{k=1}^{M} (\textbf{x}_k^{\top } \times \textbf{x}_k)\right] ^{(i)}\). Subsequently, the master node performs the pixel sum of the rows assigned to each worker, and the eigen-decomposition extracts the eigenvalues and eigenvectors \(\textbf{V}^{(i)}\). After that, the data is broadcast to the workers. In this sense, the algorithm conducts a spatial-domain partitioning. Hence, the pixel information (i.e., the spectral signature) is stored in the same node, where each node deploys two different worker instances (\(R=2\)).
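A hedged sketch of the per-worker accumulation \(\textbf{U}^{(i)}\) follows: each Spark partition builds a partial \(N\times N\) outer-product sum that is reduced at the master before the eigen-decomposition. The use of Breeze dense types and all names are assumptions on our part.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Sketch of U^(i) = sum_k x_k^T x_k computed per partition, then summed
// at the driver, mirroring the row-wise split described above.
def gramMatrix(pixels: RDD[DenseVector[Double]], numBands: Int): DenseMatrix[Double] = {
  pixels.mapPartitions { it =>
    val acc = DenseMatrix.zeros[Double](numBands, numBands)
    it.foreach(x => acc += x * x.t) // outer product of each pixel vector
    Iterator(acc)
  }.reduce(_ + _) // master-side reduction across workers
}
```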

In addition, several data distribution techniques [42] and data complexity reduction techniques have been proposed for HSI data processing. For instance, Kang et al. [43] maintained rich spectral information whilst reducing its dimensionality. Also, feature extraction methods [44] focus on obtaining discriminative features that minimize both intraclass variation and interclass similarity. Computation requirements have been studied in [45, 46].

Background of the SVM Approach

Sequential Algorithm Description

The SVM is a supervised classifier. Let the input data \(\textbf{X} \in \mathbb {R}^{M\times N}\) contain the pixel information by rows, i.e., \(\textbf{x}_k = [x_{k,1}, \dots , x_{k,N}], \forall k \in [1,M]\). Pixels are defined by the corresponding class labels, \(y_k\in \{0,1\}\) for binary classification or \(y_k\in \{1,\dots ,D\}\) for multiclass classification, considering one-against-one or one-against-all approaches, where the former exhibits robust behavior when classifying unbalanced data sets, whilst the latter is computationally less expensive.

Fig. 2 Representation of the SVM algorithm illustrating two distinct classes, aiming to maximize the margin \(\mathcal {M}\) between the classes

In the N-dimensional space, the SVM defines an optimal hyperplane as an affine subspace of \(N-1\) dimensions that separates the samples into positive and negative classes by maximizing the margin between them. Thus, the classifier is the discriminant function \(f(\textbf{x}) = \textbf{w} \cdot \textbf{x} + b = \pm 1\), where \(\textbf{w}\in \mathbb {R}^N=[w_1, \dots , w_N]\) is the normal vector to the hyperplane and the bias b offsets the hyperplane from the origin. Hence, the distance to the hyperplane is used to find the optimal hyperplane \(f(\textbf{x}) = \textbf{w} \cdot \textbf{x} + b = 0\) with the aim of maximizing the margin \(\mathcal {M} = \frac{2}{||{\textbf {w}}||}\). The complete functionality is shown in Fig. 2.

Considering that non-linearly separable data will be classified, the soft-margin approach is adopted, and the convex quadratic SVM optimization problem is described by the following soft-margin formulation:

$$\begin{aligned} \min _{\textbf{w},b,\zeta } \underbrace{ \frac{1}{2} \mid \mid \textbf{w} \mid \mid ^2 + C\sum _{k=1}^{M} \zeta _k }_{\text {function}\,\, f(\textbf{w},\zeta )}, \quad \text {s.t.} \quad \underbrace{ y_k\left( \textbf{w}\cdot \textbf{x}_k + b\right) \ge 1-\zeta _k }_{\text {functions}\,\, g_k(\textbf{w},b,\zeta )}, \quad \zeta _k \ge 0, \quad k=1,\dots ,M, \end{aligned}$$
(1)

where the slack variables \(\zeta _k\) provide some flexibility to the model, since the constraint can be satisfied even when a sample does not meet the original margin requirement. At the same time, the sum of all slack variables penalizes the selection of too-large error margins, while C controls the impact of the soft margin. Particularly, this work sets the slack variables as the hinge loss \(\zeta _k=\max (0,1-y_kf(\textbf{x}_k))\) (L1-SVM), while C is set by grid search with cross-validation.
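For concreteness, a minimal sketch of the hinge loss and a subgradient with respect to \(\textbf{w}\) is given below, assuming \(\pm 1\) labels; this per-sample quantity is what the distributed optimizer of the following sections aggregates. Identifiers are illustrative.

```scala
import breeze.linalg.DenseVector

// Sketch of zeta_k = max(0, 1 - y_k f(x_k)) and a subgradient of the hinge
// term w.r.t. w. Labels are assumed to be +1/-1.
def hingeLossAndGradient(w: DenseVector[Double], b: Double,
                         x: DenseVector[Double], y: Double)
    : (Double, DenseVector[Double]) = {
  val margin = y * ((w dot x) + b)
  if (margin >= 1.0) (0.0, DenseVector.zeros[Double](w.length)) // no violation
  else (1.0 - margin, x * (-y))                                 // hinge subgradient
}
```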

Lagrange multipliers \(\alpha\) are used to optimize the \(f(\textbf{w},\zeta )\) function subject to the M inequality constraints \(g_k(\textbf{w},b,\zeta )\) of Eq. (1), introducing the Lagrangian function \(\mathcal {L}(\textbf{w}, b, \alpha ,\zeta )=f(\textbf{w},\zeta )-\sum _{k=1}^{M}\alpha _k g_k(\textbf{w},b,\zeta )\). The minimum of f is found when its gradient points in the same direction as the gradient of g. In this regard, and using the duality principle, the Lagrangian problem \(\min _{\textbf{w},b,\zeta } \max _{\alpha } \mathcal {L}(\textbf{w},b,\alpha ,\zeta ), \,\,\text{ s.t. } \,\,\zeta _k\ge 0\) should be optimized. This problem can be rewritten into the dual Wolfe problem as:

$$\begin{aligned} W(\alpha )= \sum _{k=1}^{M}\alpha _k - \frac{1}{2} \sum _{k=1}^{M}\sum _{j=1}^{M} \alpha _k\alpha _jy_ky_j\textbf{x}_k\cdot \textbf{x}_j, \end{aligned}$$
(2)

which is optimized as \(\max _{\alpha } W(\alpha ), \,\,\text{ s.t. }\,0\le \alpha _k\le C, \,\,\sum _{k=1}^{M}\alpha _k y_k=0, \,\,k=1,\dots M\). Indeed, the dual Wolfe problem is a standard quadratic programming problem, and can be solved with a QP solver.

At this point, the normal vector is obtained as \(\textbf{w}=\sum _{k}^{M}\alpha _ky_k\textbf{x}_k\), where the data points \(\textbf{x}_k\) corresponding to nonzero \(\alpha _k\) are the support vectors, whilst the bias is obtained as the average of S support vectors, \(b=\frac{1}{S}\sum _{s=1}^S\left( y_s-\textbf{w}\cdot \textbf{x}_s\right)\), providing a more stable solution.
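The recovery of the primal solution from the dual one can be sketched as follows; alpha is assumed to come from a QP solver, and the threshold eps used to detect nonzero multipliers is an illustrative choice.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Sketch: w = sum_k alpha_k y_k x_k over the support vectors (alpha_k > eps),
// and b averaged over the S support vectors for a more stable solution.
def primalFromDual(X: DenseMatrix[Double], y: DenseVector[Double],
                   alpha: DenseVector[Double], eps: Double = 1e-8)
    : (DenseVector[Double], Double) = {
  val w = DenseVector.zeros[Double](X.cols)
  val sv = (0 until X.rows).filter(k => alpha(k) > eps) // support vector rows
  sv.foreach(k => w += X(k, ::).t * (alpha(k) * y(k)))
  val b = sv.map(s => y(s) - (w dot X(s, ::).t)).sum / sv.size
  (w, b)
}
```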

Proposed Cloud-Based SVM

The implementation of the proposed cloud linear-SVM was done using the Scala programming language. Moreover, the proposal implements a soft-margin "one-vs-all" multi-class strategy for D classes. In this regard, the SVM is implemented in a distributed fashion to fully exploit the potential of a scalable architecture. The Apache Hadoop open-source software framework enables the distributed processing of large data sets on clusters of computers, and provides a fault-tolerant base to process concurrent computing tasks. In particular, the HSI data is stored in the HDFS (Hadoop Distributed File System), a Java-based system that is optimized to store massive data sets, scaling horizontally and maintaining multiple copies to ensure high availability and fault tolerance. Moreover, each HSI scene is divided into blocks of the same size (128 MB) and distributed across Hadoop data-nodes, preparing the environment to perform the training procedure on several data subsets independently. However, the traditional Hadoop MapReduce scheme is inefficient when performing iterative computations, as it requires disk I/O operations to reload the data at each SVM iteration. In this regard, Apache Spark is used to avoid these I/O operations, as it is an in-memory cluster computing platform that prioritizes storing data in the workers' cache memory (when there is enough space) instead of repeatedly reloading it from disk. A driver/controller node executes the global operations of the model and controls the execution of the distributed SVM, while several executors/workers perform the local operations on the distributed data. The steps executed by the proposed approach are illustrated in Fig. 3 and can be summarized as follows (a minimal sketch of this loop is given after the list):

  1. The Spark Driver is launched to manage the execution and the communication between workers, which are in charge of performing the computations. The driver also creates the context process, which is responsible for initializing the variables. Next, the driver converts the user program into tasks and schedules them to the executors.

  2. The data is assigned to the workers as \(\textbf{X}^{(i)}\), where \(i \in I\) determines the i-th worker identifier. In this regard, the data is managed by the HDFS.

  3. The workers are launched and the driver coordinates the tasks among the available workers, as determined by the context execution logic. Each worker must complete its tasks over the respective RDDs, which contain the concatenated spectral pixels (M) to be computed at the same time. The conversion of the RDD lineage into tasks is performed by the scheduler, and tasks are assigned on demand based on the application needs.

  4. The running configuration (such as the weights \(\textbf{W}\)) is broadcast to the available workers. In this sense, the parallel executors running on the different nodes do not require any network routing between them.

  5. After completion of the tasks, a reduction step gathers the output data from the workers and performs the optimization step. Hence, the data from the workers passes through the driver node in a centralized manner. The scheduler updates the job stage and returns the final output.
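As referenced above, the following is a minimal driver-side sketch of steps 1-5, using plain averaged-gradient updates rather than the OWL-QN optimizer described later; the bias term and the regularization of Eq. (1) are omitted for brevity, and all identifiers (trainLoop, lr, etc.) are illustrative.

```scala
import breeze.linalg.DenseVector
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hedged sketch of the distributed loop: broadcast the current weights
// (step 4), let each worker accumulate its local hinge subgradient over its
// partition (steps 2-3), reduce to the driver and update (step 5).
def trainLoop(sc: SparkContext, data: RDD[(Double, DenseVector[Double])],
              dim: Int, maxIter: Int, lr: Double): DenseVector[Double] = {
  val w = DenseVector.zeros[Double](dim)
  for (_ <- 1 to maxIter) {
    val bw = sc.broadcast(w.copy)                  // ship the weights once
    val (grad, n) = data.treeAggregate((DenseVector.zeros[Double](dim), 0L))(
      { case ((g, cnt), (y, x)) =>                 // local per-sample work
          if (y * (bw.value dot x) < 1.0) (g + x * (-y), cnt + 1)
          else (g, cnt + 1) },
      { case ((g1, n1), (g2, n2)) => (g1 + g2, n1 + n2) }  // reduction step
    )
    w -= (grad / n.toDouble) * lr                  // driver-side update
    bw.destroy()
  }
  w
}
```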

Fig. 3 Overview of the proposed cloud architecture. RDD collections depend on the data dimensions assigned to each of them. Tasks are executed in parallel to perform forward-backward steps over their assigned data until the maximum iteration count or threshold is reached

The cloud-based SVM operates by initially loading the training data from the HDFS into an RDD. This provides a fault-tolerant distributed data collection that facilitates distributed and parallel computing over the resulting data partitions. The data is structured as a table of M training samples, each one with the corresponding label \(y_k\) and features \(\textbf{x}_k\) of the k-th sample. The mean \(\mu\) and the standard deviation \(\nu\) are obtained in a distributed manner from the features to standardize the data, \(\tilde{\textbf{x}}_k={\left( \textbf{x}_k-\mu \right) }/{\nu }\), improving the rate of convergence. Moreover, following the "one-vs-all" approach, D SVMs are trained, with D being the number of different land-cover classes in the \(\textbf{X}\) HSI datacube. Therefore, a binary SVM is trained for each class. In this regard, the normal vectors \(\textbf{w}\) related to the different classes are arranged into the matrix \(\textbf{W}\in \mathbb {R}^{D\times N}\), whilst the biases b are collected into the vector \(\textbf{b}\in \mathbb {R}^{D}\). Indeed, the cloud manager is responsible for storing and managing the reduced data from all workers.
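As a hedged sketch of this standardization step, the column statistics can be collected in a single distributed pass, e.g., through MLlib's summary statistics (a hand-rolled tree aggregation would serve equally); identifiers are illustrative.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Sketch of the distributed standardization x~_k = (x_k - mu) / nu.
def standardize(features: RDD[Vector]): RDD[Vector] = {
  val stats = Statistics.colStats(features)       // one distributed pass
  val mu = stats.mean.toArray
  val nu = stats.variance.toArray.map(math.sqrt)  // per-band standard deviation
  features.map { v =>
    val a = v.toArray
    Vectors.dense(Array.tabulate(a.length)(j => (a(j) - mu(j)) / nu(j)))
  }
}
```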

Regarding the optimization problem, instead of considering the L1 regularization \(\mid \mid \textbf{w}\mid \mid _1\), which is not differentiable, the L2-regularized loss formulation of Eq. (1) is used, as it is easier to optimize and more stable. Furthermore, the optimization procedure is conducted over the RDD containing the training samples through the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) optimizer, which is provided by Breeze, the numerical processing library used by Spark. This algorithm generalizes the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method and is executed by the driver node. In this sense, the cost function is defined for the optimizer as a DiffFunction element. This calculates both the loss value and the gradient at a point, and aggregates the information from each worker using the tree aggregation method, summarizing it at the driver where the ML model is located. To update the loss function value and its corresponding gradient, the DiffFunction maps a DifferentiableLossAggregator element over the RDD containing the training samples. Indeed, the DiffFunction sends a DifferentiableLossAggregator instance to the workers, particularly to each partition, which collects the local gradient updates of the hinge loss function and combines the information to obtain the accumulated gradient for a given iteration. Then, the driver updates the trainable parameters according to the accumulated gradient until the optimizer converges or the maximum number of iterations is reached.
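This design can be condensed into the following sketch: a Breeze DiffFunction whose calculate() aggregates the hinge loss and gradient over the RDD via tree aggregation, minimized by OWL-QN on the driver. The constructor arguments (iteration budget, memory size, L1 weight, tolerance) are illustrative placeholders rather than the exact settings of this work, and the per-class loop and bias handling are omitted.

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}
import org.apache.spark.rdd.RDD

// Hedged sketch of the driver-side optimization for one binary SVM.
def fit(data: RDD[(Double, DenseVector[Double])], dim: Int): DenseVector[Double] = {
  val cost = new DiffFunction[DenseVector[Double]] {
    def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val bw = data.sparkContext.broadcast(w)
      // Aggregate (loss, gradient, count) across all partitions.
      val (loss, grad, n) = data.treeAggregate((0.0, DenseVector.zeros[Double](dim), 0L))(
        { case ((l, g, cnt), (y, x)) =>
            val margin = y * (bw.value dot x)
            if (margin < 1.0) (l + (1.0 - margin), g + x * (-y), cnt + 1)
            else (l, g, cnt + 1) },
        { case ((l1, g1, n1), (l2, g2, n2)) => (l1 + l2, g1 + g2, n1 + n2) }
      )
      (loss / n.toDouble, grad / n.toDouble)
    }
  }
  // OWL-QN generalizes L-BFGS with an orthant-wise L1 term (given per index).
  val owlqn = new OWLQN[Int, DenseVector[Double]](200, 10, (_: Int) => 0.01, 1e-6)
  owlqn.minimize(cost, DenseVector.zeros[Double](dim))
}
```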

Experimentation Analysis

Environment Configuration

The experimental evaluation has been performed on a cloud computing platform with OpenStack as its backbone. Spark v3.3.0 is configured against a Hadoop v3.2.3 framework. The master node has 16 GB of RAM and 80 GB of HDD storage with 6 virtual cores. Eight worker nodes are deployed with the same features as the master node. The HDFS and the YARN package manager are used through a web GUI. The physical infrastructure is composed of 8x Dell PowerEdge M630 nodes with CentOS 7.9 as the Operating System (OS). Each node contains 2x Intel(R) Xeon(R) CPU E5-2650 v3 running at 2.30 GHz. All nodes mount a shared storage volume from a dual NetApp FAS3140 Network-Attached Storage (NAS) appliance, which is connected via NFS using a 4\(\times\)1 GB Ethernet network.

Fig. 4 IP dataset. The right-side table includes the land cover labels along with the respective number of samples per class

Fig. 5 BIP dataset. The right-side table includes the land cover labels along with the respective number of samples per class

Experimental Settings

Experiments were conducted using popular high-dimensional HSI data. In particular, the validation of the proposed CC-based method has been conducted on the popular Indian Pines (IP) and Big Indian Pines (BIP) scenes [47], which are described in Figs. 4 and 5, respectively. It is noteworthy that, although both scenes were collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) during a flight campaign over agricultural fields in northwest Indiana, the former scene, IP, contains 16 classes in a \(145\times 145\) datacube with 224 channels (where 24 belong to water absorption bands that can be removed), whilst the latter scene, BIP, is a larger version of the IP scene, comprising 58 land classes in a \(2678\times 614\) datacube with 220 bands. The wavelength range of both scenes is bounded between 0.4 and 2.5 \(\mu\)m.

To evaluate the performance of the proposed CC-based SVM, different widely-used classification models are included for comparative analysis, considering both supervised and unsupervised approaches. These models encompass (i) Gaussian Naive Bayes (GNB); (ii) PERCEPTRON; (iii) Decision Tree (DT); (iv) K-Nearest Neighbors (KNN); (v) RIDGE; (vi) Multinomial Logistic Regression (MLR), and (vii) Random Forest (RF). To assess the classification performance, the overall accuracy (OA), average accuracy (AA) and kappa index (\(\kappa (x100)\)) have been used.
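For reference, the three reported metrics can be derived from a confusion matrix as sketched below; the row/column convention and identifiers are ours, not taken from the evaluated implementations.

```scala
// Illustrative computation of OA, AA and the kappa index from a confusion
// matrix c (rows: reference classes, columns: predicted classes).
// The reported kappa(x100) corresponds to scaling the result by 100.
def metrics(c: Array[Array[Long]]): (Double, Double, Double) = {
  val total = c.map(_.sum).sum.toDouble
  val diag = c.indices.map(i => c(i)(i).toDouble)
  val oa = diag.sum / total                                       // overall accuracy
  val aa = c.indices.map(i => diag(i) / c(i).sum).sum / c.length  // average accuracy
  val pe = c.indices.map { i =>                                   // chance agreement
    (c(i).sum.toDouble / total) * (c.map(_(i)).sum.toDouble / total)
  }.sum
  (oa, aa, (oa - pe) / (1.0 - pe))
}
```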

The SVM, MLR, and PERCEPTRON algorithms are configured with a maximum of 200 iterations. Focusing on the MLR, newton-cg is set as the solver, whilst the penalty is set to none. The KNN classifier is employed with 3 neighbors. Both the DT and RF algorithms have a maximum depth of 40, and the RF algorithm uses 40 estimators with 1 as the maximum features value. Additionally, the SVM, MLR, RIDGE, and PERCEPTRON algorithms have a tolerance of 1e−6. The remaining configuration parameters are the default ones.

Experimental Results

The experimentation is conducted to determine both the scalability and accuracy of the proposed cloud-SVM. For this purpose, four different configurations of the cloud environment were used, varying the number of workers. Specifically, 1, 2, 4 and 8 workers were used during the experimentation.

Fig. 6 Runtime in seconds (s) for small data sizes in megabytes (MB), illustrating the distributed performance of the proposed algorithm

Fig. 7 Runtime in seconds (s) on large data volumes in megabytes (MB), showing scalability in terms of data volume. Node configurations are represented individually in sub-figures a), b) and c), and jointly in sub-figure d)

The first experiment focuses on analyzing the scalability of the proposal with different worker configurations. This is done using the BIP dataset and considering different dataset sizes. This scene presents a high complexity, thus demanding a greater commitment from the computational resources within the cloud environment. This increased resource demand enables a more comprehensive analysis of the response of the proposed methodology to intricate situations. The obtained results are provided in Fig. 6, demonstrating a significant reduction in processing time when employing the proposed distributed SVM algorithm compared to a single-worker execution. Furthermore, utilizing a larger number of workers with the same data size markedly decreases the runtime. This effect is particularly pronounced when dealing with larger data sizes, as it capitalizes on the processing capabilities of the computational resources. Moreover, the decrease in runtime exhibits a proportional relationship with the increase in the number of workers for each dataset size. This observation supports the conclusion that the computational resources achieve near-optimal speedup in each configuration, thereby indicating promising scalability as the dataset size continues to grow. Finally, the evidence from the results demonstrates that the proposal provides a better response, in terms of training times, to the increase in data.

The same behaviour is observed in Fig. 7 for sizes significantly larger than those included in the previous experiment. In particular, Fig. 7a–c provide the individual behavior of the proposal for environments with 2, 4 and 8 workers, respectively. The scalability of each configuration can be visually appreciated considering different data sizes. Finally, Fig. 7d compares all the observed behaviors for the above configurations as a whole. The overall improvement in training times can be observed, where the proposed distributed implementation with 8 workers is the most efficient configuration due to the workload distribution. The scalability with respect to the number of workers and data size is unequivocally demonstrated.

This initial experiment capitalizes on the strengths of the proposed methodology in terms of scalability and acceleration. The increase in data size and number of workers leverages the advantages afforded by cloud computing environments, particularly with respect to enhanced memory and computational processing capabilities.

Finally, the last experiment is conducted to validate the accuracy obtained by the proposed method considering the IP dataset. The choice of this dataset is based on its extensive usage in prior literature. This enables our method to have a direct performance comparison with stable Scikit-learn implementations.

Table 1 Classification results of different classification models over IP scene
Fig. 8 Evolution of performance metrics with different percentages of training data (\(\%\)) from the IP scene

Fig. 9 Classification maps over the IP dataset

Firstly, Table 1 presents per-class classification results for all models. It is evident that the SVM shows superior performance in all the considered metrics. Indeed, the selected classifiers have been arranged based on their accuracy outcomes, where less effective models appear on the left side of Table 1 whilst proficient models are on the right side. The SVM demonstrates a clear differentiation for seven classes, comprising \(43.75\%\) of the total 16 classes (ranging from 0 to 15). Notably, minor classes such as 1-Alfalfa, 6-Grass/pasture-mowed, and 8-Oats show significant accuracy improvements for the SVM compared to other models. Additionally, major classes also benefit from enhanced classification, contributing to a noteworthy overall performance. This is particularly evident in the AA value, signifying consistent performance across all classes.

Secondly, Fig. 8 displays the OA, AA, and \(\kappa (x100)\) for increasing amounts of data. This aligns with the preceding scalability experiments and serves as an indicator of the classification performance in a scalable environment. Figure 8 also highlights the SVM classification performance with low amounts of training data. Moreover, high-resolution classification maps are provided in Fig. 9 to offer detailed insights. It is noteworthy that the SVM model struggles in the fine-grained classification of specific pixels, as is recurrent in traditional models based exclusively on pixel spectral information treated in complete isolation. Nevertheless, its classification map is satisfactory compared to those of the other algorithms.

Finally, the evaluation of classification performance underscores the robustness inherent in the proposed distributed SVM methodology. Notably, the presence of minor classes, characterized by pixels weakly represented across certain partitions, does not significantly impact the overall performance. Conversely, the reduction step tackles this challenge efficiently, yielding remarkable and consistent accuracy across all classes.

Conclusions

This work exploits a distributed cloud computing environment for the processing of high-dimensional hyperspectral data. The proposed methodology focuses on analyzing scalability whilst ensuring a notable classification performance. In this context, sequential algorithms suffer a significant degradation of their performance due to the computationally expensive processing of such complex data. This is exacerbated by the increasing complexity of machine learning models, where performance improves at the cost of more complex and sophisticated algorithms. Therefore, the classification entails a noteworthy amount of calculations that exceeds the capacities of an individual machine. Besides that fact, sequential algorithms are useful only when the data can be stored and processed on a single machine. This work presents a well-optimized parallel implementation of the SVM machine learning algorithm over a cloud environment. The SVM approach demonstrates notable outcomes in discriminating features for classification purposes. The driving rationale behind this research is the high performance of the SVM and the goal of reducing the computational time required for fast decision-making. In the provided study, the experimental results demonstrate computational performance gains in the processing of complex HSI scenarios. Moreover, the proposal exhibits effectiveness for classification purposes, achieving notable performance in comparison to the standard Scikit-learn models. As a case study, the proposal has been evaluated against the baseline one-node implementation for scalability purposes. Encouraged by the outstanding results obtained in this work, future work will develop new distributed implementations for the processing of large HSI data in cloud computing environments.