1 Introduction

Hyperspectral (HS) remote sensing is a major breakthrough in remote sensing technology for Digital Earth (Fu et al., 2020). HS sensors, mounted on aircraft or satellites, produce digital images (HyperSpectral Images - HSI) of an observed scene by recording reflected light in hundreds of narrow frequencies covering the visible, near-infrared and shortwave infrared bands (the pixel spectrum) (Appice & Malerba, 2019; Bioucas-Dias et al., 2013; Hoye & Fridman, 2013; Stuart et al., 2019). Such an abundance of spectral data is an invaluable source of knowledge about the physical nature of the materials observed. In particular, the high dimensionality of the measured pixel spectrum can support saliency analysis in HS imagery data, that is, the characterisation of a landscape that exhibits higher contrast with its surroundings (Liang et al., 2013).

In general, the concept of saliency refers to identifying parts, regions, objects or features that first draw visual attention and, hence, can be considered notable and important. In remote sensing, saliency analysis mimics human visual attention and detects the most eye-catching objects or pixels in a sensed scene. This analysis has great potential in many fields such as military science, ocean research, resource exploration, disaster and land-use monitoring. In fact, boosted by a variety of applications (e.g. scene understanding, geological and environmental monitoring and precision-guided systems), research on imagery saliency detection has progressed considerably in recent years (Borji et al., 2019). These studies commonly describe machine learning-based techniques that address the task of saliency detection as a binary pixel labelling problem. In general, these techniques analyse the pixel data of an input image to learn a binary saliency matrix that separates foreground pixels from background pixels.

Although HS imaging techniques have been progressively gaining popularity in recent Earth observation, current saliency detection technologies have achieved real maturity only in the analysis of traditional colour images. The state-of-the-art literature mainly describes techniques for estimating saliency in colour data. With the recent boom of deep learning, we can already count on various powerful deep learning models (Hou et al., 2019; Wang et al., 2020; Gao et al., 2019; Favorskaya & Jain, 2019; Liu et al., 2020; Luo et al., 2020) that have been accurately trained on large amounts of colour images and can be applied to estimate accurate saliency in new colour data. Nevertheless, a few studies have appeared in recent years that begin investigating saliency detection in HS images (e.g. Liang et al., 2013; Cao et al., 2015; Le Moan et al., 2013; Imamoglu et al., 2018; Du & Zhang, 2014; Du et al., 2016; Yan et al., 2016; Zhang et al., 2018; Falini et al., 2020; Appice et al., 2020).

Spectral variability and the curse of dimensionality are the major challenges to deal with in order to properly analyse HS imagery data. Spectral variability is commonly caused by many conditions such as incident illumination, atmospheric effects, unwanted shade and shadow, natural spectrum variation and instrument noise. It has been indicated as one of the main issues of HS classification learning (Appice & Malerba, 2019) and, recently, of HS saliency detection (Zhang et al., 2018). On the other hand, the curse of dimensionality, which is due to the abundance of information in the HS spectrum (Hughes, 1968; Appice & Malerba, 2019), prevents us from processing HS imagery data by simply applying one of the powerful techniques designed for saliency detection in colour images. A naive approach to fit colour imagery techniques to HS imagery data is based on the idea of recovering a colour composition schema for HS imagery display (Du et al., 2008) and analysing colour data in place of HS pixels (Liang et al., 2013). Of course, the consequence of this approach is that the abundant HS information is discarded for the saliency analysis.

A recent research trend has started estimating the saliency in HS images without resorting to their colour rendering. Following this research direction, several studies concentrate on mitigating the curse of dimensionality by applying data reduction techniques to HS data, in order to extract non-redundant informative features that aid in highlighting discriminative properties of salient regions (Le Moan et al., 2013; Falini et al., 2020; Appice et al., 2020). A few studies perform the saliency analysis of HS imagery data according to statistical properties of HS pixels processed through an anomaly detector (Du & Zhang, 2014; Du et al., 2016). Alternatively, contrast techniques coupled with clustering are applied to gradient spectral data (Yan et al., 2016; Zhang et al., 2018). All these studies estimate saliency in HS imagery data by elaborating the abundance of information enclosed in HS pixels. Spatial information is often coupled with spectral data to deal with spectral variability. In any case, these studies disregard the great progress made in saliency detection techniques for colour images. In addition, they analyse the imagery pixels of a single HS image.

In this paper, we revamp the idea of leveraging powerful methodologies fine-tuned in the literature to estimate saliency in colour data. However, we complement initial colour-based saliency assignments with spectral-based saliency refinements that allow us to better separate a salient landscape from its un-salient surroundings. In addition, we handle the HS imagery saliency detection task in the common scenario of surveillance missions, where multiple HS images of various scenes are collected using the same HS sensor technology. In this scenario, the task of saliency detection is performed on a dataset of HS images. As reported in Imamoglu et al. (2018), any saliency detection methodology designed for a single HS image can, in principle, be used for saliency detection in HS image datasets. In fact, independent HS saliency detection processes can be run in parallel on separate HS images to detect saliency in each HS image independently of the others. In contrast, in this paper, we intend to take advantage of the collaboration among multiple saliency detection patterns learned from multiple HS images, so that every pattern can possibly refine the saliency assignments of the other patterns and gain accuracy.

To this aim, we propose an HS saliency detection methodology, named AGNES (pseudo-lAbel Generation for uNsupErvised Saliency detection in Hyperspectral image datasets), that takes a dataset of HS images acquired with the same HS sensor as input and yields the binary saliency matrices of the input images as output. In particular, the proposed methodology takes advantage of saliency classification patterns that can be learned by jointly elaborating the multiple HS images of the input dataset. The learning process is conducted on both the colour mode and the HS mode of each input image. First, it applies a pre-trained, dataset-independent, saliency detection pattern to elaborate colour data and produce saliency pseudo-labels for all pixels of each input HS image. Then, it leverages the produced colour-based pseudo-labels to supervise the learning of a distinct HS saliency classification pattern from each input HS image. PCA is used to deal with the curse of dimensionality during HS classification learning. Each classification pattern learned in this step is able to predict the binary saliency based on the HS pixel spectrum. The use of supervision allows us to learn an ensemble of multiple HS saliency classification patterns from multiple images. This ensemble is used to construct the final saliency matrices of the input images (or of any other new testing images acquired with the same HS sensor).

To the best of our knowledge, the novelty of this study consists in the specific use of colour-based saliency detection patterns within an HS ensemble learning methodology and in the effectiveness of the combination, which actually outperforms several state-of-the-art competitors on a benchmark HS image dataset. In particular, colour-based saliency detection patterns are used to fuel supervision in HS imagery analysis without requiring the acquisition of ground-truth labels. So, this study contributes to proving that the proposed formulation is an effective means to delineate salient pixels in an unsupervised manner by actually taking advantage of the abundant information enclosed in the spectrum of HS pixels. Another contribution is the use of the ensemble in the final labelling step. In this paper, we show that the ensemble can limit the effects of the lack of generality due to the spectral variability that may occur during the sensing operations of each single image. In fact, this lack of generality commonly affects saliency detection patterns learned by processing HS data from a single acquisition. Ultimately, the empirical validation proves that the methodology gains in accuracy with the ensemble compared to other HS saliency detection algorithms.

The remainder of the paper is organised as follows. The next section reports a brief overview of the recent literature. Section 3 introduces basic concepts, while Section 4 illustrates the proposed methodology. Section 5 provides the details of the experiments, the results and some discussion of them. In particular, the experiments described show the effectiveness of each component of the proposed methodology and compare its performance to that of various recent competitors. Finally, Section 6 summarises the conclusions.

2 Related work

Extensive literature has dealt with saliency detection in traditional colour images. The existing methodologies can be roughly divided into two categories – heuristic saliency detection methodologies and deep learning-based saliency detection methodologies. The heuristic methodologies are mainly inspired by the human attention model in neuroscience (Koch and Ullman, 1987) and are rooted in Itti's model defined in Itti et al. (1998). Itti's model mimics the human attention model through a centre-surround contrast technique that calculates the Euclidean distance of the considered pixel to its surrounding ones in the colour space. While simple, Itti's model has inspired most of the subsequent literature (see Borji et al., 2019; Ullah et al., 2020 for recent surveys). On the other hand, deep neural networks have recently gained great success in a wide range of computer vision applications. So, a significant research effort has been devoted to training deep neural network-based saliency detection patterns from large amounts of annotated colour images (Hou et al., 2019; Wang et al., 2020; Gao et al., 2019; Favorskaya and Jain, 2019; Liu et al., 2020; Luo et al., 2020). In addition to these saliency detection methodologies defined for traditional imagery data, a few studies investigate the saliency topic in more challenging domains, such as video (Jun et al., 2015; Wang et al., 2020) or audio (Zlatintsi et al., 2015). In this paper, we focus the overview mainly on the literature that explores the saliency detection topic in HS imagery data.

The first saliency detection methodology for HS imagery data was presented in Liang et al. (2013). It first converts an HS image into a colour image by projecting the HS space into the CIELAB colour space. Then it applies Itti's model to the colour image.

The authors of Cao et al. (2015) seminally formulate a notion of saliency as the extent to which a group of pixels stands out in an HS image in terms of reflectance instead of colourimetric contrast. They extract various HS conspicuity features to construct a unique saliency matrix and apply the winner-take-all strategy to identify salient targets: pixels with the highest salience are regarded as salient targets. Features of various dimensionalities (reflectance curves, principal components and Gabor-filtered pixels) are also extracted in Le Moan et al. (2013). They are in turn compared using either the Euclidean distance or the angular distance.

A few studies propose anomaly detection methodologies for saliency detection in HS images. For example, the authors of Du and Zhang (2014) describe an anomaly detection methodology that resorts to the manifold feature to divide HS pixels into a potential anomaly (salient) part and a potential background part. In Du et al. (2016), statistical characteristics of HS pixels are processed in combination with a sparse representation. In particular, this study illustrates an anomaly detection approach that sparsely and linearly represents a pixel with different atoms under different hypotheses and assumes that the noise has the same covariance structure, but different variances under the two competing hypotheses. It uses the generalised likelihood ratio test to construct the anomaly detector.

In Yan et al. (2016), the authors introduce the idea of resorting to a region-based spectral gradient contrast technique to analyse HS imagery data. They first compute the gradient along each HS pixel. Then they apply segmentation and clustering techniques to the gradient data to get a group of imagery regions. Finally, they resort to centre prior and local contrast to compute the saliency score of each region, with which the salient area can be obtained. A spectral gradient technique is also described in Zhang et al. (2018), where the authors propose to yield various saliency matrices by constructing an imagery region-based hierarchical structure. Each region is evaluated at multiple scales with a saliency pattern that depicts the region contrast with the spectral gradient.

Finally, very recent studies have applied sophisticated data dimensionality reduction techniques to saliency detection in HS images. In Falini et al. (2020), the authors propose an HS saliency detection methodology that cascades non-negative matrix factorisation and clustering on spectral distances computed between the input HS image and the reconstructed image. A companion study (Falini et al., 2020) extends this approach by introducing new distance measures. In Appice et al. (2020), HS imagery data are reconstructed through a deep autoencoder neural network. Similarly to Falini et al. (2020), various distance measures are used to quantify the saliency degree in the data encoded and decoded through the autoencoder, while a clustering stage is performed in order to separate the salient information from the background. However, experiments reported in Appice et al. (2020) show that this methodology commonly achieves the best performance when autoencoders and clustering are coupled with the computation of the spectral-spatial distance introduced in Yang and Mueller (2007).

In general, the HS saliency detection methods defined in the recent literature have focused on processing spectral data without taking advantage of supervision mechanisms. To the best of our knowledge, supervised algorithms have been widely investigated for saliency detection in colour images, where a large amount of annotated colour-based data is actually available. Instead, the lack of a significant amount of ground-truth labels for HS data has prevented recent saliency detection algorithms from taking advantage of supervision mechanisms with HS data. Therefore, introducing the use of supervision is a progress in the state of the art of the HS saliency detection literature. In any case, we differ from traditional supervised algorithms, as we replace ground-truth labels, which remain unavailable during the learning stage, with saliency pseudo-labels. These pseudo-labels are predicted by applying accurate, pre-trained, colour-based saliency detection patterns to colour-based transformations of HS images. Another novel contribution is that existing HS saliency detection algorithms, in general, apply machine learning and numerical algorithms that learn the saliency matrix of a specific HS image without yielding a general pattern that can also be applied to a different HS image. In particular, they account only for the spectral information acquired with the HS image under consideration. So, if applied to a dataset of HS images, these algorithms process every HS image independently of each other. We note that this overlooks the knowledge enclosed in saliency patterns potentially discovered from multiple HS images. A different approach is investigated in this paper. In particular, we resort to ensemble learning theory (Brown, 2010) and explore how multiple saliency detection patterns learned from a dataset of multiple HS images may be handled as a “committee” of decision makers to achieve better overall accuracy on the input HS images than the individual committee members.

3 Preliminary concepts

Let \(\mathcal {I}\) be a dataset of HS images – digital images of observed scenes acquired using an HS sensor. The HS sensor records reflected light in hundreds of narrow frequencies covering the visible, near-infrared and shortwave infrared bands of a wavelength range λ (also called spectrum). The spectrum is an m-dimensional feature vector (spectral vector), so that λ is spanned by the numeric spectral features λ1,λ2,…, and λm.

Every HS image \(\mathbf {I}_{\mathbf {\lambda }} \in \mathcal {I}\) (see Fig. 1) is a three-dimensional set of pixels (called a hyper-cube) with values representing spectral reflectance indexed by spatial coordinates u, v and spectrum λ. A pixel Iλ(u,v) covers a region of around a few square meters of the Earth's surface, as a function of the sensor spatial resolution. Specifically, it is a one-dimensional spectrum section of the hyper-cube Iλ indexed by the spatial coordinates (u,v) within the sensor resolution of the camera. Every pixel spectral value Iλ(u,v,λi) is numeric and expresses how much radiation is reflected, on average, at the i-th band of λ from the resolution cell of the considered pixel Iλ(u,v).

Fig. 1 HS imagery data

In a task of saliency detection, every HS imagery pixel can, in principle, be labelled according to an unknown binary target function, whose range is a finite set of two distinct labels, i.e. “salient” and “no-salient”. According to this function, a saliency matrix S can be associated to an HS image Iλ. In particular, S is a two-dimensional set of saliency values with every value S(u,v) representing the saliency label of the HS pixel Iλ(u,v) indexed by the spatial coordinates u, v.
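For concreteness, the following minimal sketch (in Python with NumPy, the implementation language of this work; all array names and sizes are illustrative) shows how a hyper-cube and its saliency matrix can be represented:

```python
import numpy as np

# Hypothetical hyper-cube I_lambda: U x V spatial grid, m spectral bands.
U, V, m = 768, 1024, 81
I_lambda = np.random.rand(U, V, m)    # I_lambda(u, v, lambda_i): reflectance values

# The pixel spectrum at spatial coordinates (u, v) is an m-dimensional vector.
u, v = 10, 20
pixel_spectrum = I_lambda[u, v, :]    # shape (m,)

# Binary saliency matrix S: S(u, v) = 1 ("salient") or 0 ("no-salient").
S = np.zeros((U, V), dtype=np.uint8)
```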

4 Learning algorithm

We propose a machine learning methodology named AGNES (pseudo-lAbel Generation for uNsupErvised Saliency detection in Hyperspectral image datasets) for saliency detection in an HS image dataset. It takes as input a dataset of HS images (all images are sensed on the same spectrum) and learns an ensemble of HS saliency classification patterns by iterating colour-based saliency pseudo-label generation and HS classification analysis on every HS image of the dataset. The ensemble is used to build the final HS-driven saliency matrices of the HS image dataset taken as input.

We point out that, in the proposed methodology, both colour data and HS data analyses are coupled, since the learning process is performed jointly on these two different representations of the same imagery data. In particular, the saliency pseudo-labels, which are produced from the colour display of HS imagery data, boost the supervision in the HS classification learning stage. In this way, we are actually able to take advantage of the progress achieved in colour-based saliency detection to yield good saliency pseudo-labels. These pseudo-labels allow us to perform classification analyses of HS imagery data although ground-truth saliency labels are unavailable for supervision – HS imagery data are collected in an unsupervised scenario. Finally, the ensemble strategy (Opitz & Maclin, 1999) allows us to strengthen the accuracy of the single HS saliency classification patterns that are learned from a dataset of multiple HS images as members of an ensemble. In particular, the ensemble strategy allows us to actually leverage a finite set of different HS saliency classification patterns for saliency detection. The greater flexibility of the ensemble structure turns out to be more robust to spectral variability.

The block diagram of AGNES is illustrated in Fig. 2, while the pseudo-code is reported in Algorithm 1. The main symbols used are introduced in Table 1. The learning stages of the methodology (colour-based saliency pseudo-label generation, HS classification learning and ensemble classification) are described in Sections 4.1–4.3. The analysis of the time complexity of the algorithm is reported in Section 4.4. Finally, the implementation details of the algorithm are reported in Section 4.5.

Fig. 2 The block diagram of the AGNES methodology. It takes as input a dataset of HS images. It iterates colour-based saliency pseudo-label generation and HS classification analysis to populate an ensemble. It uses the ensemble to yield the saliency matrices of the processed HS images

Algorithm 1 The pseudo-code of AGNES
Table 1 Main symbols used

4.1 Colour-based saliency pseudo-label generation

In the first stage, we leverage the information enclosed in a colorimetric rendering of HS imagery data, in order to generate colour-based saliency pseudo-labels (lines 4-5, Algorithm 1). These pseudo-labels will be used to supervise the subsequent HS classification learning stage. The pseudo-label generation procedure is repeated on every HS image of \(\mathcal {I}\) and proceeds as follows.

Let us consider an HS image \(\mathbf {I_{\lambda }}\in \mathcal {I}\). First, we determine IRGB, that is, the colorimetric rendering of Iλ. For this operation, we apply the technique described in Foster and Amano (2019). The pixel spectrum is scaled to range from 0 to 1, multiplied by the global illuminant value and converted to CIE XYZ tristimulus values according to the equations:

$$ X(u,v) = \kappa \int \mathbf{I}_{\mathbf{\lambda}}(u,v)\, \overline{x}(\lambda)\, \mathrm{d}\lambda, $$
(1)
$$ Y(u,v) = \kappa \int \mathbf{I}_{\mathbf{\lambda}}(u,v)\, \overline{y}(\lambda)\, \mathrm{d}\lambda, $$
(2)
$$ Z(u,v) = \kappa \int \mathbf{I}_{\mathbf{\lambda}}(u,v)\, \overline{z}(\lambda)\, \mathrm{d}\lambda, $$
(3)

where κ is chosen so that Y = 100 for a perfectly white surface under full illumination, while \(\overline {x}(\lambda )\), \(\overline {y}(\lambda )\) and \(\overline {z}(\lambda )\) are the CIE XYZ colour-matching functions for the second standard observer (Foster & Amano, 2019). Then, the CIE XYZ representation is scaled to range from 0 to 1 and transformed to the default RGB colour space sRGB according to the linear transformation (IEC, 1998):

$$ \left[ \begin{array}{c}R({u,v})\\G({u,v})\\B({u,v}) \end{array}\right] =\left[ \begin{array}{ccc}3.2406& -1.5372 &-0.4986\\ -0.9689 & 1.8758 & 0.0415 \\ 0.0557 & -0.2040 & 1.0570 \end{array}\right] \left[ \begin{array}{c} X({u,v})\\ Y({u,v})\\ Z({u,v}) \end{array}\right]. $$
(4)

According to guidelines reported in Foster and Amano (2019), sRGB values less than 0 are set to 0 and sRGB values greater than 1 are set to 1, in order to satisfy range constraints. Finally, a nonlinear correction is applied to compensate approximately for the input–output function of the display device. The typical approximate correction (Foster & Amano, 2019) has the form:

$$ \left[ \begin{array}{c}R^{\prime}({u,v})\\G^{\prime}({u,v})\\B^{\prime}({u,v}) \end{array}\right] =\left[ \begin{array}{c}R({u,v})^{0.4}\\G({u,v})^{0.4}\\B({u,v})^{0.4} \end{array}\right]. $$
(5)
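To illustrate Eqs. (1)-(5), the following sketch renders a hyper-cube as an sRGB image. It is a minimal implementation under our reading of Foster and Amano (2019): the integrals are approximated by discrete sums over the m bands, and the colour-matching functions are assumed to have been resampled at the sensor bands (the function name and its arguments are ours):

```python
import numpy as np

def hs_to_srgb(I_lambda, cmf_x, cmf_y, cmf_z, d_lambda):
    """Render an HS hyper-cube (U x V x m) as sRGB, following Eqs. (1)-(5).

    cmf_x, cmf_y, cmf_z: CIE colour-matching functions sampled at the m bands.
    d_lambda: width of a spectral band (the integration step).
    """
    # Eqs. (1)-(3): discrete approximation of the integrals; kappa normalises
    # Y to 100 for a perfectly white surface (I_lambda = 1 at every band).
    kappa = 100.0 / (np.sum(cmf_y) * d_lambda)
    X = kappa * np.tensordot(I_lambda, cmf_x, axes=([2], [0])) * d_lambda
    Y = kappa * np.tensordot(I_lambda, cmf_y, axes=([2], [0])) * d_lambda
    Z = kappa * np.tensordot(I_lambda, cmf_z, axes=([2], [0])) * d_lambda
    XYZ = np.stack([X, Y, Z], axis=-1) / 100.0      # rescale to the range [0, 1]

    # Eq. (4): linear transformation from CIE XYZ to the sRGB colour space.
    M = np.array([[ 3.2406, -1.5372, -0.4986],
                  [-0.9689,  1.8758,  0.0415],
                  [ 0.0557, -0.2040,  1.0570]])
    RGB = XYZ @ M.T

    # Clip to [0, 1] and apply the approximate display correction of Eq. (5).
    return np.clip(RGB, 0.0, 1.0) ** 0.4
```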

Then, we use the ASNet pattern – a deep neural network pattern described in Wang et al. (2020) – to yield an accurate pixel-wise saliency estimation of IRGB. Specifically, the ASNet pattern produces a grey-level image that displays the numeric intensity of saliency estimated at every colour pixel of IRGB. In this study, we use Otsu's algorithm (Otsu, 1979) to determine the threshold separating the grey-level saliency values estimated by ASNet into two classes: foreground (“salient”) and background (“no-salient”). In particular, we assign imagery pixels with grey-level saliency intensity higher than Otsu's threshold the pseudo-label “salient”, while we assign the remaining pixels the pseudo-label “no-salient”. In this way, we produce the colour-based saliency matrix SRGB associated with Iλ, where every value SRGB(u,v) represents the saliency pseudo-label assigned, based on colour, to the HS pixel Iλ(u,v).

The ASNet pattern is the colorimetric saliency estimation pattern described in Wang et al. (2020). It was learned by training an Attentive Saliency Network – a deep neural network architecture trained with a fixation map, derived at the upper network layers, which mimics human visual attention mechanisms and captures a high-level understanding of the scene from a global view. In particular, the ASNet architecture views saliency as fine-grained object-level saliency segmentation that is progressively optimised with the guidance of the fixation map in a top-down manner. It is a hierarchy of convLSTMs that offers an efficient recurrent mechanism to sequentially refine the saliency features over multiple steps. The pattern described in Wang et al. (2020) was trained on 30160 colour images collected in problems of human fixation prediction and salient object detection. In this paper, the choice of this specific ASNet pattern is due to the extensive empirical validation illustrated in Wang et al. (2020), which proves that the same ASNet pattern that we use here can yield accurate pixel-wise grey-level saliency estimation on a high variety of colour images, outperforming 15 recent deep learning-based alternatives and 4 classical non-deep learning models.

Otsu's algorithm is an adaptive threshold algorithm, introduced in Otsu (1979), commonly used in image binarisation problems. It determines a saliency threshold in a grey-level image by minimising the intra-class intensity variance, defined as a weighted sum of the variances of the two classes.
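The following minimal sketch shows this binarisation step (the helper name is ours; saliency_map is assumed to be the grey-level output produced by ASNet on IRGB):

```python
import numpy as np
from skimage.filters import threshold_otsu

def pseudo_labels_from_saliency(saliency_map):
    """Binarise a grey-level saliency map with Otsu's threshold: pixels above
    the threshold get the pseudo-label "salient" (1), the rest "no-salient" (0)."""
    tau = threshold_otsu(saliency_map)
    return (saliency_map > tau).astype(np.uint8)

# Usage: S_RGB = pseudo_labels_from_saliency(saliency_map), with saliency_map
# a U x V array of saliency intensities.
```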

4.2 HS classification learning

In the second stage, we leverage the information enclosed in the spectral vectors of HS imagery data by performing a process of HS classification learning. The purpose is to take advantage of the abundance of spectral information enclosed in HS pixels, in order to refine the colour-based saliency assignments that may struggle to separate the salient landscape from the surrounding landscape. The HS classification learning stage is repeated on every HS image of \(\mathcal {I}\) (lines 6-9, Algorithm 1), learning a new HS saliency classification pattern to feed the ensemble Φ. The classification analysis of every HS image proceeds as follows.

Let us consider an HS image \(\mathbf {I_{\lambda }}\in \mathcal {I}\). First, we build \(\mathbf {I_{PC_{\lambda }}}\) – the hyper-cube produced by approximating the HS pixels of Iλ on the independent feature space PCλ spanned by PC1,PC2,…, and PCH. This feature space represents the top-ranked H principal components of λ in Iλ. Then we couple the colour-based saliency pseudo-labels of SRGB to \(\mathbf {I_{PC_{\lambda }}}\), in order to construct the training set \(\mathbf {I_{PC_{\lambda }}}\oplus \mathbf {S_{RGB}}\). Finally, we train an HS classification function ϕλ: PCλ↦{salient, no-salient} on \(\mathbf {I_{PC_{\lambda }}}\oplus \mathbf {S_{RGB}}\) and add the learned classification function ϕλ, coupled with the principal component space PCλ, to the ensemble Φ. We point out that 〈PCλ,ϕλ〉 defines an HS saliency classification pattern that can be used to predict the saliency label of any HS pixel sensed on spectrum λ.

Principal Component Analysis (PCA) is one of the most widely used linear feature extraction techniques, which has proved to be a powerful HS imagery data reduction strategy in tasks of HS classification (Xia et al., 2018; Appice and Malerba, 2019) and HS change detection (Lopez-Fandino et al., 2018; Appice et al., 2020). Specifically, PCA reduces the dimension of the data, and thus counters the curse of dimensionality, by finding a few orthogonal directions (the Principal Components – PCs) onto which the projections of the original spectral bands have the largest variance. In HS image analysis, the preference for PCA for data reduction is also motivated by its ability to derive a collinearity-free characterisation of the spectrum. Near spectral bands are strongly correlated with each other, while the spectral principal components are mutually uncorrelated. An illustration of this phenomenon can be seen in Fig. 3a and b. We note that the collinearity phenomenon among near spectral bands may not simply be neglected, as it leads to a series of problems, such as unreliable coefficients and predictions, as well as aggravated data redundancy and computational complexity (Howley et al., 2006). In general, as discussed in Pravilovic et al. (2017) and Pravilovic et al. (2018), PCA is a mandatory step in improving the learning performance, by removing collinearity, speeding up the learning process and reducing the data storage requirements.
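The collinearity argument can be reproduced in a few lines on synthetic data (a sketch; the data are artificial and only meant to mimic strongly correlated bands):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "spectra": 81 bands that share a common component, hence collinear.
rng = np.random.default_rng(0)
base = rng.random((5000, 1))
X = base + 0.05 * rng.random((5000, 81))

corr_bands = np.corrcoef(X, rowvar=False)    # off-diagonal entries close to 1
X_pc = PCA(n_components=10).fit_transform(X)
corr_pcs = np.corrcoef(X_pc, rowvar=False)   # approximately the identity matrix
```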

Fig. 3 IMG02 of the HS-SOD dataset (see Section 5.1 for further details): heatmap of the Pearson correlation matrix of the spectral bands (a) and heatmap of the Pearson correlation matrix of the Principal Components, expressing an orthogonal reduced representation of the entire spectrum (b)

There are also valid alternatives to PCA. For example, autoencoders, which belong to the neural network family, are similar to PCA in that they can be used for finding a low-dimensional representation of input data (Charte et al., 2018). A linear autoencoder minimises the same objective function as PCA, but autoencoders are more flexible, since the activation function can introduce non-linearities in the encoding. Although autoencoders form a broad class of potentially extremely complex models, the advantage of PCA is that it is simple and efficient to train in comparison. Assuming that the linear transformation of PCA fits the spectral data accurately, it is much better to train PCA than to try to select some complex deep model. In Appice and Malerba (2019), the viability of both PCA and autoencoders is compared in various benchmark HS imaging scenarios, concluding that no significant improvement can actually be achieved in HS imagery analysis by considering auto-encoding instead of principal components.

For the classification analysis, we select XGBoost (Chen & Guestrin, 2016) as a valuable classification algorithm for this stage. Although deep neural networks tend to outperform all other classification algorithms in various applications (including image analysis), traditional algorithms are still considered best-in-class when classification analysis comes to small-to-medium tabular data. In this paper, the choice of XGBoost is due to the fact that it is currently one of the most popular machine learning algorithms (outside deep learning) in both academia and industry. It is a highly flexible and versatile algorithm that learns a decision-tree-based ensemble. It uses a gradient boosting framework to minimise the error of sequential models. In particular, XGBoost is efficient, as the process of sequential tree building is performed using a parallelised implementation. It is designed to make efficient use of hardware resources and is implemented with a depth-first approach that contributes to improving computational performance significantly. XGBoost is able to penalise more complex models through backward tree pruning and LASSO (L1) and Ridge (L2) regularisation, in order to prevent overfitting. In addition, it employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets. Finally, there are several studies (Loggenberg et al., 2018; Samat et al., 2020; Zhou et al., 2020) where XGBoost has been applied to HS imaging with great success.
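Putting the two steps together, the following sketch outlines how one HS saliency classification pattern ⟨PCλ,ϕλ⟩ can be learned from a single image under the set-up of Section 4.5 (the function name is ours; error handling is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

def learn_hs_pattern(I_lambda, S_RGB):
    """Learn one HS saliency classification pattern <PC_lambda, phi_lambda>
    from a single HS image, supervised by the colour-based pseudo-labels."""
    U, V, m = I_lambda.shape
    X = I_lambda.reshape(U * V, m)   # one row per pixel spectrum
    y = S_RGB.reshape(U * V)         # pseudo-labels: 1 "salient", 0 "no-salient"

    # PCA with Minka's MLE for the automatic choice of H (see Section 4.5).
    pca = PCA(n_components='mle', svd_solver='full').fit(X)

    # XGBoost with the default set-up (K = 100 trees of depth d = 6).
    phi = XGBClassifier(n_estimators=100, max_depth=6)
    phi.fit(pca.transform(X), y)
    return pca, phi                  # the pattern <PC_lambda, phi_lambda>
```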

4.3 Ensemble classification

In the third stage, we use Φ – the ensemble populated with the HS saliency classification patterns learned from \(\mathcal {I}\) – to yield \(\mathcal {S}\) – the set of the final saliency matrices produced for the HS images of \(\mathcal {I}\) (lines 10-16, Algorithm 1).

In particular, given an HS image \(\mathbf {I_{\lambda }}\in \mathcal {I}\), we use Φ to build the saliency matrix \(\mathbf {S_{\phi _{\lambda }^{\uparrow }}}\) associated with Iλ and add \(\mathbf {S_{\phi _{\lambda }^{\uparrow }}}\) to \(\mathcal {S}\). The computation of \(\mathbf {S_{\phi _{\lambda }^{\uparrow }}}\) proceeds as follows. First, we sort Φ by an estimate of the accuracy of every pattern 〈PCγ,ϕγ〉∈Φ on Iλ. To measure the accuracy of 〈PCγ,ϕγ〉 on Iλ, we compare:

  • \(\mathbf {S^{\lambda }_{\phi _{\gamma }}}\) – the HS-based saliency matrix populated with saliency predictions yielded by 〈PCγ,ϕγ〉 on Iλ, and

  • SRGB – the colour-based saliency matrix of Iλ as it has been already computed in the first stage of the algorithm.

In particular, we measure:

$$ accuracy(\langle \mathbf{PC_{\gamma}},\phi_{\gamma} \rangle,\mathbf{I_{\lambda}}) = AUCBorji(\mathbf{S^{\lambda}_{\phi_{\gamma}}},\mathbf{S_{RGB}}), $$
(6)

where AUCBorji() is the Borji variant of the Area Under the ROC Curve (AUC), which is commonly adopted for measuring accuracy in saliency detection tasks (Borji et al., 2013).

After sorting Φ according to (6), we select the top-ranked HS saliency classification pattern \(\langle \mathbf {PC_{\lambda }^{\uparrow }},\phi _{\lambda }^{\uparrow }\rangle \in \mathbf {\Phi }\), so that:

$$ \left\langle \mathbf{PC_{\lambda}^{\uparrow}},\phi_{\lambda}^{\uparrow}\right\rangle = \arg\max_{{\langle\mathbf{PC}_{\gamma},\phi_{\gamma}\rangle}\in \mathbf{\Phi}} { \ accuracy(\langle \mathbf{PC_{\gamma}},\phi_{\gamma} \rangle,\mathbf{I_{\lambda}}) }. $$
(7)

Finally, we use \(\left \langle \mathbf {PC_{\lambda }^{\uparrow }},\phi _{\lambda }^{\uparrow }\right \rangle \) to build the final saliency matrix \(\mathbf {S_{\phi _{\lambda }^{\uparrow }}}\) associated with Iλ. In \(\mathbf {S_{\phi _{\lambda }^{\uparrow }}}\), the saliency label indexed by u,v is the prediction yielded by \(\phi _{\lambda }^{\uparrow }\) on the principal component values obtained by projecting the HS pixel Iλ(u,v) onto the principal component space \(\mathbf {PC_{\lambda }^{\uparrow }}\).
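The ensemble classification stage of Eqs. (6)-(7) can then be summarised as follows (a sketch; the names are ours, and auc_borji is assumed to implement the measure of Borji et al. (2013), e.g. as sketched in Section 4.5):

```python
def ensemble_saliency(ensemble, I_lambda, S_RGB, auc_borji):
    """Select the top-ranked pattern of the ensemble for image I_lambda
    (Eqs. (6)-(7)) and build its final saliency matrix."""
    U, V, m = I_lambda.shape
    X = I_lambda.reshape(U * V, m)

    best_score, best_pattern = -1.0, None
    for pca, phi in ensemble:                 # each pattern <PC_gamma, phi_gamma>
        S_gamma = phi.predict(pca.transform(X)).reshape(U, V)
        score = auc_borji(S_gamma, S_RGB)     # Eq. (6): agreement with pseudo-labels
        if score > best_score:
            best_score, best_pattern = score, (pca, phi)

    pca_up, phi_up = best_pattern             # Eq. (7): arg max over the ensemble
    return phi_up.predict(pca_up.transform(X)).reshape(U, V)
```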

4.4 Time complexity

Let us consider that: 1) the dataset \(\mathcal {I}\) collects N images with U × V pixel resolution acquired with an HS sensor covering a spectrum λ spanned on m spectral bands; 2) the neural network ASNet used was pre-trained estimating w weights; 3) the algorithm PCA is used to remove the collinearity in a system of m spectral bands; 4) the algorithm XGBoost is used to learn XGBoost patterns with K trees spanned over d layers; 5) the algorithm AUCBorji is implemented by averaging the AUC values computed on s split trials with t equal-width steps processed in each split trial. Based upon these premises, the time complexity of AGNES is computed by summing up the cost of: (i) generating the saliency pseudo-labels with ASNet, (ii) learning the HS saliency classification patterns with PCA and XGBoost and (iii) using the ensemble of the learned HS saliency classification patterns coupled with AUCBorji to produce the final saliency matrices of the input images. Note that the pipeline composed of steps (i) and (ii) may also be run in parallel on each independent image of \(\mathcal {I}\). Again, once the ensemble of the HS saliency classification patterns has been fully learned by completing steps (i) and (ii) on all images of \({\mathcal I}\), step (iii) may be run in parallel on each independent image of \(\mathcal {I}\).

Saliency pseudo-label generation

For each HS image \(\mathbf {I}_{\lambda }\in \mathcal {I}\), the colour image IRGB is determined and the pseudo-labels SRGB are predicted. The time cost of building IRGB is proportional to mUV, while the time cost of generating SRGB is proportional to wUV. Considering that w ≫ m, the time complexity of completing this step on all HS images of \({\mathcal {I}}\) may be written as O(N(wUV)).

HS Classification learning

For each HS image \(\mathbf {I}_{\lambda }\in \mathcal {I}\), its principal components are computed and then an XGBoost pattern is trained. The time cost of determining a system of principal components for an HS image is \(\mathbf {O}(mUV\times \min \limits (m, UV) +m^{3})\). The time cost of training an XGBoost pattern is \(O(Kd n_{z} + n_{z}\log {r_{b}})\), where K is the number of trees, d is the number of tree layers, nz is the number of non-missing entries in the training dataset and rb is the maximum number of rows in each block. Note that, according to the theory reported in Chen and Guestrin (2016), the block structure is adopted to speed up the computation on large datasets. Therefore, assuming m ≪ UV, the time complexity of completing this step on all HS images of \({\mathcal {I}}\) is \(\mathbf {O}(N(m^{2}UV+ Kd n_{z} + n_{z}\log {r_{b}}))\).

Ensemble classification

For each HS image \(\mathbf {I}_{\lambda }\in \mathcal {I}\), the XGBoost pattern that achieves the highest AUCBorji in predicting the pseudo-labels is selected and the final saliency matrix is built. The cost of measuring the AUCBorji value of an XGBoost pattern with respect to the saliency pseudo-labels of an HS image is proportional to stUV, where s denotes the number of split trials and t is the number of steps. This operation is repeated on each XGBoost pattern enclosed in the ensemble. The cost of selecting the top XGBoost pattern of the ensemble is O(N), while the cost of predicting the final labels with the selected XGBoost pattern is O(Kd). Therefore, the cost of performing this phase on all HS images of \({\mathcal {I}}\) is O(N(NstUV + N + Kd)). As N ≪ NstUV and assuming that Kd ≪ NstUV, the time complexity of this step can be rewritten as O(N²stUV).

4.5 Implementation details

AGNES is written in Python 3.7. It uses the ASNet pattern learned in Wang et al. (2020) through Keras 2.3 – a high-level neural network API with TensorFlow as the back-end. In addition, AGNES imports:

  • The implementation of Otsu’s algorithm from skimage.filters.threshold_otsu.

  • The implementation of PCA from sklearn.decomposition.PCA. In particular, we compute the PCA with the full SVD, calling the standard LAPACK solver (svd_solver=‘full’), and with Minka’s MLE (Minka, 2001) for the automatic choice of the dimension H (n_components=‘mle’). By adopting this choice, the algorithm called probabilistic principal component analysis (PPCA), described in Tipping and Bishop (2006), is used.

  • The implementation of XGBoost from xgboost.XGBClassifier, adopting the default parameter set-up reported in the documentation. In particular, the number of trees is K = 100 and the number of tree layers is d = 6.

  • The implementation of the AUCBorji algorithm, which follows the guidelines reported in Borji et al. (2013). In particular, the AUC is measured on t = 10 equal-width steps and repeated on s = 100 split trials. The final AUC is computed as the average of the AUC values computed on the split trials (a sketch of this computation follows below).
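A minimal sketch of the AUCBorji computation is reported below, under our reading of the guidelines in Borji et al. (2013) (the sampling of the negative set may differ in detail from the original definition):

```python
import numpy as np

def auc_borji(S_pred, S_true, s=100, t=10, seed=None):
    """Borji variant of the AUC: for each of s split trials, as many random
    pixels as there are positives are sampled as negatives; the ROC curve is
    swept over t equal-width threshold steps and the s AUC values averaged."""
    rng = np.random.default_rng(seed)
    pred, true = S_pred.ravel().astype(float), S_true.ravel()
    pos = pred[true > 0]                        # scores at (pseudo-)salient pixels
    thresholds = np.linspace(pred.max(), pred.min(), t)

    aucs = []
    for _ in range(s):
        neg = pred[rng.integers(0, pred.size, size=pos.size)]  # random negatives
        tpr = [(pos >= th).mean() for th in thresholds]
        fpr = [(neg >= th).mean() for th in thresholds]
        aucs.append(np.trapz(tpr, fpr))
    return float(np.mean(aucs))
```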

5 Experimental evaluation and discussion

To provide a compelling evaluation of the accuracy of our methodology, we have conducted a range of experiments on the benchmark HS saliency detection dataset called HS-SOD (Imamoglu et al., 2018). This dataset is available with ground-truth saliency images. The main objective of these experiments is to evaluate the effectiveness of the proposed learning methodology along its various learning dimensions – learning on the spectrum, classification with principal components and the ensemble strategy (Section 5.2). In addition, we investigate how the proposed methodology compares to state-of-the-art HS saliency detection competitors (Section 5.3). In all these experiments, the accuracy performance is evaluated with the Borji variant of the Area Under the ROC Curve (AUCBorji) (Borji et al., 2013). This metric has already been used in Imamoglu et al. (2018), Appice et al. (2020), Falini et al. (2020), and Falini et al. (2020) for the analysis of the performance of various HS saliency detection methodologies defined in the literature and evaluated on the HS-SOD dataset.

5.1 HS-SOD

The HS-SOD dataset (Imamoglu et al., 2018) is a benchmark HS image dataset that collects 60 HS images with 1024 × 768 pixel resolution. These HS images were sensed in various scenes of the public parks of Tokyo Waterfront City (Odaiba, Tokyo, Japan) on several days between August and September 2017, when the weather was sunny or partially cloudy. They exhibit different characteristics in terms of sensed landscapes, salient landscape size, foreground-background contrast and salient landscape position in the image. In particular, each HS image of HS-SOD was acquired using the NH-AIK hyperspectral camera, which covers a spectrum between 350 nm and 1100 nm spanned on 150 spectral bands with a spectral resolution of 5 nm. However, the authors of HS-SOD prepared the images on the visible spectrum (380-780 nm) spanned on 81 spectral bands. In addition, they produced a ground-truth binary image for every HS image, so that salient pixels are labelled within each ground-truth image. We point out that ground-truth labels have been ignored during the execution of the saliency detection methodologies, while they are considered to evaluate the accuracy of the saliency matrices constructed.

5.2 Learning components analysis

In this Section, we analyse the effectiveness of the various learning components of AGNES, in order to answer the following questions:

  1. How does the analysis of the spectrum information affect the accuracy of the saliency matrices computed?

  2. How does the accuracy of the HS imaging analysis change when PCA is introduced to deal with the curse of dimensionality?

  3. Is the idea of using the ensemble strategy more powerful than considering single classification patterns?

To this purpose, we perform an ablation study where we consider four configurations as baselines. These are in turn defined by removing HS information analysis, PCA and/or ensemble learning from the whole methodology of AGNES. In particular, these baseline configurations are defined as follows:

  1. ASNet, which takes as input each HS image of the input dataset and applies the ASNet pattern to construct the imagery saliency matrix of the image from the colour rendering of the HS imagery pixels. This baseline considers the colour pixel representation only, forgoing HS information analysis, PCA and ensemble learning.

  2. XGBoost, which takes as input each HS image of the input dataset and applies the ASNet pattern to generate the saliency pseudo-labels from the colour rendering of the HS imagery pixels. It uses these colour-based saliency pseudo-labels to supervise the learning of HS classification patterns with XGBoost. It constructs the final imagery saliency matrix of each HS image with the saliency predictions produced by the XGBoost pattern learned on the HS imagery pixels of the considered image. This baseline couples the colour-based analysis with the spectral-based analysis, forgoing PCA and ensemble learning.

  3. PCA+XGBoost, which is like the XGBoost configuration, but performs PCA of the HS information before learning HS classification patterns with XGBoost. This baseline couples the colour-based analysis with PCA and spectral-based analysis, forgoing ensemble learning.

  4. Ensemble, which populates an ensemble with multiple HS classification patterns learned with the XGBoost configuration from the multiple HS images of the input dataset. It uses this ensemble to build the final imagery saliency matrices of the HS images in the input dataset. This baseline couples the colour-based analysis with the spectral-based analysis and considers ensemble learning, forgoing PCA during HS information analysis.

A summary of the characteristics of the compared configurations is reported in Table 2.

Table 2 Characteristics of the configurations of AGNES evaluated

Fig. 4 AUCBorji (mean ± stdev) of ASNet, XGBoost, PCA+XGBoost, Ensemble and AGNES computed on the saliency matrices yielded for all the HS images of HS-SOD

We evaluate the performance of ASNet, XGBoost, PCA+XGBoost, Ensemble and AGNES by processing all the HS images of HS-SOD. In particular, Fig. 4 reports the AUCBorji (mean and standard deviation) of the compared configurations, confirming that AGNES outperforms, on average, all its baselines. These results point out that decoupling the spectral-based analysis from both PCA and ensemble learning (XGBoost) leads to a drop in accuracy with respect to the baseline decision of saliency assignments based on colour-based information only (ASNet). On the other hand, the configuration that couples PCA with HS imagery classification learning without the ensemble (PCA+XGBoost), as well as the configuration that uses an ensemble of HS classification patterns learned without PCA (Ensemble), have performances that stay close to that of ASNet. Our interpretation of these results is that PCA contributes to dealing with the curse of dimensionality (as well as with HS information collinearity), in agreement with the conclusions already drawn in Appice and Malerba (2019) and Appice et al. (2020) for HS classification and HS change detection, respectively. Ensemble learning contributes to handling possible phenomena of spectral variability that may occur in single HS images by taking advantage of multiple HS classifiers in place of a single one (Ceamanos et al., 2009). So, the winning strategy that allows AGNES to properly take advantage of HS information, dealing with both issues simultaneously, is derived by combining the achievements of both PCA and ensemble learning.

To statistically test whether the improvement in accuracy of AGNES is significant, we use the Friedman test. This is a non-parametric test that is commonly used to compare multiple approaches over multiple datasets (Demšar, 2006). The Friedman test compares the average ranks of the approaches, so that the best performing approach gets rank 1, the second best gets rank 2, and so on. The null hypothesis states that all the configurations are equivalent; under this hypothesis, the ranks of the compared approaches should be equal. We perform this test on the AUCBorji scores of the compared configurations on each image of the HS-SOD dataset and reject the null hypothesis with p-value ≤ 0.05. As the null hypothesis is rejected, we use a post-hoc test – the Nemenyi test – for pairwise comparisons (Demšar, 2006). The display of the results of this test, reported in Fig. 5, confirms that AGNES is ranked higher than all its baseline configurations. In particular, the critical difference diagram, obtained using a 0.05 significance level, shows that AGNES is on average the best performing approach, with the configuration PCA+XGBoost as runner-up.
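A minimal sketch of this testing procedure follows (the Friedman test is available in scipy; the Nemenyi post-hoc test is assumed available through the third-party scikit-posthocs package; the score matrix below is illustrative):

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp   # assumed available for the Nemenyi post-hoc test

# Hypothetical AUCBorji scores: one row per HS image, one column per configuration.
scores = np.random.rand(60, 5)

# Friedman test: the null hypothesis states that all configurations are equivalent.
stat, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])

if p_value <= 0.05:
    # Nemenyi post-hoc test for pairwise comparisons (returns a matrix of p-values).
    p_matrix = sp.posthoc_nemenyi_friedman(scores)
```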

Fig. 5 Comparison between ASNet, XGBoost, PCA+XGBoost, Ensemble and AGNES with the Nemenyi test using the AUCBorji. Groups of configurations that are not significantly different (at p ≤ 0.05) are connected

To complete this discussion, we compare the visual display of a few saliency matrices produced by processing colour-based information only, as well as by joining colour-based and HS-based information. Figure 6 shows the visual displays of both the input and output of ASNet and AGNES for the HS images 1, 2, 17 and 21 of HS-SOD. These displays highlight that, in all the HS images, AGNES can better delineate salient regions along their boundaries. This confirms that the proposed saliency detection methodology can actually take advantage of the HS information to better separate landscapes. We also note that the AUCBorji score achieved by AGNES is lower than the AUCBorji score achieved by ASNet on image 1. This score is computed with the ground truth provided by the authors of the dataset. In this image, AGNES recognises the sign as the salient part of the image, but it also separates the illustration within the sign from the white scene. This separation, which is not reported in the ground truth, depends on the fact that the sign background and the illustration within the sign actually define different landscapes. This suggests that a possible direction to extend this methodology comprises a mechanism to derive a multi-level, global-to-local representation of the saliency information within the same imagery scene (i.e., the local salient detail emerging within a global salient area).

Fig. 6 IMG1, IMG2, IMG17, IMG21 of the HS-SOD dataset: RGB rendering (a, e, i and m), saliency ground-truth (b, f, j and n), saliency images built by ASNet (c, g, k and o), saliency images built by AGNES (d, h, l and p)

5.3 Competitor analysis

Finally, we compare the AUCBorji scores of AGNES against those of the competitors considered in the recent literature for problems of saliency detection in HS images. For this comparative study, we consider:

  • AISA – the HS saliency detection methodology described in Appice et al. (2020). It uses an auto-encoder representation of HS imagery data, the spectral-spatial difference between imagery data and reconstructed data, as well as clustering.

  • SSMF and SSMF1 – the HS saliency detection methodologies presented in Falini et al. (2020) and Falini et al. (2020), respectively. They use a sparse non-negative matrix factorisation algorithm together with several error functions, based on spectral and spatial measures, and with Gaussian-based clustering techniques.

  • Itti's model (Itti et al., 1998), computed on the colour rendering of the HS images.

  • SED and SAM – the HS saliency detection methodologies described in Liang et al. (2013), used with the spectral Euclidean distance and the spectral angular distance, respectively.

  • GS – the HS saliency detection methodology presented in Liang et al. (2013). It first divides the spectral bands into four groups (G1, G2, G3, G4) and then measures the colour opponency by computing the Euclidean distance between these vectors (G1-G3 and G2-G4).

  • SED-OCM-GS and SED-OCM-SAD – the HS saliency detection combinations proposed in Liang et al. (2013). They use orientation-based salient features (OCM).

  • SGC – the HS saliency detection methodology illustrated in Yan et al. (2016). It determines super-pixels by computing spectral and spatial gradients and identifies local region contrasts from the super-pixels.

Table 3 reports the AUCBorji scores of all the compared methodologies. AGNES achieves a better average accuracy than all the other competitors tested in this study. The best performance of AGNES is mainly due to the ensemble learning process, coupled with the specific cascade of colour-based analysis and HS-based analysis. We note that we take advantage of ensemble learning because we can count on a dataset of HS images. This condition generally holds in surveillance missions (e.g. environmental surveillance done with a drone), where multiple images of various scenes are commonly sensed using the same HS technology within the same mission.

Table 3 Competitor analysis

To complete this study, we analyse the computation time spent completing the saliency detection task on the HS-SOD dataset. The computation time is measured in minutes on an Intel(R) Core(TM) i7-4720U CPU@2.60 GHz with 16 GB RAM running Microsoft Windows 8.1 (64 bits). Figure 7 reports the total time spent completing the three steps of AGNES (i.e. colour-based saliency pseudo-label generation, HS classification learning and ensemble classification). Table 4 compares the computation times spent completing the saliency detection task with AGNES and its competitors AISA, SSMF1 and SSMF. These results confirm that the higher accuracy achieved by AGNES comes at the cost of more computation time, spent both performing the supervision in the HS classification learning step and adopting the ensemble for producing the final saliency matrices. Note that, in this evaluation, we have not used a parallel computation infrastructure to perform the learning steps of the compared algorithms on several HS images simultaneously. In principle, assuming the availability of a parallel computation infrastructure, AISA, SSMF1 and SSMF may process all the HS images of the HS-SOD dataset in parallel spending, on average, 9.94 minutes, 6.75 minutes and 2.95 minutes per image, respectively. On the other hand, AGNES spends, on average, 16.92 minutes per image. This estimate is derived by assuming that both the pseudo-label generation and the HS classification learning steps are completed in parallel on each HS image of the HS-SOD dataset. Similarly, the ensemble classification of each HS image is also completed in parallel on each HS image of the HS-SOD dataset once all the classification patterns of the ensemble have been learned.

Table 4 Competitor analysis: computation times

6 Conclusion

This paper illustrates a learning methodology for analyzing a dataset of HS images and detecting, in each HS image, the salient pixel region that can be considered more notable than the background.

The proposed methodology takes advantage of a learning process performed on a dataset of HS images sensed through the same HS sensor. Learning is done by jointly processing the imagery pixel information represented in both the colour mode and the HS mode. In particular, the proposed methodology yields saliency pseudo-labels from a colourimetric rendering of the HS imagery data. It uses these pseudo-labels to supervise the learning of HS classification patterns trained with XGBoost. These HS classification patterns predict saliency assignments based on the HS pixel spectrum. The methodology uses PCA, in order to deal with the curse of dimensionality during HS classification learning, as well as ensemble learning, in order to strengthen the accuracy of the multiple HS classification patterns trained for saliency detection on multiple HS images.

The experiments are performed on a benchmark HS image dataset that comprises both HS images and ground-truth saliency images. The ground-truth labels are considered only to evaluate the accuracy of the saliency assignments learned. The experiments investigate the sensitivity of the performance to the steps of the learning methodology, proving that every component of the methodology contributes to the gain in detection accuracy. The results also reveal that the proposed methodology provides competitive accuracy compared to state-of-the-art HS models. In fact, with the encouraging performance of the proposed methodology, precise salient areas of various scenes may be identified.

Some directions for further work are still to be explored. Appropriate deep learning architectures can be considered, in order to improve the accuracy of the HS classification learning stage. Active learning mechanisms can be applied for label acquisition in a possible weakly supervised enhancement of the classification analysis. In addition, we plan to explore the possibility of introducing a global-to-local mechanism of saliency detection to recognise (and possibly classify) local components belonging to different landscapes within the same global salient region. Moreover, we intend to investigate different HS feature engineering algorithms to feed the classification stage of the learning methodology proposed in this study. In particular, we plan to explore the performance of Gabor features (Jia et al., 2015), autocorrelation features (Appice and Malerba, 2019), morphological features (Appice et al., 2016; 2017) and frequency features (Guccione et al., 2015) in the investigated HS saliency detection scenario. Finally, we plan to extend the investigation of the feasibility of a parallel strategy for implementing the proposed algorithm.