1 Introduction

Unlike ordinary three-channel RGB images, hyperspectral images (HSIs) are three-dimensional data cubes that combine rich one-dimensional spectral information with two-dimensional spatial information. With the continuous development of spectral processing technology, hyperspectral images have been widely used in medicine [1, 2], agricultural remote sensing [3, 4], geological exploration [5], environmental monitoring [6], and marine remote sensing [7, 8]. Consequently, HSI classification research in the field of remote sensing is attracting increasing attention. Early on, researchers applied many classical classification methods to HSI, such as K-nearest neighbors [9], decision trees [10], support vector machines (SVMs) [11, 12], sparse-representation-based methods [13], and Bayesian estimation [14, 15]. However, these methods rely on the shallow features of the spectral information and fail to exploit the spatial features of the HSIs at a deeper level, which in turn leads to less-than-ideal classification accuracy.

CNNs in the field of deep learning have achieved remarkable results in extracting hierarchical and nonlinear features for tasks such as face recognition, autonomous driving, and drone navigation [16, 17]. In HSI classification, Zhang et al. utilized autoencoders (REA) [18], Zhou et al. introduced stacked sparse autoencoders (SSAE) [19], and Chen et al. designed deep belief networks (DBN) [20] to extract deeper features. However, although these methods acquire deeper local features to some extent, they must reshape the HSI away from its original data form. The rise of CNNs [21] made it possible not only to capture spatial and spectral information [22] but also to keep the original data structure of the HSI uncorrupted. Meanwhile, their weight-sharing property further reduces the number of parameters to be computed.

Hu et al. [23] implemented HSI classification using one-dimensional CNNs. Later, Sharma et al. [24] presented a two-dimensional convolutional network to learn HSI spatial features, and Hamida et al. [25] used a joint three-dimensional convolutional approach to learn HSI spatial and spectral features. Although the 3D block convolution, dropout, and modified residual connection used in this manuscript are not the most recent techniques, we have designed a novel and low-cost spectral branch module and spatial dynamic convolution. Compared with other similar methods, the spatial dynamic convolution module in this manuscript can mine different spectral and spatial information by changing its filters according to the characteristics of each hyperspectral dataset. The three main contributions of this work are as follows:

  1) An end-to-end classification framework with good sparsity is designed by combining a spatial branch with a spectral branch built from grouped convolutions, following a design idea similar to residual networks, to jointly learn the spectral-spatial characteristics of HSI data with better adaptability.

  2) The proposed LCTCS network structure performs the HSI classification task effectively compared with other current methods, as verified on four publicly available hyperspectral datasets.

  3) The LCTCS network structure further reduces the number of parameters, the amount of computation, and the storage space compared with other similar 3D grouped convolution methods.

2 Related Work

2.1 Architecture of HSI Classification Models

The main challenges in HSI classification arise from (1) the poor joint utilization of spectral and spatial information [26], (2) the excessive complexity of the HSI models used for classification, and (3) the unsatisfactory classification results achieved with limited training samples.

In recent years, to capture more abstract spatial and spectral information, Roy et al. [27] proposed the HybridSN classification method, which for the first time combined a 2D CNN to learn features in the spatial domain of the HSI with a 3D CNN to learn spectral and spatial features simultaneously, while reducing the complexity of the model to a certain extent. Zhong et al. [28] designed a spectral-spatial residual network (SSRN) that captures finer spectral-spatial information by training the deeper layers of the classification network with residual connections. Wang et al. [29] developed an end-to-end dense convolutional classification framework for spectral-spatial learning, which reuses previous spectral-spatial features in a dense structure to improve feature utilization. Zhang et al. [30] proposed a classification method with context-aware extraction of local features, which further improves the applicability of the end-to-end model by extracting features adaptively according to the classification target. Although most of the above classification frameworks can alleviate the problem of joint spectral-spatial learning to some extent, they still suffer from limited training samples and overly complex network models. Zheng et al. [31] introduced an end-to-end classification framework with an adaptive attention mechanism in the spatial and spectral dimensions; the method focuses spectral-spatial learning on the spectral bands of individual pixels, thereby alleviating the problem of limited training samples [32]. With the rise of graph convolutional networks in deep learning, Ding et al. [33] proposed a hybrid mechanism based on graph filters that enables information sharing and interaction among different filters to address small-sample training and allow diverse feature learning. Other classification network constructs, such as fully convolutional networks (FCNs) [34], recurrent neural networks (RNNs) [35], generative adversarial networks (GANs) [36], and capsule networks [37, 38], have also been successfully introduced into HSI classification. Moreover, Yang et al. [31] proposed a classification framework with a self-attention mechanism focused on spectral information, called DBDA. Although these classification frameworks effectively alleviate the problem of limited samples and improve the generalization and robustness of the models, they still suffer from redundant weight parameters and inefficient computation.

2.2 Methods for Reducing Computational Resources

Current conventional methods for reducing computational resources include model distillation [39], quantization [40], pruning [41], and parameter sharing [42]. Another way is to design a simple and efficient network architecture to reduce the network weight parameters, storage space, and computation to save computational resources.

To reduce the waste of computational resources, Liu et al. [43] proposed a transfer learning framework for hyperspectral data with different band counts, training CNNs on the initial HSI data and fine-tuning them through transfer learning to reduce the amount of computation required. Li et al. [44] proposed a deep double-channel dense network with a top and local concatenation approach (DDCD), which alleviates the redundancy of weight parameters to a certain extent but still suffers from excessive computational complexity. Meng et al. [45] proposed a lightweight modular approach that replaces the weight parameters of 3 \(\times \) 3 spatial convolutions with pointwise convolution and depthwise-separable convolution. Liu et al. [46] proposed a multihead knowledge distillation classification framework that uses a self-guided refinement network as a teacher network distilled into a compact student network, solving the computational overload problem by compressing the student network. Yang et al. [47] designed an encoding strategy that encodes the connection operations between nodes in a computational unit, which saves training costs when training samples are limited by optimizing the weight-sharing parameters. Subba Reddy et al. [48] proposed combining the Aquila optimizer with a compressed cooperative deep CNN to learn spectral-spatial information, which reduces computational time and memory space by using the Aquila optimizer to reduce the model's loss and the learning complexity of the wavelet bands. Wang et al. [49] designed a lightweight spectral-spatial attention feature classification framework based on network architecture search, which reduces computational complexity by adjusting the weights of different channels with multi-scale Ghost grouping and attention modules. Mei et al. [40] proposed a stepwise activation quantization method that compresses the inputs of the original network with nonlinear uniform quantization, thereby saving memory space. Although the above methods have alleviated the computational resource problem in HSI classification to a certain extent, problems such as too many weight parameters, poor storage space utilization, and heavy computation still exist.

Our study uses a combination of grouped convolution and residual structures to construct the feature extraction blocks, reducing computation and storage space. Dynamic adaptive convolution is also used to perform multi-feature fusion extraction, improving the efficiency of spectral-spatial utilization.

3 Research Methods

3.1 Dimensionality Reduction Process of HSI Data

The HSI dataset X consists of T labeled pixels \(\left\{ t_1, t_2, \ldots , t_a\right\} \in R^{1 \times 1 \times b}\), where b denotes the number of bands; the corresponding true label (ground truth) vectors are \(\left\{ g_1, g_2, \ldots , g_a\right\} \in R^{1 \times 1 \times c}\), where c denotes the number of feature classes. In our work, because of the rich spectral information and hundreds of bands contained in the HSI, the annotated target pixel and its neighborhood cube data are selected directly as the 3D input to feature preprocessing with \(p \times p\) convolution for initial feature extraction, instead of reducing the HSI data with principal component analysis (PCA). The three-dimensional convolution formula is expressed as

$$\begin{aligned} v_{i j}^{x y z}=g\left( \sum _{k=1}^m \sum _{p=0}^{P_i-1} \sum _{q=0}^{Q_i-1} \sum _{r=0}^{R_i-1} w_{i j k}^{p q r} v_{(i-1) k}^{(x+p)(y+q)(z+r)}+b_{i j}\right) \end{aligned}$$
(1)

where (p, q, r) denotes the position in space; \(w_{i j k}^{p q r}\) is the weight of the ijk-th characteristic cube; \(v_{i j}^{x y z}\) represents the j-th cubic block at spatial location (x, y, z) of level \(i\); \(b_{i j}\) denotes the j-th bias at layer i; \(P_i, Q_i, R_i\) refer to the height, width, and number of channels of the \(3 \textrm{D}\) convolution kernel, respectively; and \(g(\cdot )\) denotes the activation function. The size of the convolution kernel used in the feature preprocessing part of this paper is \(1 \times 1 \times 7\), and the stride is set to (1, 1, 2), which determines how the window of each convolutional kernel moves. This causes some local features to be extracted repeatedly during training while reducing the spectral dimensionality and refining the spectral and spatial features.
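As a concrete illustration, the following is a minimal PyTorch sketch of this preprocessing layer; the layer and variable names are ours rather than the original implementation's, and it reproduces the 200-band to 97-band reduction discussed in Sect. 3.5:

```python
import torch
import torch.nn as nn

# A minimal sketch of the feature-preprocessing step, assuming a PyTorch
# layout of (batch, channels, height, width, bands).
preprocess = nn.Conv3d(
    in_channels=1,            # raw HSI cube enters as a single channel
    out_channels=24,          # 24 feature cubes, as in Sect. 3.5
    kernel_size=(1, 1, 7),    # 1x1 spatial window over 7 adjacent bands
    stride=(1, 1, 2),         # stride 2 along the spectral axis halves the bands
)

x = torch.randn(2, 1, 9, 9, 200)   # two 9x9 neighborhood cubes with 200 bands (IP)
print(preprocess(x).shape)          # torch.Size([2, 24, 9, 9, 97])
```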

3.2 Channel Attention Mechanisms

The HSI is input to the convolutional network as cubic blocks from the neighborhood, and the HSI contains rich spectral information as well as band redundancy. To improve the efficiency and accuracy of processing HSI information in the network framework, a channel attention mechanism based on dot-product similarity, similar to [11], is introduced to score important spatial and spectral information, thus improving classification accuracy. The annotated HSI pixels and their neighborhood cube data \(\left\{ t_1, t_2, \ldots , t_a\right\} \in R^{1 \times 1 \times b}\) are taken as the 3D input. The n-band information of the first 3D convolution layer is expressed as two vectors, K and V, in the form of key-value pairs \((K, V)=\left[ \left( k_1, v_1\right) ,\left( k_2, v_2\right) , \ldots ,\left( k_N, v_N\right) \right] \). The importance of the spectral and spatial features of the input is computed in the form of dot products and normalized by the \(\alpha _i={\text {softmax}}\left( s_i\right) \) function. The weights of the important spectral and spatial elements are thereby highlighted and finally weighted and summed, giving the formula for determining the spatial-spectral feature importance weights:

$$\begin{aligned} {\text {att}}((q, \textbf{K}), \textbf{V})_i=\sum _{j=1}^N e^{q_i^{\textrm{T}} k_j} \textbf{v}_j \Big / \sum _{j=1}^N e^{q_i^{\textrm{T}} k_j} \end{aligned}$$
(2)

where \(q_i^{\textrm{T}}\) denotes the query vector for the \(i\)-th significant spectral and spatial feature in the 3D block processed by the first convolution layer.

Fig. 1 Channel attention mechanisms
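For clarity, a minimal sketch of this dot-product scoring rule (Eq. 2) is given below; flattening the feature cube into one descriptor per channel is our assumption for illustration, not a detail taken from the paper:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(q, k, v):
    """Eq. (2): softmax-normalized dot-product scores weight the values.

    q: (N, d) queries, k: (N, d) keys, v: (N, d) values -- here one vector
    per spectral channel, flattened from the 3D feature cube.
    """
    scores = q @ k.t()                  # (N, N) dot products q_i^T k_j
    alpha = F.softmax(scores, dim=-1)   # normalize over j
    return alpha @ v                    # weighted sum of the values

# Toy usage: 24 channels, each summarized as a 16-dimensional descriptor.
feat = torch.randn(24, 16)
out = dot_product_attention(feat, feat, feat)  # self-attention over channels
print(out.shape)  # torch.Size([24, 16])
```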

3.3 Spectral Branching Modules

The redundant parameters contribute little to the transfer of the rich spectral and spatial information in HSIs. This manuscript therefore designs the spectral branching module with simple and efficient 3D grouped convolution to solve the parameter redundancy caused by the redundant number of channels when training 3D convolutional networks. Grouped convolution first appeared in AlexNet [50] in 2012, where the authors distributed the feature maps across multiple GPUs to cope with limited hardware resources and finally fused the computed results. The 3D grouped convolution network here is similar to that of AlexNet [50]: the HSI feature maps entering a layer through \(c_1\) input channels are divided into S groups, the channels of each filter are divided into S groups accordingly, and each group is convolved with its corresponding convolutional kernels independently, without interfering with the others. The \(c_2\) filters entering the convolution generate \(c_2\) feature maps, and in the last step the feature maps are fused by stacking so that the generated feature cubes are the same as those of standard convolution. The parameter reduction module is shown in Fig. 2. We assume that the size of the HSI feature cube input to the n-th ordinary 3D convolution layer is \(H_n \times W_{\textrm{n}} \times C_n\) (height, width, channels) and the size of the HSI feature cube input to the (n+1)-th layer is \(H_{n+1} \times W_{\textrm{n}+1} \times C_{n+1}\). The filter kernel sizes are \(M_n \times M_n \times \textrm{d}_n\) and \(M_{\textrm{n}+1} \times M_{n+1} \times \textrm{d}_{n+1}\). When the spectral branching structure moves a \(3 \textrm{D}\) convolutional kernel window by one step, the number of computed floating-point operations (FLOPs) is as follows:

$$\begin{aligned} Zo=\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_{\textrm{n}} \times \textrm{d}_{\textrm{n}+1}\right) +\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_n \times \textrm{d}_{n+1}-1\right) =2\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_{\textrm{n}} \times \textrm{d}_{\textrm{n}+1}\right) -1, \quad \text{ bias } = \text{ False } \end{aligned}$$
(3)
$$\begin{aligned} Zo=\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_{\textrm{n}} \times \textrm{d}_{\textrm{n}+1}\right) +\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_{\textrm{n}} \times \textrm{d}_{\textrm{n}+1}-1\right) +1=2\left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_{\textrm{n}} \times \textrm{d}_{\textrm{n}+1}\right) , \quad \text{ bias } = \text{ True } \end{aligned}$$
(4)
Fig. 2 Parameter reduction module

The number of parameters of the \(3 \textrm{D}\) convolution kernel at this spatial location is calculated as:

$$\begin{aligned} Pa=\left\{ \begin{array}{ll} C_n \times \left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_n \times \textrm{d}_{n+1}\right) \times C_{n+1}, &{} \text{ bias } = \text{ False } \\ C_n \times \left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_n \times \textrm{d}_{n+1}+1\right) \times C_{n+1}, &{} \text{ bias } = \text{ True } \end{array}\right. \end{aligned}$$
(5)

If the number of 3D convolutional channels is divided into S groups, that is, each group contains \(C_n / S\) channels, then the filters that perform the corresponding feature map extraction are also divided into S groups that do not interfere with each other. At this time, the number of parameters of the convolutional kernels is calculated as:

$$\begin{aligned} Grpa=\frac{C_{\textrm{n}}}{\textrm{S}} \times \left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_n \times \textrm{d}_{n+1}\right) \times C_{\textrm{n}+1}, \quad \text{ bias } = \text{ False } \end{aligned}$$
(6)
$$\begin{aligned} Grpa=\frac{C_{\textrm{n}}}{\textrm{S}} \times \left( \textrm{M}_n^2 \times \textrm{M}_{n+1}^2 \times \textrm{d}_n \times \textrm{d}_{n+1}+1\right) \times C_{\textrm{n}+1}, \quad \text{ bias } = \text{ True } \end{aligned}$$
(7)

According to Eqs. (5) and (6), \(Grpa=\frac{1}{S}Pa\); the computation and the number of parameters are both reduced to \(\frac{1}{S}\) of those of ordinary 3D convolution. This manuscript combines a 3D grouped convolutional layer with BatchNorm and ReLU as a separate unit to simplify the computation, because the ReLU activation function increases the sparsity of the network when the neurons are trained. Figure 2 shows that, after division into S groups, each group uses only 1/S of the filters that would have participated in the convolution calculation; hence, grouped convolution has better sparsity than normal convolution. Because ordinary 3D convolutional networks contain redundant parameters and channels, grouped convolution can in many cases remove more redundant parameters while still learning the important spectral and spatial feature information.
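The following PyTorch sketch shows such a unit and verifies the 1/S parameter reduction numerically; the helper names are ours, and a "same"-padded 1\(\times \)1\(\times \)7 kernel is assumed for illustration:

```python
import torch.nn as nn

def spectral_unit(channels, s):
    """A grouped-conv unit as described above: grouped 3D convolution
    followed by BatchNorm and ReLU. The grouping count `s` is the S of
    Eqs. (6)-(7); the 1x1x7 kernel follows Sect. 3.5."""
    return nn.Sequential(
        nn.Conv3d(channels, channels, kernel_size=(1, 1, 7),
                  padding=(0, 0, 3), groups=s, bias=False),
        nn.BatchNorm3d(channels),
        nn.ReLU(inplace=True),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# The 1/S parameter reduction of Eq. (6) in practice (BatchNorm aside):
plain = nn.Conv3d(24, 24, (1, 1, 7), bias=False)
grouped = nn.Conv3d(24, 24, (1, 1, 7), groups=3, bias=False)
print(n_params(plain), n_params(grouped))  # 4032 1344, i.e. exactly 1/3
```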

3.4 Spatial Branching and Classification Module

To cut the overhead of subsequent training and reduce parameter redundancy, this manuscript modifies the residual block structure as follows: \(\bigoplus \) (a residual-block-like connection in Fig. 3) denotes the element-wise summation operation, \(T_i\) represents the input hyperspectral 3D data block, and ReLU is replaced with Dropout3d.

Fig. 3 Improved residual connections

$$\begin{aligned} T_{l+1}=h\left( t_l\right) +\mathcal {F}\left( t_l, W_l\right) \end{aligned}$$
(8)

After the introduction of Dropout3d in the cropping layer, some channels are randomly set to zero, which is equivalent to randomly discarding some channels to make the whole spatial module network structure sparse, playing a role similar to regularization. Additionally, we remove the ReLU activation after the addition in the traditional residual structure so that spatially localized features are preserved rather than discarded, enabling feature reuse to work well. The convolution part also uses a 1\(\times \)1\(\times \)7 convolution kernel to refine the spectral dimension of the feature blocks for dimensionality reduction. In the residual Eq. (8), \(h\left( t_l\right) \) represents the direct mapping part of the 1\(\times \)1\(\times \)7 3D convolution, \(\mathcal {F}\left( t_l, W_l\right) \) represents the residual component, and \(W_l\) represents the weights of the 3D convolution layers in the residual part.
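A minimal sketch of this modified residual block, under the assumption of "same" padding and an illustrative dropout rate, might look as follows:

```python
import torch
import torch.nn as nn

class SpatialResidualBlock(nn.Module):
    """A sketch of the modified residual connection in Fig. 3: Dropout3d
    replaces ReLU inside the block, and no activation follows the addition,
    so spatially localized (including negative) responses are preserved for
    reuse. Channel count and kernel size follow Sect. 3.5; other details
    are assumptions."""

    def __init__(self, channels=24, p_drop=0.2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, (1, 1, 7), padding=(0, 0, 3)),
            nn.Dropout3d(p=p_drop),   # randomly zeroes whole channels -> sparsity
            nn.Conv3d(channels, channels, (1, 1, 7), padding=(0, 0, 3)),
        )

    def forward(self, t):
        # Eq. (8): identity mapping h(t) plus residual F(t, W); no ReLU after.
        return t + self.conv(t)

block = SpatialResidualBlock()
print(block(torch.randn(2, 24, 9, 9, 97)).shape)  # torch.Size([2, 24, 9, 9, 97])
```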

In the classification module, we take the feature cubes from the spectral branch and the spatial branch and perform a concatenation operation to fuse the spatial and spectral information, feeding the result into a dynamic grouped 3D convolutional layer. The dynamic 3D convolution layer adjusts the size of its convolution kernel according to the different feature cubes to deliver various spectral and spatial information; the result is then sent to the global average pooling layer. All feature cubes processed by the dynamic convolution layer are reduced dimensionally and finally fed to the linear layer to output the classification results. This paper uses the popular cross-entropy loss function, which is defined as:

$$\begin{aligned} H(p, g)=\sum _{i=1}^a g_i\left( \log \sum _{m=1}^a e^{p_m}-p_i\right) \end{aligned}$$
(9)

where \(\left\{ g_1, g_2, \ldots , g_a\right\} \in R^{1 \times 1 \times c}\) represents the true label vector, c represents the number of feature classes, and \(\left\{ p_1, p_2, \ldots , p_a\right\} \in R^{1 \times 1 \times c}\) represents the predicted values.
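A hedged end-to-end sketch of this classification head is given below; the fixed grouped layer stands in for the dynamic grouped convolution, the sizes follow Sect. 3.5, and everything else is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Concatenate the two branch outputs along the channel axis, apply a grouped
# 3D convolution (a stand-in for the dynamic grouped layer), pool globally,
# and classify with a linear layer trained under cross-entropy, Eq. (9).
spectral_out = torch.randn(2, 12, 9, 9, 97)
spatial_out = torch.randn(2, 12, 9, 9, 97)

fused = torch.cat([spectral_out, spatial_out], dim=1)        # (2, 24, 9, 9, 97)
conv = nn.Conv3d(24, 24, kernel_size=(1, 1, 7), groups=6)    # placeholder kernel
pooled = F.adaptive_avg_pool3d(conv(fused), 1).flatten(1)    # (2, 24)
logits = nn.Linear(24, 16)(pooled)                           # 16 classes (IP/SA)

labels = torch.tensor([3, 7])
loss = F.cross_entropy(logits, labels)   # numerically stable form of Eq. (9)
print(logits.shape, loss.item())
```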

Fig. 4 Schematic of the LCTCS network structure

3.5 LCTCS Network Structure

This section describes the details of the designed LCTCS network, as shown in Fig. 4 and Table 1.

The cubic block data of size (200\(\times \)9\(\times \)9, 1) in the HSI is input to the feature preprocessing 3D convolution layer (1\(\times \)1\(\times \)7, 24), and the output size obtained after the convolution operation is (9\(\times \)9\(\times \)97, 24); that is, the feature cube after 3D convolution and dimensionality reduction is 97\(\times \)9\(\times \)9. Subsequently, the resulting cubes are sent to the channel attention mechanism to highlight the important spectral features with the weighting coefficients of the spatial features. Then, we feed the output into the upper spectral branch module and the lower spatial branch module. The spectral branch module takes a grouped convolution layer, a BN layer, and a ReLU activation layer as a separate unit; the (9\(\times \)9\(\times \)97, 24) 3D block is fed into the first unit, whose convolution is divided into three groups, to obtain an output of (9\(\times \)9\(\times \)97, 12), which is then fed into the second unit to further refine the spectral and spatial feature cubes. Meanwhile, to further sparsify the network and save computational resources, the third independent unit uses grouped convolution with S=6 to refine the (9\(\times \)9\(\times \)97, 12) feature blocks and outputs them with the same size. In the spatial branch module, the (9\(\times \)9\(\times \)97, 24) 3D blocks are sent to 3D convolution layers from which the linear activation layer has been removed. The significance of removing the ReLU layer is to keep some neuron outputs nonzero, increasing the correlation between the parameters, so that some spatial features of the HSI can be extracted accurately. The output (9\(\times \)9\(\times \)97, 24) obtained after the first two convolutional modules is superimposed with the 3D block of the same size processed by the channel attention mechanism to reuse the previous features, and is then fed to the next 3D convolutional layer in the same form. Finally, the superimposed result (9\(\times \)9\(\times \)97, 24) is fed into the grouped S=6 convolutional layer.
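As a shape check, the following sketch traces the spectral-branch sizes just described, assuming "same"-padded 1\(\times \)1\(\times \)7 kernels (an illustrative choice, since the exact padding is not stated):

```python
import torch
import torch.nn as nn

# Three grouped units of the spectral branch (S=3, S=3, then S=6).
unit1 = nn.Sequential(nn.Conv3d(24, 12, (1, 1, 7), padding=(0, 0, 3), groups=3),
                      nn.BatchNorm3d(12), nn.ReLU())
unit2 = nn.Sequential(nn.Conv3d(12, 12, (1, 1, 7), padding=(0, 0, 3), groups=3),
                      nn.BatchNorm3d(12), nn.ReLU())
unit3 = nn.Sequential(nn.Conv3d(12, 12, (1, 1, 7), padding=(0, 0, 3), groups=6),
                      nn.BatchNorm3d(12), nn.ReLU())

x = torch.randn(2, 24, 9, 9, 97)       # output of feature preprocessing
print(unit3(unit2(unit1(x))).shape)    # torch.Size([2, 12, 9, 9, 97])
```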

The resulting (9\(\times \)9\(\times \)97, 12) block and the (9\(\times \)9\(\times \)97, 12) feature block from the spectral module are concatenated and input into the dynamic grouped convolution layer, whose convolution kernel keeps changing with the number of hyperspectral data bands to adapt to different data cubes. Finally, the final 1\(\times \)16 2D feature map is obtained by global pooling and a linear layer.
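How such a dynamic kernel might be instantiated is sketched below; the rule mapping band count to kernel depth is purely hypothetical, since the paper does not give the exact formula:

```python
import torch.nn as nn

def make_dynamic_grouped_conv(n_bands, channels=24, s=6):
    """A hypothetical sketch of the dynamic idea: pick the spectral kernel
    depth from the band count of the dataset at hand, so the same classifier
    head adapts to IP (200), PU (103), BS (145), or SA (204) cubes. The
    depth rule here is an assumption, not the authors' exact formula."""
    depth = max(3, n_bands // 16) | 1          # odd kernel depth scaled to bands
    return nn.Conv3d(channels, channels, kernel_size=(1, 1, depth),
                     padding=(0, 0, depth // 2), groups=s)

for bands in (200, 103, 145, 204):
    conv = make_dynamic_grouped_conv(bands)
    print(bands, conv.kernel_size)   # e.g. 200 -> (1, 1, 13), 103 -> (1, 1, 7)
```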

The detailed settings of the network parameters used in the experiments are shown in Table 2.

Table 1 Designed network architecture
Table 2 LCTCS setting related parameters

4 Experimental Demonstration

All experiments were conducted on Windows 10 with the PyCharm integrated development environment. CPU: i7-11700K; GPU: RTX 3080Ti; RAM: 32 GB; memory: 28 GB; Python: 3.8.13; Torch: 1.11.0+cu113.

4.1 Hyperspectral Dataset

To verify the effectiveness of the proposed method, four widely used public hyperspectral datasets, namely, the Indian Pines (IP), PaviaU (PU), Botswana (BS), and Salinas (SA) datasets, are used for experimental validation. The details of the four datasets are as follows:

  1) Indian Pines (IP): Acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) at the Indian Pines test site in northwestern Indiana, USA, this dataset has 145 \(\times \) 145 pixels and 200 spectral reflectance bands. The spectral coverage ranges from 0.4 to 2.5 µm, and the ground truth contains 16 classes of cover vegetation (Fig. 5).

  2) PaviaU (PU): This dataset is a portion of the hyperspectral data collected in 2003 by the German airborne Reflective Optics Spectrographic Imaging System (ROSIS-03) over the city of Pavia, Italy. The spectral imager continuously images 115 bands in the wavelength range of 0.43 to 0.86 µm with a spatial resolution of 1.3 m. The dataset size is 610\(\times \)340\(\times \)103, and the ground truth is classified into nine urban feature types (Fig. 6).

  3) Botswana (BS): A series of data acquired by the NASA EO-1 satellite from 2001 to 2004 over the Okavango Delta in Botswana, with 1476\(\times \)256 pixels and 145 bands. The spectral imaging wavelength ranges from 0.4 to 2.5 µm with a spatial resolution of 30 m, and the ground truth classification has 14 classes of cover features (Fig. 7).

  4) Salinas (SA): This dataset was captured by the 224-band AVIRIS sensor over the Salinas Valley, California, with 512\(\times \)217 pixels, of which 204 bands were used for the study. The spectral coverage ranges from 0.4 to 2.5 µm, and the spatial resolution is 3.7 m. The ground truth has 16 crop categories (Fig. 8).

4.2 Experimental Settings

The evaluation of all algorithms in our work uses three metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). LCTCS is compared with currently used state-of-the-art methods: the double-branch dual-attention mechanism network (DBDA) [31], the spectral-spatial residual network (SSRN) [28], the 3D-2D CNN feature hierarchy (HybridSN) [27], HamidaEtAlNet [25], the double-branch multi-attention mechanism network (DBMA) [51], the double-channel dense network (DDCD) [44], the dual multi-head contextual attention network (DMuCA) [52], the fast dense spectral-spatial convolution network framework (FDSSC) [53], and the classical support vector machine (SVM) [11].

Fig. 5 The IP dataset. a represents the pseudocolor map of the three-band synthesis; b represents the real feature class labels

Fig. 6 The PU dataset. a represents the pseudocolor map of the three-band synthesis; b represents the real feature class labels

Fig. 7 The BS dataset. a denotes the pseudocolor map of the three-band synthesis; b denotes the test feature class labels

Fig. 8 The SA dataset. a denotes the pseudocolor map of the three-band synthesis; b denotes the test feature class labels

Table 3 IP dataset training and testing samples for each class
Table 4 PU dataset training and testing samples for each class
Table 5 BS dataset training and testing samples for each class
Table 6 SA dataset training and testing samples for each class
Table 7 Classification results obtained using the \(10.00 \%\) training sample of the IP dataset (\(\%\))
Fig. 9 IP dataset classification result display

The IP, PU, BS, and \(\textrm{SA}\) datasets are divided into training and test sets: \(10.00 \%\) of the samples of the IP dataset, \(5.00 \%\) of the PU dataset, \(9.00 \%\) of the BS dataset, and \(8.00 \%\) of the SA dataset are selected for training. The remaining \(90.00 \%\), \(95.00 \%\), \(91.00 \%\), and \(92.00 \%\) of the samples, respectively, are used for testing; the specific training and testing sample divisions are shown in Tables 3, 4, 5 and 6.
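A minimal sketch of such a per-class random split is given below; it is our own illustration, with the 0 label marking unlabeled background pixels, the usual convention for these datasets:

```python
import numpy as np

def split_per_class(labels, train_ratio, seed=0):
    """Draw `train_ratio` of the labeled pixels of every class for training
    and keep the rest for testing. `labels` is the flattened ground-truth
    map with 0 marking unlabeled pixels."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        if c == 0:                       # skip unlabeled background
            continue
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(1, int(round(train_ratio * idx.size)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# e.g. the 10% training split for the IP ground truth flattened to 1D:
# tr, te = split_per_class(gt.ravel(), train_ratio=0.10)
```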

4.3 Comparison with the State-of-the-Art Methods for Different Datasets Under a Single Sample Split

This section analyzes the classification maps and classification results of the different datasets under a single sample split; all experiments are run 10 times to obtain the mean and standard deviation, which verifies the effectiveness of the proposed method under a single sample split.

4.3.1 Classification Graph and Classification Results Under the Indian Pines (IP) Dataset

The classification results of the Indian Pines (IP) dataset under the DBDA, SSRN, FDSSC, HybridSN, HamidaEtAlNet, DBMA, and SVM methods are shown in Table 7. The classification maps of the real training labels (a) and test labels (b) with the different methods are shown in Fig. 9. FDSSC, DBDA, and DBMA all use conventional convolution, and the classification results they obtain are lower than those of the proposed method, most likely because the stacked fusion of each channel after grouped convolution increases feature utilization.

4.3.2 Classification Graph and Classification Results Under the PaviaU (PU) Dataset

The classification results for the PU dataset under the different methods are shown in Table 8, and the classification maps for the real training labels (a) and test labels (b) with the different methods are shown in Fig. 10. Table 8 and Fig. 10 illustrate that the SVM algorithm achieves the lowest classification results, mainly because it uses only one-dimensional spectral features, resulting in a large loss of spatial information. The deep methods that jointly learn in the spectral-spatial domain achieved better results than the traditional machine learning algorithm; however, their classification results were still lower than those of the algorithm proposed in our work, which improves overall accuracy by \(1.38 \%\) over the DBDA method.

4.3.3 Classification Graph and Classification Results Under the Botswana (BS) Dataset

The results of classifying the BS dataset under the different methods are shown in Table 9, and the classification maps of the real training labels (a), test labels (b), and different methods are shown in Fig. 11. Observing Table 9 and Fig. 11, the lowest classification result is still that of the traditional machine learning algorithm SVM, whereas the classification results of the other comparison methods are above \(92.00 \%\), and most of the per-class results of the method designed in our work are close to \(100.00 \%\). Meanwhile, the proposed LCTCS method still achieves the highest AA, OA, and Kappa with the minimum computation and number of parameters on the BS dataset, which further proves the robustness and generality of the proposed algorithm.

4.3.4 Classification Graph and Classification Results Under the Salinas (SA) Dataset

The classification results of the SA dataset under the different methods are shown in Table 10, and the classification maps of the real training labels (a) and test labels (b) with the different methods are shown in Fig. 12. On the SA dataset (Table 10), the results achieved by the proposed LCTCS method are \(96.53 \%\), \(97.73 \%\), and \(96.14 \%\) for the OA, AA, and Kappa metrics, respectively, under \(8.00 \%\) training samples, improvements of \(4.14 \%\), \(2.30 \%\), and \(4.58 \%\), respectively, over DBMA. The algorithm proposed in our work achieves desirable results above \(95.00 \%\) for most classes in terms of individual classification results.

4.4 Confusion Matrix Analysis of the Experimental Results

As can be seen from Fig. 13, the classification of sample points is completed well in all sample categories on the Indian Pines dataset. The larger the value in a row of the confusion matrix, the larger the number of samples participating in the classification of that class. The PU dataset is sample-rich, and most of its categories contain fewer than 6K samples. On the BS dataset, which has relatively few samples, there are few misclassified points for each class. On the final SA dataset, the true values and the predicted values are also in good agreement. This again shows that the algorithm has good generality and generalization.
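For reference, a confusion matrix of the kind displayed in Fig. 13 can be computed as in the following sketch, a generic implementation rather than the authors' plotting code:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row r, column c counts test samples of true class r predicted as
    class c; the diagonal holds the correctly classified samples, as read
    in Fig. 13."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulate, handling duplicates
    return cm

y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 0])
print(confusion_matrix(y_true, y_pred, 3))
```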

Table 8 Classification results obtained using the \(5.00 \%\) training sample of the PU dataset (\(\%\))
Fig. 10 PU dataset classification result display

4.5 Analysis of Experimental Results with Different Training Samples

To further illustrate the generalization and robustness of the proposed method, \(1.00 \%\), \(3.00 \%\), \(5.00 \%\), \(10.00 \%\), and \(15.00 \%\) of the data are randomly selected as training samples in the four widely used public HSI datasets, IP, PU, BS, and SA. The classification accuracies of each method under the different training samples are shown in Fig. 14.

Table 9 Classification results obtained using the \(9.00 \%\) training sample of the BS dataset (\(\%\))
  1) The classification results for different training samples under the IP dataset are shown in Fig. 14a: neither the HybridSN method nor the traditional machine learning algorithm SVM achieves particularly satisfactory classification results when the training sample is \(1.00 \%\). In contrast, the LCTCS method in our work achieves very good classification results even when the sample is only \(1.00 \%\).

  2) The classification results of different training samples under the PU dataset are shown in Fig. 14b, and the LCTCS method in our work still has the best classification results when the training sample is \(1.00 \%\). HybridSN, FDSSC, and the spectral-spatial residual method SSRN also all achieve more than \(90.00 \%\) accuracy.

  3) The classification results of different training samples under the BS dataset are shown in Fig. 14c. The classification results of the LCTCS method proposed in this paper are not optimal when the training sample is \(1.00 \%\); however, the results achieved by LCTCS become the best as the training sample size increases.

  4) The classification results for different training samples under the SA dataset are shown in Fig. 14d; the proposed algorithm still obtains optimal classification results when the training sample is \(1.00 \%\). With the increase in training samples, the overall accuracy improves to \(93.87 \%\). Meanwhile, the traditional machine learning algorithm SVM achieved only \(67.41 \%\) on the SA dataset. An interesting phenomenon is that, as the training sample increases from \(1.00 \%\) to \(3.00 \%\), the LCTCS method, like most other methods, rapidly increases its classification accuracy.

Fig. 11 BS dataset classification result display

Table 10 Classification results obtained using the \(8.00 \%\) training sample of the SA dataset (\(\%\))
Fig. 12 SA dataset classification result display

Fig. 13 LCTCS confusion matrix presentation on the four datasets

Fig. 14 Overall accuracy under a IP dataset, b PU dataset, c BS dataset, and d SA dataset with different training samples

Fig. 15 Each method under the four datasets PU, IP, BS, and SA: a FLOPs, b video memory

4.6 Computing Resource Analysis

Saving computational resources is a major advantage of the LCTCS method. To verify this advantage, we make comparisons under the same input sizes of 103\(\times \)25\(\times \)25 (0.25 M), 200\(\times \)25\(\times \)25 (0.48 M), 145\(\times \)25\(\times \)25 (0.35 M), and 204\(\times \)25\(\times \)25 (0.49 M) and give the computational resource usage under the four datasets of PU, IP, BS, and SA.

  1) The specific FLOPs usage is shown in Fig. 15a. The floating-point consumption of DBDA and DBMA is the highest, whereas that of the method designed in our work is the lowest on all four datasets, within 1000 M. This finding also proves the good generality and robustness of this paper's method in terms of FLOPs.

  2) The specific video memory usage is shown in Fig. 15b. Although the HamidaEtAlNet method is the least expensive in terms of storage space consumption, it uses many redundant and cumbersome parameters and FLOPs. The proposed method is not optimal in terms of storage space, but the storage space it consumes has nearly reached a desirable level compared with FDSSC, SSRN, DBDA, and other algorithms. The storage space usage of HybridSN is the worst.

  3) The details of the parameters are shown in Table 11. The number of parameters of our method is the lowest on all four datasets. In particular, the number of parameters on the IP dataset is 9.516 K, only \(0.40 \%\) of the 2.191 M redundant parameters of the HamidaEtAlNet method. Thus, our method greatly alleviates the computation and storage burden caused by redundant parameters; the parameter and storage figures can be obtained as sketched below.
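A minimal way to obtain the parameter and storage figures of Table 11 for any of the compared models (our own utility, not the authors' script):

```python
import torch.nn as nn

def param_count_and_size(model):
    """Counts trainable parameters and their raw float32 storage, the two
    quantities compared in Table 11; `model` stands for any of the networks."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n, n * 4 / 1024  # float32 bytes -> KiB

# e.g. the 1x1x7 preprocessing layer alone:
layer = nn.Conv3d(1, 24, (1, 1, 7), stride=(1, 1, 2))
print(param_count_and_size(layer))  # (192, 0.75) -- 1*24*7 weights + 24 biases
```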

Table 11 Parameters of each method in the four data sets of PU, IP, BS, and SA

4.7 Ablation Experiments

To further illustrate the effectiveness of the proposed method, we conducted a series of ablation experiments on the spectral module, the spatial module, and the attention mechanism module. As can be seen from Table 12, when ASe is not considered, the overall classification accuracy, average classification accuracy, and Kappa coefficient are \(95.94 \%\), \(96.14 \%\), and \(95.60 \%\), respectively, which are \(3.34 \%\), \(3.26 \%\), and \(3.60 \%\) lower than when the spectral module is used together with the spatial module and attention mechanism module (ASS). Comparing the classification results of ASe, ASa, SS, and ASS, the best result is achieved by ASS, largely owing to the feature reuse of the spatial branch and the extraction of global spectral information by the spectral branch.

Table 12 Ablation analysis of the \(5.00 \%\) Botswana dataset with different module combinations
Table 13 The running time of each method in the four datasets of PU, IP, BS, and SA (s)

4.8 Comparative Analysis of Model Running Time

The model running time of this experiment is counted after 100 iterations of each method. As can be seen from Table 13, the support vector machine (SVM) method has the shortest running time on the four datasets because it decomposes the HSI data into high-dimensional vector form and classifies hyperspectral ground objects through one or more hyperplanes. In contrast, the DBDA method consumes a relatively long time on the four datasets because its dense block connections require a large amount of three-dimensional convolution computation.

Although the running time of LCTCS on the four datasets is not the best, because we expand the filters of the dynamic convolution module to retain more spectral band information, which leads to more time consumption, the modified residual sparse branch network of LCTCS greatly reduces the storage space required during network model learning.

5 Conclusion

In our work, a novel HSI classification method called the LCTCS network is proposed. This method organically combines a channel attention mechanism, simple and efficient grouped convolution, and dynamic convolution. The method uses ordinary 3D convolution to reduce dimensionality, channel attention to highlight important spectral and spatial weights, the grouped-convolution spectral and spatial modules to extract global features, and the dynamic classification module to complete the HSI classification task efficiently. This research method can greatly alleviate the waste of computing resources in traditional HSI classification networks. The following conclusions can be drawn.

  1) Experiments with single and multiple sample splits show that this method can effectively maintain advanced classification performance with fewer parameters, lower computation costs, and smaller video memory occupation. At the same time, this shows that the method has good universality and generalization.

  2) Ablation experiments show that the synergy of the spectral branch network and the spatial branch network allows the model to achieve optimal performance, and the added channel attention mechanism can further improve the utilization efficiency of HSI features.

In the future, we will consider designing a more efficient attention mechanism and adaptive convolution module on top of the existing ones to further improve HSI classification performance. This also provides a new idea for making better use of grouped convolution to design more efficient network structures in other fields.