1 Introduction

For several years, satellite missions have been established that provide large amounts of image data with high temporal and spatial resolution covering the whole Earth. The resulting satellite image time series (SITS) not only allow an analysis of the current state of the Earth’s surface, but also make it possible to monitor its development over time. The basis for such an analysis is the pixel-wise classification (semantic segmentation in Computer Vision terminology) of land cover (LC), i.e. the task of assigning a class label representing LC to each pixel in an image. These classes correspond to the physical materials on the Earth’s surface, e.g. Water or Forest; current research in this field is dominated by deep learning methods. Using SITS offers the advantage that both spatial and temporal features can be used to estimate LC. While spatial features capture dependencies between pixels at different spatial locations, temporal features are related to changes of the spectral signature between different timesteps. Both types of features provide important information for LC classification. For instance, neighbouring pixels with similar spectral features often belong to the same class (spatial dependencies), while seasonal variations in the spectral features are characteristic for vegetation (temporal dependencies). SITS also make it possible to create multiple LC maps and thus to monitor the temporal development of LC and to derive cues for changes in LC.

Recent research on semantic segmentation has focused on transformer architectures using self-attention (Vaswani et al. 2017). Liu et al. (2021) apply attention modules for the semantic segmentation of images, introducing a shifted window approach (Swin Transformer) for that purpose. Some works combine an attention-based encoder with a convolution-based decoder to achieve pixel-wise outputs, e.g. (He et al. 2022; Yamazaki et al. 2023). Using convolutions, local features considering the spatial neighbourhood of a central pixel are computed; features encoding information about larger neighbourhoods can be extracted by increasing the receptive field, e.g. by pooling operations (Long et al. 2015) or atrous convolutions (Chen et al. 2018). In self-attention modules, the distance between pixels is not taken into account, because attention is computed between all pairs of feature vectors in an input sequence. For images, such a sequence is usually created by splitting the image into small local patches; this splitting is required to reduce the computational burden. For every patch, a feature vector is generated, and these vectors (called tokens) serve as input to the self-attention modules.

When using SITS for LC classification, the properties of attention modules suit the expected behaviour of temporal dependencies well, because temporal variations in the appearance of LC classes usually occur within longer temporal intervals (weeks to months). On the other hand, the regular grid and local dependencies between pixels in the spatial dimensions of SITS suggest the usage of convolutions for spatial feature extraction. In the literature, hybrid models that combine attention and convolution modules have already been used successfully, e.g. (Garnot and Landrieu 2021; Zhang et al. 2023a). Existing approaches either focus on the generation of one output map only, e.g. for crop type classification (Garnot and Landrieu 2021), or they are limited regarding the spatial dimensions of the input data, only considering single pixels or a very small local neighbourhood (Rußwurm and Körner 2020). An approach that is relatively close to ours is described in (Stucker et al. 2023), where temporal self-attention computed in the bottleneck of a U-Net is used to weight the extracted spatial features. This is done to save computational resources, but might limit the potential for capturing fine spatial details as the weights are only computed at the coarsest resolution. It has to be noted that the goal of Stucker et al. (2023) is cloud removal in SITS, so that they deal with a regression problem rather than with classification.

This paper presents a new approach for multi-temporal LC classification from SITS by integrating global temporal features derived by self-attention and local spatial features obtained by convolutions. The output consists of one label image (referred to as map in the remainder of this paper) for every image in the time series. Predicting multiple LC maps over time makes it possible to predict different LC classes for the same area for different timesteps. This provides a valuable basis for subsequent tasks such as the detection of changes (usually bitemporal) or the analysis of spatio-temporal processes such as the development of settlements over time, though these applications are not pursued in this paper. We propose a new module that computes spatial and temporal features in parallel before combining them for further processing. We integrate this module into a hierarchical encoder and combine it with a convolutional decoder. We also adapt the temporal weighting module of Stucker et al. (2023), weighting the spatial features with the temporal features extracted from the corresponding resolution level in the skip connections connecting the encoder and the decoder. This allows the model to individually weight the spatial features of the time series and, thus, to focus on specific timesteps while downweighting others. The new attention module is embedded in a Swin Transformer backbone (Liu et al. 2021) extended to cope with multi-temporal input and output data. The extracted features serve as inputs to a decoder; we use UPer-Net (Xiao et al. 2018), but any hierarchical model could be used. For training, we use labels from a topographic database. This inherently leads to some label noise, i.e. errors in the reference labels, because it takes some time until a change of LC leads to a database update, whereas the changes are immediately visible in the satellite images. On the other hand, this strategy provides a large amount of labelled data. We also introduce several variants of the feature computation module and compare them to our new hybrid architecture in our experiments. The scientific contribution of this paper can be summarized as follows:

  • We introduce a lightweight spatio-temporal attention module that computes spatial and temporal features in parallel streams instead of between all spatio-temporal input tokens.

  • We further extend this module, replacing self-attention by convolutions for spatial feature extraction.

  • We extend the temporal weighting of Stucker et al. (2023) to all spatial resolutions in order to capture fine spatial details in temporal attention.

  • As a minor contribution, we introduce a temporal position encoding based on the acquisition date composed of the day of the year and the year itself, to be able to process SITS from multiple years.

2 Related work

We start this review by discussing methods for semantic segmentation in general before focusing on applications in remote sensing and, later, on methods for processing SITS. We continue by discussing aspects of the computational complexity of transformer models, and finally, we review methods for temporal position encoding.

Semantic Segmentation:

Current methods for semantic segmentation are based on self-attention modules (Wang et al. 2023a) or convolutions (Wang et al. 2023b). Whereas convolutions compute local features, usually in the spatial neighbourhood of a pixel, self-attention determines global features directly based on all tokens (e.g. patches), considering features independently of their order in the input sequence. Transformer models (Vaswani et al. 2017) are based on the principle of self-attention and have been adapted to various applications in Computer Vision, e.g. for image classification (Dosovitskiy et al. 2021), object detection (Li et al. 2022b) and semantic segmentation (Strudel et al. 2021). Dosovitskiy et al. (2021) introduce the Vision Transformer (ViT) for image classification. They modify the input layer of the original Transformer, stacking the grey values of image patches (e.g. 4 × 4 pixels) and linearly projecting them to a vector that is used as an input token. Strudel et al. (2021) further adapt the ViT for semantic segmentation by using a convolutional decoder and upsampling to determine pixel-wise class labels. The patch size is an important parameter for the ViT; if it is too large, the classification of fine details will suffer, while a small patch size increases the computational costs. To mitigate this effect, the Swin Transformer (Liu et al. 2021) computes attention only in local windows consisting of a fixed number of patches. To include global context, these windows are shifted between subsequent network layers to allow an information flow between them. Furthermore, a hierarchical representation is built by gradually merging neighbouring patches in deeper layers. The ViT and the Swin Transformer can be used as a backbone for different vision tasks, including applications in remote sensing.

Semantic Segmentation in Remote Sensing:

In remote sensing, fully convolutional networks (FCNs) have been used for tasks such as LC classification (Pelletier et al. 2019; Voelsen et al. 2023), crop classification (Ji et al. 2018) and change detection (Caye Daudt et al. 2019). Recent research has focused on attention-based methods and on hybrid methods combining self-attention and convolutional modules to determine class labels at pixel level. Aleissaee et al. (2023) give an overview of the latest research in this field. They differentiate purely attention-based and hybrid models. There are only few approaches in the first group. For instance, Zhang et al. (2022b) use a transformer network for change detection. Xu et al. (2021) use a Swin backbone in combination with a Multilayer Perceptron (MLP) head. Using the MLP instead of a convolutional decoder slightly decreases the quality of the results, but is computationally much less complex. There are many hybrid approaches that combine attention with convolution. A common strategy to do so is to combine a transformer encoder with a decoder similar to the one used in U‑Net (Zhang et al. 2022a; Wang et al. 2022b). Panboonyuen et al. (2021) compare different decoder designs in combination with a Swin encoder and conclude that a pyramid scene parsing module outperforms the other variants. Yamazaki et al. (2023) use a similar model and introduce a skip connection at the original spatial resolution of the input image to improve the classification accuracy for fine details. Several approaches use convolutions in parallel with attentions in the encoder part of the network (Gao et al. 2021; He et al. 2022; Wang et al. 2022a). In these approaches, local spatial features are determined by convolution, while global spatial features are extracted by attention modules. This combination works for aerial images with a ground sampling distance (GSD) of 5–10 cm. However, for satellite data with a coarser GSD (e.g. 10–20 m for Sentinel-2), self-attention would be computed over a range of kilometres instead of meters, and it is doubtful whether considering context over such a large distance when classifying individual pixels is meaningful. This is why we expect local spatial features extracted by convolutions to be well suited for our application, which is based on such satellite images.

Classification of SITS:

The temporal dimension can provide important information to further improve the classification results in case SITS are available. The characteristics of SITS, e.g. irregular temporal intervals, seasonally varying appearance for vegetation and almost constant appearance for man-made objects, indicate that self-attention might be particularly suitable for the extraction of temporal features. Rußwurm and Körner (2020) were among the first authors to use self-attention for the pixel-wise classification of SITS. Their transformer architecture outperforms architectures such as Convolutional or Recurrent Neural Networks (CNN or RNN, resp.) on unprocessed satellite data, while these methods perform equally well on data that are preprocessed, e.g. by considering cloud cover. Garnot and Landrieu (2021) use a multi-temporal adaptation of U‑Net with a lightweight temporal self-attention module (L-TAE) in the bottleneck layer for the panoptic segmentation of crop parcels. The L‑TAE module computes temporal attention masks at pixel-level in the coarsest resolution; afterwards these masks are upsampled to all spatial resolutions and used in the skip connections to weight the different timesteps. In these operations, the temporal dimension of the feature maps is aggregated to one feature map for the whole time series by a sum over the temporal dimension. A convolutional decoder is used to predict a label map at the original GSD. Zhang et al. (2023a) use the L‑TAE module to compute global temporal features from the input time series. In parallel they compute local temporal features with a convolutional module and finally fuse the features with an MLP to predict crop types and LC. This method performs better than others against which it is compared, but as only time series of single pixels are processed by the attention module, no spatial context is considered. Li et al. (2022a) use a hybrid model which first uses depthwise separable convolutions to extract spatial features and afterwards temporal attention layers for temporal feature extraction. The authors employ a self-supervised training strategy for crop type mapping using optical and radar data. Bi et al. (2023) use a ViT to extract temporal features from a time series of images of different plant types with the goal to predict soybean yield. By combining these features with features from seed information extraction, they achieve better results than other approaches.

A purely attention-based approach is introduced by Tarasiou et al. (2023), who adapt the ViT for crop classification with SITS. After computing self-attention between all timesteps of the same patch, in the subsequent layer attention is computed between all patches of the same timestep, arguing that this order of temporal and spatial feature extraction is more suitable for their application. However, it remains unclear whether this assumption holds for and is transferable to other applications. Yan et al. (2022) classify LC from SITS using an architecture which computes self-attention only for timesteps determined to be important on the basis of modified self-attention layers. This is a way to reduce computational costs for long input sequences. While this model outperforms other methods, the authors do not compare it to the standard transformer approach.

The self-attention-based methods for SITS classification mentioned so far only predict a single label map from the time series. This is well suited for applications like crop monitoring, but there are several applications that rely on spatio-temporal data. For instance, Otto et al. (2024) use different spatio-temporal data sources, including land use information, to model fine particulate matter over a time span of six years. In the field of SITS classification there are only a few methods that generate multi-temporal output label maps. Yuan et al. (2022) propose the SITS-former, a model for Sentinel‑2 time series classification. It is pre-trained in a self-supervised way and can be fine-tuned for downstream tasks. The SITS-former applies 3D convolutions to extract spatio-spectral features for each timestep, and the extracted features serve as input to the transformer encoder. The input patches have a size of 5 × 5 pixels, which reduces the spatial context to a small local neighbourhood. Whereas the output of the pre-trained model is multi-temporal, the authors combine the outputs for the application of crop classification. Zhao et al. (2023) introduce a purely attention-based model for active fire detection by combining a transformer encoder with an MLP head. The input consists of a time series of pixels which are classified as fire or non-fire for each timestep. The experiments show that for this application the temporal information is more important than the spatial one. Stucker et al. (2023) adapt the U‑TAE from Garnot and Landrieu (2021) for sequence-to-sequence cloud removal based on SITS. In contrast to Garnot and Landrieu (2021), who aggregate the temporal dimension to one output map when the attention masks are applied, the number of timesteps remains the same in (Stucker et al. 2023). The temporal features are used as weights that guide the model to determine which timesteps are important for the prediction of the pixel values in areas that are occluded by clouds in the input images. While spatial features are computed at all resolutions, the temporal attention is only computed in the coarsest resolution, which prevents the resultant features from representing fine details.

Computational Complexity:

One of the characteristics of attention-based models is that the computational complexity directly depends on the number of input tokens and, thus, on the image size and the patch size used to generate these tokens. Semantic segmentation requires a small patch size to generate good results, which has led to research on reducing the computational complexity of such models without limiting the classification accuracy. In addition to the Swin Transformer, which introduces a shifted window approach to reduce the number of computations, there are several other approaches. For instance, the computational complexity can be reduced by omitting the computation of the value matrices in the attention modules, using the input features directly instead (Garnot and Landrieu 2020). Wang et al. (2021) reduce the dimension of the key and value matrices for the attention computation by a given factor, which can drastically reduce the computation costs. Zhang and Yan (2023b) introduce a router mechanism that first gathers information from all input dimensions by using a query with highly reduced dimension. Afterwards, the aggregated features are distributed again by using a key and a value matrix with reduced dimensions. Regarding the classification of image time series, the separate computation of spatial and temporal features is a common strategy. For instance, Arnab et al. (2021) use a ViT for video scene classification and compare different variants for computing spatial and temporal features. The best performing variant is the one in which the attention is computed between all (spatial and temporal) input tokens, but it also requires the highest computational costs. When the computation is separated (first spatial, then temporal), the quality of the results decreases only slightly, but the computational effort is reduced to 60%. In general, the use of self-attention for semantic segmentation requires a trade-off between computation time and classification accuracy, both of which directly depend on the patch size used. By using lighter versions of self-attention it is possible to further reduce the patch size and, thus, to increase the classification accuracy (Dosovitskiy et al. 2021). In this work we follow this strategy for LC classification, introducing a light version of spatio-temporal attention and reducing the number of patches used in the self-attention layers.

Positional Encoding:

Transformers use a positional encoding to allow the model to make use of the order of the input sequence (Vaswani et al. 2017). This encoding can be fixed or include learnable parameters and is normally added to the input embeddings. Most approaches for image classification adapt this encoding, e.g. (Dosovitskiy et al. 2021; Liu et al. 2021; Strudel et al. 2021), and all authors agree that its use increases the performance, whereas the type of encoding is less critical. For SITS, one needs to encode not only the spatial order of the patches, but also the temporal one. For this purpose, Garnot et al. (2020) adapt the position encoding from Vaswani et al. (2017) to temporal positions based on the number of days since the first acquisition date available in the time series, integrating this encoding in their temporal auto-encoder module to classify crop types based on SITS. Similarly, Tarasiou et al. (2023) use an acquisition time-specific temporal encoding to accommodate irregular distributions of acquisition times. This encoding is learned during training and improves the mean Intersection over Union (mIoU) metric by 2%. Yuan et al. (2022) use a fixed positional encoding vector that is assigned to the day of the year the input image was acquired, which slightly decreases the performance. In our previous work (Voelsen et al. 2023) we showed that a temporal encoding based on the day of the year (doy) slightly increases the quality of the results, especially when a relatively large number of timesteps is used. This motivates further adaptations of the temporal encoding for multi-year SITS by also considering the year of image acquisition.

Summary:

Attention-based models have been used successfully for various applications of SITS. Some of them do not use any spatial neighbourhood in the input data, and therefore spatial dependencies are not considered (Rußwurm and Körner 2020; Zhang et al. 2023a; Yan et al. 2022; Zhao et al. 2023). Other approaches do not investigate different combinations of model components, leaving room for research on adaptations that might be more suitable for a specific task. Most of the reviewed methods deal with crop classification and, thus, only generate one output map for the whole time series. The method proposed in this paper fills these research gaps by introducing and investigating hybrid modules that compute global temporal features by self-attention and local spatial features by convolution to produce an LC map for each timestep of the input time series. Motivated by Arnab et al. (2021), who compare different kinds of encoder configurations for video scene classification, we compare our new module to a method that computes self-attention over space and time jointly, in addition to other baseline methods. To the best of our knowledge this is the first time this is done for LC classification with SITS. Furthermore, we adapt the idea of Garnot and Landrieu (2021) to weight the spatial features in the skip connections using the features determined by the temporal self-attention module. However, we apply this weighting at all resolutions, based on the temporal features extracted from the corresponding layer in the encoder, instead of computing it only once in the bottleneck layer. The method for cloud removal presented by Stucker et al. (2023), which also uses the temporal weighting module of Garnot and Landrieu (2021), is relatively close to ours in spirit, but it deals with a regression rather than a classification problem.

3 Transformers for semantic segmentation

In this section, we give a brief outline of fundamental concepts for applying attention modules to the semantic segmentation of mono-temporal images to make this paper self-contained. In particular, we focus on concepts used in the Swin Transformer (Liu et al. 2021), which forms the basis of our approach. There are two main adaptations required to use transformer modules (Vaswani et al. 2017) for semantic segmentation of images: the patch embedding and the window-based computation of self-attention. In this section, we first introduce the general Swin Transformer structure before discussing both of these adaptations.

Swin Transformer:

The Swin Transformer of Liu et al. (2021) consists of several blocks (called stages). Each stage contains a group of subsequent attention modules that share the same spatial resolution. We denote the stages by \(E_{i}\), with E indicating the encoder part of the model, \(i\in[1,\ldots,I]\) being the index of a stage and I representing the total number of stages. The input and output of each stage are \(z^{i-1}\in\mathbb{R}^{C_{i}\times N_{i}}\) and \(z^{i}\in\mathbb{R}^{C_{i}\times N_{i}}\), respectively, with \(C_{i}\) denoting the feature dimension, \(N_{i}=H_{i}\cdot W_{i}\) representing the number of tokens and \(H_{i}\) and \(W_{i}\) corresponding to the image height and width in stage \(E_{i}\), respectively. Between subsequent stages, patch merging layers are applied. In patch merging, the feature vectors of 2 × 2 spatially neighbouring patches are merged by concatenation. In this way the number of patches is reduced by a factor of four, which is equivalent to a reduction of the spatial resolution by a factor of two in each dimension. A fully connected layer is applied afterwards, to reduce the number of feature maps. In this way, the stages produce a hierarchical representation and the feature maps are of the same resolution as they would be in a typical fully convolutional encoder. This makes it possible to use the Swin Transformer as encoder model in combination with any decoder that is used in vision tasks, e.g. UPerNet (Xiao et al. 2018). The number of feature maps in the first stage is \(C_{in}\), which is doubled whenever patch merging is applied, i.e. in stage \(E_{i}\) the number of feature maps is \(C_{i}=C_{in}\cdot 2^{(i-1)}\). In each stage, \(L_{i}\) spatial attention (\(SA\)) modules are applied consecutively, with \(l_{i}\in[1,\ldots,L_{i}]\) denoting the l-th module in stage \(E_{i}\). One \(SA\) module is shown in Fig. 1.
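The patch-merging step between stages can be illustrated by the following minimal PyTorch sketch; the class and variable names are our own and do not stem from the original implementation:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge 2x2 neighbouring patches and halve the spatial resolution (illustrative sketch)."""
    def __init__(self, c_in: int):
        super().__init__()
        # concatenating 2x2 patches yields 4*c_in features, reduced to 2*c_in by a linear layer
        self.norm = nn.LayerNorm(4 * c_in)
        self.reduction = nn.Linear(4 * c_in, 2 * c_in, bias=False)

    def forward(self, z):            # z: (batch, H_i, W_i, C_i)
        z0 = z[:, 0::2, 0::2, :]     # top-left patch of each 2x2 block
        z1 = z[:, 1::2, 0::2, :]     # bottom-left
        z2 = z[:, 0::2, 1::2, :]     # top-right
        z3 = z[:, 1::2, 1::2, :]     # bottom-right
        z = torch.cat([z0, z1, z2, z3], dim=-1)   # (batch, H_i/2, W_i/2, 4*C_i)
        return self.reduction(self.norm(z))       # (batch, H_i/2, W_i/2, 2*C_i)
```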

Fig. 1
figure 1

Self-attention module for mono-temporal images. \(z^{l_{i}-1}\): input to module \(l\), \(z^{l_{i}}\): output of module \(l\), both for stage \(E_{i}\). W‑MSA: Window based multi-head self-attention, MLP: Multilayer Perceptron, blue rectangle: layer normalisation, +: element-wise addition

In the following we introduce the patch embedding explaining how an input image \(z_{0}\) is transformed to a sequence of feature vectors of dimension \(C_{in}\) required for computing self-attention. Afterwards, the window-based computation of self-attention in the spatial dimensions of an image is described.

Patch embedding:

A mono-temporal input image can be represented as a tensor \(z_{0}\in\mathbb{R}^{B\times H_{0}\times W_{0}}\), with \(B,H_{0}\) and \(W_{0}\) indicating the number of spectral bands, height and width of the input image, respectively. This input has to be transformed to a sequence of 1D vectors (tokens) with dimension \(C_{in}\) to be able to apply self-attention as in (Vaswani et al. 2017). For this purpose, the image is split into non-overlapping patches of size \(P\times P\times B\) and the patches are used to generate 1D vectors by stacking the vectors containing the spectral band values for every pixel in the patch on top of each other. Each of these 1D vectors is processed by a fully connected layer to project it linearly to \(C_{in}\) dimensions, resulting in the tokens that represent the patches in the subsequent processes. This results in a tensor \(z_{PE}\in\mathbb{R}^{C_{in}\times N_{0}}\) consisting of \(N_{0}=\frac{H_{0}}{P}\cdot\frac{W_{0}}{P}\) such tokens that are considered to be the input sequence for semantic segmentation.
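A minimal sketch of the patch embedding is given below. It assumes that the stacking and linear projection are realised by a strided convolution, which is a common equivalent formulation; the names are illustrative and not taken from the original implementation:

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into PxP patches and project each to a C_in-dimensional token (sketch)."""
    def __init__(self, bands: int, c_in: int, patch_size: int = 4):
        super().__init__()
        # a convolution with kernel size = stride = P is equivalent to stacking each
        # PxP patch and applying a shared fully connected layer
        self.proj = nn.Conv2d(bands, c_in, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, B, H0, W0)
        tokens = self.proj(x)                # (batch, C_in, H0/P, W0/P)
        return tokens.flatten(2)             # (batch, C_in, N0) with N0 = H0/P * W0/P
```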

Window based multi-head self-attention:

After patch embedding, the tensor \(z_{PE}\) serves as the input to the first stage of the model. The main component of one \(SA\) module is the window based multi-head self-attention (W‑MSA) module, which is an extension of the standard multi-head self attention module from (Vaswani et al. 2017): In order to reduce the number of computations, self-attention is computed in local windows of size \(M\), i.e. considering \(M\times M\) patches in the spatial neighbourhood, instead of using all tokens from the input image. To connect patches of neighbouring windows in the attention computation, the windows are shifted by \(\frac{M}{2}\) for the following \(SA\) module; cf. (Liu et al. 2021) for more details. Note that window partitioning is directly applied before and after the W‑MSA and all other computations are applied to all tokens.
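The window partitioning and the cyclic shift between subsequent modules can be sketched as follows; this illustration omits the attention mask that the Swin Transformer applies to the wrapped-around regions after shifting, so it is a simplification rather than the original implementation:

```python
import torch

def window_partition(z, m):
    """Split a feature map (batch, H, W, C) into non-overlapping m x m windows (sketch)."""
    b, h, w, c = z.shape
    z = z.view(b, h // m, m, w // m, m, c)
    return z.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)   # (num_windows*batch, m*m, C)

def shift_windows(z, m):
    """Cyclically shift the feature map by m//2 so that the next W-MSA layer connects
    tokens from neighbouring windows (sketch, ignoring the attention mask)."""
    return torch.roll(z, shifts=(-(m // 2), -(m // 2)), dims=(1, 2))
```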

In the \(SA\) module, which is shown in Fig. 1, the W‑MSA is followed by a Multilayer Perceptron (MLP) which consists of two fully connected layers with \(C_{MLP}=4\cdot C_{i}\) dimensions each, and GELU non-linearity between them. Layer normalization (LN) is applied before each W‑MSA and MLP, and a residual connection is applied after each module, resulting in the following computations for one attention module:

$$\begin{aligned}\displaystyle\hat{z}^{l_{i}}& = \textit{W-MSA}(LN(z^{l_{i}-1}))+z^{l_{i}-1}\\\displaystyle z^{l_{i}}&=MLP(LN(\hat{z}^{l_{i}}))+\hat{z}^{l_{i}}\end{aligned}$$
(1)

In Eq. 1, \(\hat{z}^{l_{i}}\in\mathbb{R}^{C_{i}\times N_{i}}\) refers to the output of a W‑MSA layer and \(z^{l_{i}}\in\mathbb{R}^{C_{i}\times N_{i}}\) refers to the output of the MLP for block \(l\) in stage \(E_{i}\). Note that for patch merging and window partitioning, the spatial structure of the tokens is reconstructed (i.e. a sequence of \(N_{i}\) feature vectors is transformed to a 2D feature map of \(H_{i}\cdot W_{i}\) vectors using a function that inverts the flattening process).

Similar to (Vaswani et al. 2017), inside the W‑MSA self-attention is computed in a number of \(n_{H_{i}}\) parallel heads based on Eq. 2:

$$Att(Q_{h},K_{h},V_{h})=SoftMax(Q_{h}K_{h}^{T}/\sqrt{c_{i}})\cdot V_{h},$$
(2)

where \(Q_{h},K_{h},V_{h}\in\mathbb{R}^{M^{2}\times c_{i}}\) represent the query, key and value matrices for head \(h\) and \(c_{i}\) represents the corresponding feature dimension. Using this definition of attention, the output of a W‑MSA layer is based on multi-head self-attention \(MH(Q,K,V)\):

$$\begin{aligned}\displaystyle MH(Q,K,V)&\displaystyle=Concat(head_{1},\ldots,head_{n_{H_{i}}})\cdot W^{O}\\ \displaystyle head_{h}&\displaystyle=Att(Q_{h},K_{h},V_{h})\\ \displaystyle Q_{h}&\displaystyle=Q\cdot W_{h}^{Q}\\ \displaystyle K_{h}&\displaystyle=K\cdot W_{h}^{K}\\ \displaystyle V_{h}&\displaystyle=V\cdot W_{h}^{V}.\end{aligned}$$
(3)

In Eq. 3, \(Q,K\) and \(V\) are the query, key and value matrices determined by linear projection of the input matrix \(LN(z^{l_{i}-1})\) using the parameter matrices \(W_{i}^{Q},W_{i}^{K}\), \(W_{i}^{V}\in\mathbb{R}^{C_{i}\times C_{i}}\), respectively. Afterwards, the parameter matrices \(W_{h}^{Q},W_{h}^{K},W_{h}^{V}\in\mathbb{R}^{C_{i}\times c_{i}}\) project the matrices \(Q\), \(K\) and \(V\) to a sub-representation of dimension \(c_{i}=C_{i}/n_{H_{i}}\). Each resulting triplet of matrices \(Q_{h}\), \(K_{h}\), \(V_{h}\) is processed in a separate head \(h\) (Eq. 3). The outputs of all heads are concatenated, and the result is projected back to a representation of dimension \(C_{i}\) using the parameter matrix \(W^{O}\in\mathbb{R}^{C_{i}\times C_{i}}\), yielding the final output \(MH(Q,K,V)\in\mathbb{R}^{M^{2}\times C_{i}}\).
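A compact sketch of the window-based multi-head self-attention following Eqs. 2 and 3 is given below. For brevity it uses a single joint projection to \(Q\), \(K\) and \(V\) and omits the relative position bias used in the Swin Transformer, so it illustrates the computation rather than reproducing the original implementation:

```python
import torch
import torch.nn as nn

class WindowMSA(nn.Module):
    """Multi-head self-attention inside one window (sketch of Eqs. 2 and 3)."""
    def __init__(self, c: int, n_heads: int):
        super().__init__()
        assert c % n_heads == 0
        self.n_heads, self.c_head = n_heads, c // n_heads
        self.qkv = nn.Linear(c, 3 * c)       # joint projection to Q, K and V
        self.proj = nn.Linear(c, c)          # output projection W^O

    def forward(self, z):                    # z: (num_windows, M*M, C_i)
        n, t, c = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        # reshape to (num_windows, n_heads, M*M, c_i)
        q, k, v = (x.view(n, t, self.n_heads, self.c_head).transpose(1, 2) for x in (q, k, v))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.c_head ** 0.5, dim=-1)  # Eq. 2
        out = (att @ v).transpose(1, 2).reshape(n, t, c)                           # concat heads
        return self.proj(out)                                                      # Eq. 3
```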

4 Multi-temporal Swin Transformer

In this section we introduce our new method for multi-temporal LC classification. It is based on a new module that combines self-attention for temporal feature extraction and convolution for spatial feature extraction. Furthermore, we introduce a module for the temporal weighting within the skip connections and the temporal position encoding, which are both integrated in the new method. In Sect. 4.1 we give an overview of the general structure of our method, before introducing the different self-attention-based modules and the temporal weighting within the skip connections in Sects. 4.2 and 4.3, respectively. In Sect. 4.4 the temporal position encoding is explained. Several model variants are introduced in Sect. 4.5. Finally, Sect. 4.6 describes our training procedure.

4.1 Overview

Our method is an extension of our previous work (Voelsen et al. 2023) and is based on the Swin Transformer (Liu et al. 2021), which we adapted to handle multi-temporal input and output data. The input is an image time series \(z_{0}\in\mathbb{R}^{T\times B\times H_{0}\times W_{0}}\) with \(T\) timesteps, with \(B\), \(H_{0}\) and \(W_{0}\) indicating the number of spectral bands, image height and width, respectively. The generated output is of size \(T\times O\times H_{0}\times W_{0}\), with \(O\) indicating the number of LC classes. An overview of the architecture is shown in Fig. 2. First, the patch embedding is computed for all timesteps in parallel as explained in Sect. 3. Then the temporal position encoding based on the acquisition date is added to all input timesteps; details are explained in Sect. 4.4. This step results in \(z_{PE}\in\mathbb{R}^{T\times C_{in}\times N_{0}}\), which serves as input to the first stage of the encoder. The basic structure of all encoder stages is the one explained in Sect. 3. The main difference is the additional temporal dimension that we keep throughout all stages. In order to do so, we introduce a new attention module \(l\text{-}STA_{c}\) that is based on self-attention for temporal feature extraction and convolution for spatial feature extraction. This module replaces all spatial attention modules within our model and is introduced in Sect. 4.2. The input and the output of stage \(E_{i}\) are denoted by \(z^{i-1}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\) and \(z^{i}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\), respectively, with \(C_{i}\) and \(N_{i}\) as introduced in Sect. 3.

Fig. 2
figure 2

General model overview with four input images for a total number of \(I=3\) stages. PE: Patch embedding. TE: Temporal position encoding. AM: Placeholder for different types of attention modules that are introduced in Sect. 4.2, \(L_{i}\): Number of sequentially used attention modules for stage \(E_{i}\), TS: Temporal weighting in skip connections, green layers: Patch merging, purple layers: Upsampling by bilinear interpolation, PPM: Pyramid pooling module from UPer-Net, +: Element-wise addition of encoder and decoder features. Dotted lines indicate that the respective module is optional. The details about the decoder are shown in Table 1

Similar to Liu et al. (2021), we combine the Swin backbone with an FCN decoder to obtain per-pixel class labels. This decoder is based on UPerNet (Xiao et al. 2018), because in previous experiments we found this combination to outperform the combination of a Swin encoder with a U-Net decoder. Similar to the introduced encoder, UPerNet consists of several stages that we indicate with \(D_{i}\) to distinguish them from the encoder stages. These stages have different spatial resolutions; in this way it is possible to use skip connections from the corresponding stages of the encoder. Throughout the whole decoder, the input and the output of stage \(D_{i}\) are \(z_{D_{i-1}}\in\mathbb{R}^{T\times C_{dec}\times H_{i}\times W_{i}}\) and \(z_{D_{i}}\in\mathbb{R}^{T\times C_{dec}\times H_{i}\times W_{i}}\), respectively. The main components of the UPerNet stages are convolutional blocks (CB), each consisting of a convolutional layer with kernel size \(k\), batch normalization (Ioffe and Szegedy 2015) and ReLU activation.

Whenever computations such as convolutions are applied, this is done in parallel for all \(T\) timesteps, using weights that are shared across timesteps. In the bottleneck layer, a Pyramid Pooling Module (PPM) is used, which extracts spatial features from four different feature map resolutions that are generated by different pooling operations; for more details about the PPM module we refer the reader to (Zhao et al. 2017). The extracted features serve as input to stage \(D_{I}\). As shown in Fig. 2, we count the decoder stages backwards to use the same stage number for corresponding encoder and decoder stages. Before each stage, the decoder and encoder features are combined via skip connections. In the original UPerNet, the encoder features are transformed by a convolutional block (CB) with kernel size \(k=3\) to \(C_{dec}\) feature maps to align feature dimensions. After the CB, the encoder and decoder features are combined by element-wise addition. For our model, we replace the CB for the encoder features with a temporal weighting module that weights the spatial features by using the temporal features, see Sect. 4.3 for details. In the decoder stages, CB with \(k=3\) are applied and an upsampling layer follows (except for stage \(D_{1}\)). The details are shown in Table 1. The extracted feature maps from all decoder stages are used for the output generation, where all feature maps are upsampled to the same spatial resolution of \(\frac{H_{0}}{P}\times\frac{W_{0}}{P}\). Afterwards, the feature maps from all stages are concatenated along the feature dimension, resulting in \(I\cdot C_{dec}\) feature maps, which are then processed by a CB with \(k=3\) to yield an output dimension of \(C_{dec}\). A CB with \(k=1\) maps the feature vectors to raw class scores, which are bilinearly upsampled by the factor \(P\) to obtain scores at the original image size of \(H_{0}\times W_{0}\). Finally, the class scores are normalized by a softmax layer, resulting in the output \(z\in\mathbb{R}^{T\times O\times H_{0}\times W_{0}}\). For more details on UPerNet we refer the reader to (Xiao et al. 2018).

Table 1 Adapted UPerNet that is used as decoder of our model. Enc./Dec.: operation applied to feature maps from encoder/decoder. CB(k): convolutional block including a convolution with kernel size \(k\), batch normalization and ReLU activation. \(TS\): temporal weighting of the encoder features in the skip connections (Sect. 4.3), ’/’ indicates that the \(TS\) module can replace the CB. +: element-wise addition. Up(f): upsampling by a factor f with bilinear interpolation
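The output generation described above can be sketched as follows; head_cb3 and head_cb1 are placeholders for the convolutional blocks CB(3) and CB(1), and the tensor shapes are illustrative assumptions rather than a description of the original implementation:

```python
import torch
import torch.nn.functional as F

def fuse_decoder_stages(stage_feats, head_cb3, head_cb1, out_hw):
    """Fuse the decoder stage features into per-pixel class scores (illustrative sketch).
    stage_feats: list of (T*mb, C_dec, H_i, W_i) tensors, finest resolution (H0/P) first.
    head_cb3 / head_cb1: placeholders for the convolutional blocks CB(3) and CB(1)."""
    target = stage_feats[0].shape[-2:]                     # common resolution H0/P x W0/P
    up = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
          for f in stage_feats]
    z = torch.cat(up, dim=1)                               # (T*mb, I*C_dec, H0/P, W0/P)
    z = head_cb3(z)                                        # reduce to C_dec feature maps
    scores = head_cb1(z)                                   # O raw class scores per pixel
    scores = F.interpolate(scores, size=out_hw, mode='bilinear', align_corners=False)
    return torch.softmax(scores, dim=1)                    # (T*mb, O, H0, W0)
```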

4.2 Spatio-temporal attention

Our new light spatio-temporal attention module separates the computation of temporal and spatial features into two separate streams; the structure of the module is depicted in Fig. 3. The underlying assumption is that input patches that are neighbouring in space or time are more relevant for the classification of a patch than distant ones. How this is done is shown visually in Fig. 4: In the spatial stream, which is indicated by the yellow cubes, spatial dependencies are computed within one window (same timestep), while in the temporal stream, indicated by the red cubes, temporal dependencies are computed between patches of all timesteps (same spatial position). This separation drastically reduces the number of computations of self-attention compared to using all available spatio-temporal input patches. For this reason we call our novel module \(l\text{-}STA\) (light spatio-temporal attention). In the following, we introduce two variants of the \(l\text{-}STA\) module: In \(l\text{-}STA_{a}\), temporal and spatial features are extracted by self-attention, while in \(l\text{-}STA_{c}\), spatial features are computed based on convolutions. We also present a \(STA\) module in which all spatio-temporal patches are jointly used to compute spatio-temporal features. This module, already introduced in (Voelsen et al. 2023), is used in baseline variant \(Swin_{S1}\).

Fig. 3
figure 3

\(l\text{-}STA\) module. Yellow blocks indicate spatial feature extraction, red blocks indicate temporal feature extraction. Blue rectangles: layer normalisation, +: addition, S: stacking of outputs to one feature matrix, F: fusion (explained in Sect. 4.2)

Fig. 4
figure 4

Computation of spatial-temporal attention in the \(l\text{-}STA\) module. Yellow cubes indicate patches that are used within one spatial attention block (\(W\text{-}MSA_{S}\)), red cubes indicate patches for one temporal attention block (\(W\text{-}MSA_{T}\)). In the \(STA\) module all visible cubes are used

Light spatio-temporal attention module:

In this module, the computation of temporal and spatial features is separated into two parallel streams, which both use self-attention (cf. Fig. 3). Similar to the mono-temporal Swin Transformer (Sect. 3), a number \(L_{i}\) of \(l\text{-}STA_{a}\) modules are applied consecutively and the output of the previous module \(z^{l_{i}-1}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\) serves as input to both streams; when \(l=1\), the output of the patch merging (or of the patch embedding in the first stage) is used as input. For the spatial stream this input is split into \(T\) feature tensors \(z_{t}^{l_{i}-1}\in\mathbb{R}^{C_{i}\times N_{i}}\) with \(t\in[1,\ldots,T]\). Afterwards, the SA module from the original Swin Transformer, explained in Sect. 3, is applied for all timesteps in parallel, including the W‑MSA (Eqs. 1–3). The outputs \(z_{t}^{l_{i}}\in\mathbb{R}^{C_{i}\times N_{i}}\) are stacked to form the output \(z_{Sp}^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\), with \(Sp\) indicating spatial feature extraction.

For the temporal stream the input is split into \(N_{i}\) feature tensors \(z_{n}^{l_{i}-1}\in\mathbb{R}^{C_{i}\times T}\) with \(n\in[1,\ldots,N_{i}]\) and \(N_{i}=H_{i}\cdot W_{i}\), which serve as input to \(N_{i}\) temporal attention modules (indicated in red in Fig. 3). The main structure of this module is the same one used for the \(SA\) module, the only difference being that no window-partitioning is required as the number of timesteps \(T\) is significantly smaller than \(N_{i}\). This results in the following computations for one temporal attention module:

$$\begin{aligned}\displaystyle\hat{z}_{n}^{l_{i}}&\displaystyle=MSA(LN(z_{n}^{l_{i}-1}))+z_{n}^{l_{i}-1}\\ \displaystyle z_{n}^{l_{i}}&\displaystyle=MLP(LN(\hat{z}_{n}^{l_{i}}))+\hat{z}_{n}^{l_{i}},\end{aligned}$$
(4)

with \(\hat{z}_{n}^{l_{i}}\in\mathbb{R}^{C_{i}\times T}\) representing the output of the \(MSA\)-layer and \(z_{n}^{l_{i}}\in\mathbb{R}^{C_{i}\times T}\) the output of the MLP. In the \(MSA\), self-attention is computed based on Eqs. 2 and 3, with the difference that \(Q,K,V\in\mathbb{R}^{C_{i}\times T}\). All \(N_{i}\) outputs \(z_{n}^{l_{i}}\in\mathbb{R}^{C_{i}\times T}\) are stacked into one output matrix \(z_{Te}^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\). To fuse the output \(z_{Te}^{l_{i}}\) of the temporal stream with the output \(z_{Sp}^{l_{i}}\) of the spatial stream, they are concatenated to \(z_{ST}^{l_{i}}\in\mathbb{R}^{T\times 2C_{i}\times N_{i}}\). Finally, a linear layer transforms the feature maps to the final output \(z^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\).
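The overall structure of the \(l\text{-}STA_{a}\) module can be summarized by the following sketch; spatial_block and temporal_block are placeholders for the SA module and the temporal attention module described above, and the tensor layout is an assumption for illustration:

```python
import torch
import torch.nn as nn

class LightSTA(nn.Module):
    """Parallel spatial and temporal streams with linear fusion (sketch of l-STA_a)."""
    def __init__(self, c_i, spatial_block, temporal_block):
        super().__init__()
        self.spatial_block = spatial_block       # operates on (T*mb, N_i, C_i)
        self.temporal_block = temporal_block     # operates on (N_i*mb, T, C_i)
        self.fuse = nn.Linear(2 * c_i, c_i)      # fusion of concatenated streams

    def forward(self, z):                        # z: (mb, T, N_i, C_i)
        mb, t, n, c = z.shape
        # spatial stream: T attention blocks applied in parallel
        z_sp = self.spatial_block(z.reshape(mb * t, n, c)).reshape(mb, t, n, c)
        # temporal stream: N_i attention blocks over the T timesteps of each token
        z_te = self.temporal_block(z.permute(0, 2, 1, 3).reshape(mb * n, t, c))
        z_te = z_te.reshape(mb, n, t, c).permute(0, 2, 1, 3)
        return self.fuse(torch.cat([z_sp, z_te], dim=-1))   # back to (mb, T, N_i, C_i)
```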

Hybrid light spatio-temporal attention module:

In the \(l\text{-}STA_{a}\), both parallel streams use self-attention to compute the spatial and the temporal dependencies. As discussed earlier, convolutions are expected to be well-suited to extract spatial features. For this reason, we introduce the hybrid light attention module (\(l\text{-}STA_{c}\)). Its basic structure is similar to the \(l\text{-}STA_{a}\) module, but we adapt the spatial stream by replacing the W‑MSA layer with a convolutional layer, which results in the following computations:

$$\begin{aligned}\begin{aligned}\displaystyle\hat{z}_{t}^{l_{i}}&\displaystyle=Conv(LN(z_{t}^{l_{i}-1}))+z_{t}^{l_{i}-1}\\ \displaystyle z_{t}^{l_{i}}&\displaystyle=MLP(LN(\hat{z}_{t}^{l_{i}}))+\hat{z}_{t}^{l_{i}}\end{aligned}\end{aligned}$$
(5)

with \(z_{t}^{l_{i}-1}\in\mathbb{R}^{C_{i}\times N_{i}}\) being the input to the spatial stream similar to the previous module and \(Conv()\) representing a convolutional layer with kernel size \(k=3\) and \(C_{i}\) output dimensions. This results in \(\hat{z}_{t}^{l_{i}}\), which is processed by layer normalisation and the same MLP as in the \(SA\) module.

The rest of the \(l\text{-}STA_{c}\) module is identical to the \(l\text{-}STA_{a}\) module: The results of the spatial and the temporal streams are combined by concatenation and employing a linear layer to obtain the final output \(z^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\).
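The spatial stream of \(l\text{-}STA_{c}\) (Eq. 5) can be sketched as follows, assuming the tokens are reshaped to a 2D feature map before the convolution is applied; this is an illustration under these assumptions, not the original implementation:

```python
import torch.nn as nn

class ConvSpatialStream(nn.Module):
    """Spatial stream of l-STA_c: a 3x3 convolution replaces W-MSA (sketch of Eq. 5)."""
    def __init__(self, c_i, h_i, w_i):
        super().__init__()
        self.h, self.w = h_i, w_i
        self.norm1, self.norm2 = nn.LayerNorm(c_i), nn.LayerNorm(c_i)
        self.conv = nn.Conv2d(c_i, c_i, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(c_i, 4 * c_i), nn.GELU(), nn.Linear(4 * c_i, c_i))

    def forward(self, z):                                    # z: (T*mb, N_i, C_i)
        y = self.norm1(z).transpose(1, 2).reshape(-1, z.shape[-1], self.h, self.w)
        y = self.conv(y).flatten(2).transpose(1, 2)          # back to (T*mb, N_i, C_i)
        z = z + y                                            # residual, first line of Eq. 5
        return z + self.mlp(self.norm2(z))                   # residual, second line of Eq. 5
```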

Spatio-temporal attention module:

To compare the separate spatial and temporal feature extraction to a module in which spatio-temporal features are simultaneously computed, we introduce the spatio-temporal attention (\(STA\)) module (cf. Fig. 5). This module consists of two parts: First, \(T\) \(SA\) modules are run in parallel for the individual timesteps to extract spatial features, using Eqs. 1–3. This is similar to the spatial stream of the \(l\text{-}STA_{a}\) module. Afterwards, the \(T\) outputs of the \(SA\) modules are stacked into one feature matrix \(z_{Sp}^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\).

Fig. 5
figure 5

\(STA\) module. Yellow blocks indicate spatial feature extraction, orange blocks indicate spatio-temporal feature extraction. Blue rectangle: layer normalisation, +: Addition, S: stacking of outputs to one feature matrix

In the second part, \(z_{Sp}^{l_{i}}\) serves as input to compute spatio-temporal self-attention. This is done by adapting the window-based self-attention to multi-temporal image data, which is indicated with \(W\text{-}MSA_{ST}\). Compared to the standard W‑MSA (cf. Sect. 3), the number of input patches is extended from \(M\cdot M\) (mono-temporal) to \(M\cdot M\cdot T\) (multi-temporal). This results in query, key and value matrices \(Q_{ST},K_{ST},V_{ST}\in\mathbb{R}^{TM^{2}\times C_{i}}\) serving as input to Eqs. 2 and 3. The number of computations increases by a factor of \(T^{2}\), because self-attention is computed between all combinations of input tokens. The following layers are similar to those of the \(SA\) module including the residual connections, layer normalisation and MLP, which results in the output \(z^{l_{i}}\in\mathbb{R}^{T\times C_{i}\times N_{i}}\). Compared to the \(l\text{-}STA\) module, where only input patches of the same timestep or the same spatial position are used within one self-attention layer (cf. red and yellow cubes in Fig. 4), in the \(STA\) module dependencies in space and time are included in one computation (all cubes that are shown in Fig. 4 depend on each other).
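The rearrangement of tokens for the joint spatio-temporal attention can be sketched as follows; the tensor layout is an assumption for illustration only:

```python
def spatio_temporal_windows(z_sp):
    """Group the window tokens of all T timesteps for joint attention in W-MSA_ST (sketch).
    z_sp: (T, num_windows, M*M, C) -- stacked outputs of the per-timestep SA modules.
    Returns (num_windows, T*M*M, C), so that attention runs over T*M*M tokens per window."""
    t, nw, mm, c = z_sp.shape
    return z_sp.permute(1, 0, 2, 3).reshape(nw, t * mm, c)
```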

4.3 Temporal weighting in skip connections

The temporal weighting module for the skip connections (\(TS\)) is an extension of the standard skip connection and is motivated by the success of the temporal weighting within skip connections by Stucker et al. (2023) and Garnot and Landrieu (2021). For the standard skip connection (cf. Sect. 4.1), the encoder features are transformed to \(C_{dec}\) feature maps by a CB to have the same dimension as the decoder feature maps. This CB is replaced by the new temporal weighting module.

For the \(TS\) module, the separated spatial (\(z_{Sp}^{L_{i}}\)) and temporal (\(z_{Te}^{L_{i}}\)) features from the last \(l\text{-}STA_{a/c}\) module in the corresponding stage are used, and an element-wise multiplication is applied between these two input feature maps. With this operation, the spatial features (\(z_{Sp}^{L_{i}}\)) are weighted by the temporal features (\(z_{Te}^{L_{i}}\)). This is followed by a convolutional block (CB), consisting of a convolutional layer with kernel size \(k=3\) and \(C_{dec}\) feature maps, followed by batch normalization and ReLU activation. The resulting features \(z_{TS}\in\mathbb{R}^{T\times C_{dec}\times H_{i}\times W_{i}}\) are then combined with the corresponding decoder features by element-wise addition, as in the basic model explained in Sect. 4.1.
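A minimal sketch of the \(TS\) module, under the assumption that the spatial and temporal features have already been reshaped to 2D feature maps per timestep:

```python
import torch.nn as nn

class TemporalWeighting(nn.Module):
    """Skip-connection weighting: multiply spatial by temporal features, then apply a CB (sketch)."""
    def __init__(self, c_i, c_dec):
        super().__init__()
        self.cb = nn.Sequential(nn.Conv2d(c_i, c_dec, kernel_size=3, padding=1),
                                nn.BatchNorm2d(c_dec), nn.ReLU())

    def forward(self, z_sp, z_te):                  # both: (T*mb, C_i, H_i, W_i)
        z = z_sp * z_te                             # element-wise temporal weighting
        return self.cb(z)                           # (T*mb, C_dec, H_i, W_i)
```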

4.4 Temporal position encoding

As the temporal position encoding improved the classification performance in previous experiments, we investigate this aspect further. For this purpose, we extend the temporal encoding based on the doy, which is typically used in vegetation analysis of one phenocycle, and add another temporal encoding that is based on the year of acquisition. The motivation is that not only the time within the year is an indicator for specific characteristics of LC classes, but also the year itself, because seasonal variations may differ between years. Of course, this extension requires a sufficient amount of training data for different years and limits the application to years within the training dataset.

The standard temporal encoding employed here is the encoding from our previous work (Voelsen et al. 2023) based on the temporal encoding from Garnot et al. (2020). It employs the acquisition date within a year of the used satellite images:

$$\begin{aligned}te^{doy}_{t,c}=sin\left(\frac{doy(t)}{\tau^{\frac{2c}{C_{in}}}}+\frac{\pi}{2}mod(c,2)\right),\end{aligned}$$
(6)

with \(doy\in[1,\ldots,365]\) being the day of the year, \(t\in[1,\ldots,T]\) the current timestep and \(c\in[1,\ldots,C_{in}]\) the feature index. By using this type of encoding, each doy is encoded into a unique feature vector. This also means that for the same day in different years, the temporal encoding is identical. The temporal position encoding is computed for each input timestep and results in \(TE=[te_{1},\ldots,te_{T}]\in\mathbb{R}^{C_{in}\times T}\), which is added to the output of the patch embedding by element-wise addition. For this purpose, \(TE\) is repeated \(N_{0}\) times to fit the dimensions of the feature vectors after patch embedding (\(z_{PE}\in\mathbb{R}^{T\times C_{in}\times N_{0}}\)) and to integrate the temporal information into every token.

As an alternative to the temporal position encoding based on the doy, we introduce another encoding based on the doy and the year of acquisition that is computed using the following equation:

$$\begin{aligned}\begin{aligned}\displaystyle te^{doy+y}_{t,c}=&\displaystyle\sin\left(\frac{y(t)}{\tau^{\frac{2c}{C_{in}}}}+\frac{\pi}{2}mod(c,2)\right)+\\ \displaystyle&\displaystyle\sin\left(\frac{doy(t)}{\tau^{\frac{2c}{C_{in}}}}+\frac{\pi}{2}mod(c,2)\right),\end{aligned}\end{aligned}$$
(7)

with the same parameters as in Eq. 6 and \(y=year-year_{0}\), where \(year\) is the year of acquisition and \(year_{0}\) the earliest acquisition year in the used dataset. The result of Eq. 7 is added to the output of the patch embedding in the same way as for the standard temporal encoding.
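Both temporal encodings (Eqs. 6 and 7) can be sketched as follows; the value of \(\tau\) is not stated here, so \(\tau=10000\) is assumed in analogy to Vaswani et al. (2017):

```python
import numpy as np

def temporal_encoding(doy, year_offset, c_in, tau=10000.0, use_year=False):
    """Temporal position encoding for one timestep (sketch of Eqs. 6 and 7).
    doy in [1, 365], year_offset = year - year_0; tau=10000 is an assumption."""
    c = np.arange(c_in)
    phase = (np.pi / 2.0) * (c % 2)                    # shifts odd feature indices to a cosine
    denom = tau ** (2.0 * c / c_in)
    te = np.sin(doy / denom + phase)                   # Eq. 6
    if use_year:
        te = te + np.sin(year_offset / denom + phase)  # additional year-based term of Eq. 7
    return te                                          # shape (C_in,), one vector per timestep
```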

4.5 Model variants

In this section we introduce several variants of the new method, the multi-temporal Swin Transformer (\(MTS\)), that use the modules introduced in Sects. 4.2–4.4. With the different combinations we investigate the effect of the individual modules on the performance. An overview of all model variants is given in Table 2.

Table 2 Overview of the different model variants. Att. mod.: Type of attention module that is used, \(TS\): whether the temporal weighting module is used in the skip connections (yes) or not (-), \(TE\): type of temporal position encoding

All model variants have a total number of three stages. For our new method we investigate three main adaptations: Self-attention or convolutional layers for spatial feature computation, temporal weighting within the skip connections (TS) and the temporal encoding variants (TE), which results in the six variants shown in Table 2. The first two variants use the standard temporal encoding based on the doy and no temporal weighting in the skip connections. Model variant \(MTS^{a}_{te(d)}\) uses self-attention for spatial feature computation, while variant \(MTS^{c}_{te(d)}\) uses convolutions. Variants \(MTS^{a}_{te(d)}\text{-}ts\) and \(MTS^{c}_{te(d)}\text{-}ts\) use the temporal weighting module (TS) in the skip connections. The results of the first four variants, all using the standard temporal position encoding based on the doy (Eq. 6), can be used to compare the new STA modules based on convolutions and self-attention, respectively, and to assess the impact of the TS module. The fifth variant, \(MTS^{c}_{te(y)}\text{-}ts\), is based on \(MTS^{c}_{te(d)}\text{-}ts\) but additionally uses the new temporal position encoding based on the year and doy (Eq. 7) and, thus, combines all new aspects of our methodology. Finally, variant \(MTS^{c}\text{-}ts\) does not use any temporal position encoding at all, serving as a baseline for assessing the impact of temporal encoding in general.

Unless otherwise noted, all model variants use the parameter setting from the tiny version of the Swin Transformer from (Liu et al. 2021): \(L=[2,2,6]\) as the number of \(l\text{-}STA\) modules in the different stages, \(h=[3,6,12]\) as the number of heads per stage, \(P=4\) as the patch size, \(C_{in}=96\) as the number of feature layers in the first stage, \(M=7\) as the window size and \(C_{dec}=512\) as the number of feature maps for the UPer-Net.

4.6 Training

During the training process the model weights are iteratively updated by the Adam optimizer, which minimizes a loss function measuring the discrepancy between the labels of the training dataset and the predictions of the model. We use a weighted cross entropy loss \(L_{CrEn}\), considering class weights based on the current ability of the classifier to predict the class labels correctly to counteract class imbalance and differences in the distinctiveness of the classes (Wittich and Rottensteiner 2021). \(L_{CrEn}\) is based on the softmax predictions \(y_{n}^{c}\) for a pixel \(n\) to belong to class \(c\) and is computed for all \(N\) pixels inside the current minibatch:

$$L_{CrEn}=-\frac{1}{N}\sum_{n}\sum_{c}C_{n}^{c}\cdot ln(y_{n}^{c})\cdot cw_{c}.$$
(8)

In Eq. 8, \(C_{n}^{c}=1\) if the \(n^{th}\) pixel of the minibatch belongs to class \(c\), otherwise \(C_{n}^{c}=0\). As all our models predict multi-temporal output maps, the total number of pixels is \(N=mb\cdot T\cdot H_{0}\cdot W_{0}\) (\(mb\) denoting the number of multi-temporal input samples in one minibatch), because the pixel-wise predictions from all \(T\) timesteps in the minibatch are considered in the same way. The class weights \(cw_{c}\) are set to 1 for all classes during the first epoch, which corresponds to using an unweighted softmax cross entropy loss. After the first training epoch, the last minibatch is classified using the current network parameters and the result is used to compute the intersection over union (\(IoU_{c}\)) for every class \(c\):

$$IoU_{c}=\frac{TP_{c}}{TP_{c}+FP_{c}+FN_{c}}.$$
(9)

In Eq. 9, \(TP_{c}\), \(FP_{c}\) and \(FN_{c}\) refer to the number of pixels that are true positives, false positives and false negatives, respectively, with respect to class \(c\). As these results highly depend on the minibatch used for the calculation (it may even happen that a class is not present in that minibatch), starting from epoch 2, we average the \(IoUs\) from the last 10 epochs (or from all available ones before epoch 11). Following (Wittich and Rottensteiner 2021), these \(IoU\) scores are then used to determine the class weights \(cw_{c}\) for the next epoch:

$$cw_{c}=(1-\Delta IoU_{c})^{\kappa}=\left[1-\left(IoU_{c}-\frac{1}{O}\sum_{h=1}^{O}IoU_{h}\right)\right]^{\kappa},$$
(10)

where \(\Delta IoU_{c}\) is the difference between the \(IoU\) of class \(c\) and the mean \(IoU\) of all classes, \(O\) denotes the number of classes, and the hyperparameter \(\kappa\) is used to scale the influence of classes with a lower \(IoU\) on the results. These class weights are used in the loss (Eq. 8) during the following epoch.
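A short sketch of the class-weight computation of Eq. 10 from the averaged per-class \(IoU\) scores:

```python
import numpy as np

def class_weights(iou_per_class, kappa):
    """Adaptive class weights from per-class IoU scores (sketch of Eq. 10)."""
    iou = np.asarray(iou_per_class, dtype=float)      # IoU per class, averaged over recent epochs
    delta = iou - iou.mean()                          # better than average -> positive delta
    return (1.0 - delta) ** kappa                     # classes with low IoU receive weights > 1
```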

5 Experiments

5.1 Dataset

Our test site covers the whole area of the German federal state of Lower Saxony (47,600 km\(^2\)). The dataset comprises Sentinel‑2 images acquired between January 2019 and December 2022. We use Sentinel‑2 Level-2A data, which contain georeferenced bottom-of-atmosphere reflectance and cloud masks derived from the top-of-atmosphere reflectance of every pixel (Bertini et al. 2012). We use the four spectral bands with a GSD of 10 m (red, green, blue, near infrared). All bands are normalized to zero mean and unit standard deviation by using \(v^{\prime}_{i,b}=(v_{i,b}-\mu_{b})/\sigma_{b}\), where \(v^{\prime}_{i,b}\) and \(v_{i,b}\) correspond to the corrected and the original grey value of pixel \(i\) in band \(b\) of an image, respectively, and \(\mu_{b}\) and \(\sigma_{b}\) denote the mean and standard deviation of band \(b\); the values of \(\sigma_{b}\) and \(\mu_{b}\) are computed on a part of the dataset that covers the whole area, using images acquired in 2019 and 2020.
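A minimal sketch of the band-wise normalization, assuming precomputed per-band statistics:

```python
import numpy as np

def normalize_bands(img, mu, sigma):
    """Zero-mean, unit-variance normalization per spectral band (sketch).
    img: (B, H, W); mu, sigma: per-band statistics computed on a subset of the dataset."""
    return (img - mu[:, None, None]) / sigma[:, None, None]
```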

To obtain the class labels to be used in training, information from the official German landscape model ATKIS is used (AdV, 2008). This database contains information about 113 different land use classes, which is too detailed for automatic classification. To define a suitable class structure for LC, several land use classes from the ATKIS database are merged; in the end, seven classes are differentiated: Settlement (stl.), Sealed area (sld.), Agriculture (agr.), Vegetation (veg.), Forest (for.), Water (wat.) and Barren land (bar.). In addition, the class others is used for areas without label information, which are present due to errors in the ATKIS database or in areas outside the state borders. Samples assigned to this class are disregarded both in training and evaluation. The ATKIS database is continuously updated, based on in-situ surveys and aerial flights that take place every three years. The updates are provided every three months (ends of March, June, September and December), resulting in four label maps per year. For the experiments in this paper, these reference label images are rasterized at the GSD of the satellite imagery, and each Sentinel‑2 image is combined with the label image closest in time to its acquisition date. This procedure, which will lead to some label noise because some more recent changes visible in the images are not yet contained in the ATKIS database, is applied both in training and evaluation.

For computational reasons, the available data are split into tiles of 8 \(\times\) 8 \(km^{2}\) (800 \(\times\) 800 pixels; referred to as BE8 tiles in the following), which leads to a total number of 885 tiles covering Lower Saxony (Fig. 6). Images of BE8 tiles that contain more than 5% cloud coverage are excluded based on the provided cloud masks. This results in a variable number of available images per year for different regions (between 7 and 50). For three tiles (shown in red in Fig. 6), the corresponding reference label image was corrected manually for different acquisition dates, resulting in 13 corrected BE8 label images. This manual work was carried out to obtain a reference for the evaluation that is not affected by label noise. The correction changes the class distribution of these tiles; this is discussed in detail in Sect. 5.3.

Fig. 6
figure 6

Overview of the dataset located in Lower Saxony in Germany. The small figure on the bottom-left shows where Lower Saxony is located inside Germany. Squares show the available BE8 tiles of 8\(\times\)8 \(km^{2}\) each. Grey/green: potential training/validation tiles. Red: test tiles with manually corrected reference (dataset \(R_{1}\)). Black: test tiles without corrected reference (dataset \(R_{2}\))

5.2 Experimental protocol

5.2.1 Experimental setup

For all experiments, we split our dataset into a set of 813 BE8 tiles for training and 36 BE8 tiles for validation (grey and green tiles in Fig. 6, respectively). For testing, we use the remaining tiles. Here, we differentiate between the test dataset \(R_{1}\), consisting of the images with corrected labels for the three BE8 tiles mentioned in Sect. 5.1 (the red tiles in Fig. 6), and the test dataset \(R_{2}\), consisting of 36 BE8 tiles that were not used for training and validation (black and red tiles in Fig. 6). Note that the three BE8 tiles that are included in \(R_{1}\) are also included in the test dataset \(R_{2}\), but with the original labels instead of the corrected ones. \(R_{1}\) is not affected by the errors in the labels (label noise), but it contains a limited number of reference labels, so that small changes in the classification results might have a relatively large impact on the quality indices determined for evaluation. In addition, due to geographical differences and the correction process, the class distributions of the tiles in \(R_{1}\) differ clearly from each other and from the distribution of \(R_{2}\) (cf. Table 3). \(R_{2}\) forms a larger set of samples, but it is affected by label noise. The distribution of the class labels in the training, validation and test datasets is shown in Table 3.

Table 3 Class label distribution for the training (Tr.), validation (Val.) and both test datasets (\(R_{1}\): manually corrected BE8 tiles, \(R_{2}\): test dataset without manual correction)

Both for training and inference, our method requires multi-temporal input patches. To generate these patches, we consider a time period of one calendar year (January to December), which is close to the vegetation cycle, as many approaches show improvements especially for vegetation or crop classes when multi-temporal data are used (Ji et al. 2018; Rußwurm and Körner 2020). We split the year into \(T\) time intervals, and one representative image is selected for each interval to form the multi-temporal patch. For \(T=12\), there are 12 intervals of one month each, so that each multi-temporal patch consists of 12 images. At test time, the image selected to be representative for an interval is the one acquired closest in time to the middle of that interval. In this way, the intervals between the acquisition times of the images in one patch are as similar to each other as possible, and the evaluation is based on the same images for all experiments. For training, we found it beneficial to choose the representative image for each time interval randomly from all available images in that interval. Thus, even if the same area is chosen multiple times during training, the images used can still be different, which increases the variability of the training dataset.
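The selection of one representative image per interval can be sketched as follows, assuming the available acquisitions of a tile are indexed by their acquisition date; the function name and the data layout are our own assumptions.

```python
import random
from datetime import date

def select_representative(images: dict, interval: tuple, training: bool):
    """Select one image per time interval: during training a random acquisition from the
    interval, at test time the acquisition closest to the middle of the interval.
    'images' maps acquisition dates to image identifiers."""
    start, end = interval
    candidates = {d: img for d, img in images.items() if start <= d <= end}
    if not candidates:
        return None                                   # no usable acquisition in this interval
    if training:
        return random.choice(list(candidates.values()))
    mid = start + (end - start) / 2                   # middle of the interval
    return candidates[min(candidates, key=lambda d: abs((d - mid).days))]
```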

Training is based on the method described in Sect. 4.6. To create the training patches, we randomly crop windows of \((H_{0},W_{0})=(256,256)\) pixels from the available training tiles and generate multi-temporal patches consisting of \(T=12\) timesteps in the way described in the previous paragraph. We apply random data augmentation, including rotations by 90°, 180° and 270° as well as horizontal and vertical flipping, which results in a large variety of available training patches. During training, the ADAM optimizer (Kingma and Ba 2015) is used with the parameters \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\). The learning rate is set to \(6\cdot 10^{-5}\). These values were found to perform best on the validation dataset and are also those used by Liu et al. (2021) for the Swin Transformer. Training is carried out in epochs, each consisting of a series of iterations that each process one minibatch of input patches. The minibatch size is set to 2 due to the limitations of the available GPU resources. The number of iterations per epoch is set to 5,000, so that in each epoch, 10,000 patches are used to update the parameters. Training continues for a maximum of 100 epochs, but is stopped earlier if the validation accuracy does not improve for 10 epochs. For all architectures, the learning rate is decreased by a factor of 0.7 every 10 epochs.
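The random augmentation described above can be sketched as follows, assuming a multi-temporal patch stored as a numpy array of shape (T, bands, height, width) and label maps of shape (T, height, width); this is a simplified illustration, not the original implementation.

```python
import random
import numpy as np

def augment(patch: np.ndarray, labels: np.ndarray):
    """Randomly rotate (0, 90, 180 or 270 degrees) and flip a multi-temporal patch of shape
    (T, B, H, W) together with its label maps of shape (T, H, W). The same transformation
    is applied to all timesteps to keep images and reference aligned."""
    k = random.randint(0, 3)                              # number of 90 degree rotations
    patch = np.rot90(patch, k, axes=(-2, -1))
    labels = np.rot90(labels, k, axes=(-2, -1))
    if random.random() < 0.5:                             # horizontal flip
        patch, labels = patch[..., ::-1], labels[..., ::-1]
    if random.random() < 0.5:                             # vertical flip
        patch, labels = patch[..., ::-1, :], labels[..., ::-1, :]
    return np.ascontiguousarray(patch), np.ascontiguousarray(labels)
```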

We use a value of \(\kappa=1\) for the exponent in Eq. 10, because this value resulted in a good trade-off between the accuracies of the over- and underrepresented classes. The parameter for the temporal position encoding (Eqs. 6 and 7) is set to \(\tau=10000\).

At test time, the trained models are used to classify the images of all BE8 tiles in the test datasets. As these tiles are larger than the input patch size of the model, a sliding window approach with a horizontal and vertical shift of 128 pixels is applied, each time classifying a patch of 256 \(\times\) 256 pixels generated in the way described above. This results in up to four predictions per pixel, and the resulting softmax scores per class are averaged over all predictions to obtain the final classification scores.
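The sliding-window inference can be sketched as follows, with `predict` standing in for the forward pass of a trained model that returns per-pixel softmax scores; how the windows are aligned with the tile border is not specified in the text, so here the last window in each direction is assumed to be aligned with the border.

```python
import numpy as np

def sliding_window_inference(predict, tile: np.ndarray, num_classes: int,
                             window: int = 256, stride: int = 128) -> np.ndarray:
    """Classify a tile of shape (T, B, H, W) by applying 'predict' to overlapping windows
    of size 'window' and averaging the per-class softmax scores; 'predict' returns scores
    of shape (T, num_classes, window, window)."""
    T, _, H, W = tile.shape
    scores = np.zeros((T, num_classes, H, W))
    counts = np.zeros((H, W))

    def starts(size):                                     # window positions along one axis
        pos = list(range(0, size - window + 1, stride))
        if pos[-1] != size - window:
            pos.append(size - window)                     # align the last window with the border
        return pos

    for y in starts(H):
        for x in starts(W):
            scores[..., y:y + window, x:x + window] += predict(tile[..., y:y + window, x:x + window])
            counts[y:y + window, x:x + window] += 1
    return (scores / counts).argmax(axis=1)               # label map of shape (T, H, W)
```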

5.2.2 Evaluation protocol

For evaluation, the classification results achieved for the BE8 tiles of the test sets are compared to the reference on a per-pixel level, and a confusion matrix is determined, taking into account the predictions for all timesteps equally. From that confusion matrix, several quality indicators are derived. For each class \(c\), we compute the F1-score \(F^{1}_{c}\) based on the number of true positive (\(TP_{c}\)), false positive (\(FP_{c}\)) and false negative (\(FN_{c}\)) predictions for that class:

$$\begin{aligned}F^{1}_{c}=\frac{2\cdot TP_{c}}{2\cdot TP_{c}+FP_{c}+FN_{c}}\end{aligned}$$
(11)

Additionally, we report the overall accuracy (OA), i.e. the percentage of pixels with correctly predicted class labels, and the mean F1-score (mF1) over all classes as global metrics. The OA is somewhat biased by the imbalanced class distribution of the dataset (cf. Table 3). The mF1 is not influenced by the class imbalance, because the impact of \(F^{1}_{c}\) on the mF1 is equal for all classes, independently of the number of pixels corresponding to that class. These quality indicators are determined for both test datasets, \(R_{1}\) and \(R_{2}\). Each experiment is repeated three times, each time starting from a different random initialization and using different random batches for training, to assess the influence of these random components on the results. We present the average values and standard deviations of all quality metrics computed over the three runs per experiment.
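The quality indicators can be derived from the confusion matrix as sketched below; the function name and the matrix layout (rows: reference, columns: prediction) are our own assumptions.

```python
import numpy as np

def evaluation_metrics(conf: np.ndarray):
    """Compute per-class F1-scores (Eq. 11), the mean F1-score (mF1) and the overall
    accuracy (OA) from an O x O confusion matrix (rows: reference, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1.0)
    return f1, f1.mean(), tp.sum() / conf.sum()
```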

5.2.3 Test setup

We conducted different sets of experiments to evaluate our methodology. First, in Sect. 5.3, we investigate in more detail the accuracies achieved for the three corrected tiles by the variant of our method that combines all new developments and uses the hybrid light spatio-temporal attention module, \(MTS^{c}_{te(y)}\text{-}ts\). We conduct this analysis because we observed a rather large gap between the accuracy values achieved on \(R_{1}\) and \(R_{2}\) in all of our experiments, and we want to identify possible reasons for this behaviour; this also helps to interpret the evaluation metrics presented later. In Sect. 5.4, we compare the variants of our method introduced in Sect. 4.5. With this set of experiments, we investigate the \(l\text{-}STA_{a}\) and \(l\text{-}STA_{c}\) modules, analysing whether convolutions are better suited for spatial feature extraction than spatial attention modules and assessing the impact of the temporal weighting module (TS) and the different temporal position encodings (TE) on the classification performance, the memory footprint and the computation times. Based on the model variant selected as the representative one in that section, we analyse the performance of that model for different parameter settings in Sect. 5.5, trying to find an optimal configuration for our method.

Finally, we compare that optimal variant of our method to three baseline models in Sect. 5.6: an FCN model without any self-attention layers and a multi-temporal Swin Transformer (\(Swin_{S1}\)), both from our previous work (Voelsen et al. 2023), and the Utilise model from Stucker et al. (2023). The FCN is a U-Net adapted for multi-temporal input and output. We use a variant in which spatial features are extracted by convolutions in parallel for all timesteps in the encoder and decoder; in the bottleneck, we combine the temporal dimension with the feature map dimension to compute spatial-temporal features. \(Swin_{S1}\) is the best performing Swin Transformer variant from (Voelsen et al. 2023), in which the \(STA\) module instead of the \(l\text{-}STA\) module is used in the first stage and the \(SA\) module is used in stages 2–4 of the model. Utilise is a convolutional encoder-decoder network that uses a self-attention layer in the bottleneck (cf. Sect. 2) and is one of the few models with multi-temporal input and output. As this model was developed for cloud removal, we change the last layer to a convolutional layer with \(k=1\), followed by a softmax layer, to map the feature vectors to normalized class scores. All other details of the model are kept the same as in (Stucker et al. 2023). For all of these models, we used the same training procedure as described in Sect. 5.2.1, but with a learning rate of 0.001, as this value performed best in previous experiments with FCNs.

It has to be noted that training was carried out on GPUs with different capacities due to the limited availability of GPU resources. For this reason, when comparing processing times, we use the average inference time [ms] per input patch. These numbers are comparable because in all experiments, inference was carried out on the same GPU (NVIDIA GeForce RTX 3090, 24 GB).

5.3 Analysis of the corrected test dataset \(R_{1}\)

In all our experiments, we observed a gap in the accuracy scores between \(R_{1}\) and \(R_{2}\) (up to 10% in the mF1), which raises the question of what causes this gap. One possible reason is the obvious difference between the class distributions of the test datasets \(R_{1}\) and \(R_{2}\) (cf. Table 3). Another possible reason is the label noise contained both in the training data and in the test dataset \(R_{2}\): the classifier might learn some erroneous patterns that are also present in \(R_{2}\), leading to over-optimistic quality metrics in an evaluation on that dataset. Before presenting the actual empirical evaluation in the subsequent sections, we investigate the accuracy metrics on \(R_{1}\) in more detail to get a better idea of the possible reasons for the observed performance gap. This analysis is based on variant \(MTS^{c}_{te(y)}\text{-}ts\), which we consider to be representative for our method; the observed trends in the accuracy scores are similar for all variants. The evaluation results for \(R_{1}\) and \(R_{2}\) as well as for the individual tiles of \(R_{1}\) are shown in Table 4; some qualitative results are shown in Fig. 7. The F1-scores for the individual classes differ between the individual tiles and also between \(R_{1}\) and \(R_{2}\), especially for the classes Sealed area, Vegetation and Barren land. For these classes, the accuracies on \(R_{1}\) are worse than those obtained on \(R_{2}\). The qualitative examples in Fig. 7 allow these results to be analysed in more detail.

Table 4 Results for LC classification using the variant \(MTS^{c}_{te(y)}\text{-}ts\) for the individual tiles in the corrected test dataset \(R_{1}\) and the complete datasets \(R_{1}\) and \(R_{2}\) in comparison. We also give average results for the corrected reference \(R_{1}\) for tiles 2 and 3. All numbers are averages over three test runs with corresponding standard deviations
Fig. 7
figure 7

Label maps for the corrected test dataset \(R_{1}\) (rows 1 and 2), test dataset \(R_{2}\) (row 3), prediction for one timestep (June 2020) with model \(MTS^{c}_{te(y)}\) (row 4) and Sentinel‑2 image (row 5). In row 1 the whole BE8 tiles are shown, while all other rows show a smaller part in more detail (det.). The areas (1)–(5) marked by black ellipses show different examples for differences between \(R_{1}\) and \(R_{2}\) that are discussed in the main text. The colours correspond to: red – stl., grey – sld., yellow – agr., light green – veg., dark green – for., blue – wat., brown – bar.

The class Sealed area is correctly classified primarily for larger roads like highways or parking areas, which only occur in tile 3 (cf. the highlighted example 1 in Fig. 7). Rural roads contained in tiles 1 and 2 are not predicted; as they are very narrow, such roads are sometimes not even included in the training data. This results in a poor performance for this class independently of the dataset, but in particular for tiles 1 and 2, and shows the limitations of the image data used: due to the rather coarse spatial resolution of 10 m, it is challenging and in some cases impossible to delineate fine structures such as streets that might be less than 10 m wide. Another example that results in a low F1-score on \(R_{1}\) is the area of pixels erroneously assigned to the class Barren land in the upper part of tile 1 (example 2 in Fig. 7). As there are only very few samples of that class in that tile, this relatively large area has a large negative impact on the F1-score. These examples indicate that some quality metrics achieved for tile 1 may not be representative for the entire area to be classified.

There are several examples in which the class labels on \(R_{2}\) were wrong and had to be corrected for \(R_{1}\). Most of these corrections involved the classes Agriculture and Vegetation, which are difficult to differentiate in some cases, even for a human operator. An example can be seen in the coastal area of tile 1 (cf. Fig. 7). In \(R_{1}\) the upper part is assigned to the class Vegetation while the lower part is assigned to Agriculture. In \(R_{2}\) it is the opposite. In tile 2, an area covered by forest at the boundary of several lakes (example 3 in Fig. 7) was labelled as Vegetation and Barren land in \(R_{2}\) and was corrected to Forest in \(R_{1}\). For both of these examples, the prediction is similar to the class labels from \(R_{2}\). Consequently, this erroneously improves the metrics for \(R_{2}\) while decreasing the metrics on \(R_{1}\). If the errors in the database shown in these examples are representative, they could be explained by similar patterns occurring in the training data, where they would be considered to be label noise.

Of course, there are also examples in which the prediction agrees with label corrections in \(R_{1}\). The lake in tile 2 (example 4 in Fig. 7) is not included in the database but is predicted correctly. A part of the settlement area (example 5 in Fig. 7) was relabelled as Vegetation, which is also the predicted label. As these areas are relatively small, they have only a small effect on the accuracy metrics. For some classes, e.g. Settlement, Agriculture or Water, the performance on \(R_{1}\) and \(R_{2}\) is similar. These classes either have more samples (Agriculture) or are clearly distinguishable from the other classes (Settlement, Water).

To summarise, the F1-scores on \(R_{1}\) show a high variability depending on the tile that is analysed, and in general the dataset is very small, so that relatively small areas of wrong classification results can have a large impact on the quality metrics. On the other hand, the analysis showed that the results on \(R_{2}\) are affected by the label noise in the reference, which leads to a positive bias in the performance metrics. Ultimately, the numbers achieved on \(R_{1}\) are more reliable and, thus, are our main focus in the subsequent analyses. It could be argued that in particular tile 1, in which water is very dominant and only very narrow roads occur, is not representative for the entire state; this is also obvious when looking at the average quality metrics achieved for tiles 2 and 3 in Table 4, which are closer to those achieved on \(R_{2}\) than to those achieved on the entire dataset \(R_{1}\). Nevertheless, restricting ourselves to these two tiles would further reduce the number of pixels on which the quality metrics are based. Thus, we focus on the complete set \(R_{1}\), but present results for \(R_{2}\) as well. It may be noted that even though the values of the quality indices are different, the tendencies are very similar, in particular when ranking methods based on average indices.

5.4 Evaluation of the new \(MTS\) variants

5.4.1 Quantitative evaluation

In this section we evaluate the results achieved by our new method, comparing the variants introduced in Sect. 4.5. The mean accuracy metrics along with their standard deviations, the number of parameters and the computational times for inference on a single image patch are shown in Table 5 for both test datasets. Class-specific metrics achieved on test dataset \(R_{1}\) are shown in Table 6.

Table 5 Results for LC classification with the different model variants introduced in Sect. 4.5. Sp. att.: attention module used in the model (spatial feature extraction using a – attention or c – convolutions). \(TS\): use of the temporal weighting module in the skip connections. \(TE\): type of temporal position encoding. \(\#\)p.: number of parameters of the model in millions. t: inference time for a single image patch in \([ms]\). The quality indices are averages over three test runs, the numbers underneath these indices are the corresponding standard deviations. Best accuracy scores are indicated in bold
Table 6 Class-specific quality metrics achieved on test set \(R_{1}\) for LC classification using the different model variants introduced in Sect. 4.5. The table presents averages over three test runs and the corresponding standard deviations

Variant \(MTS^{a}_{te(d)}\), which uses self-attention for spatial feature computation and the standard temporal encoding based on the doy, achieves the lowest accuracies of all variants. Its mF1 and overall accuracy (OA) are about 2% worse than those of the other variants, independently of the dataset used for evaluation (e.g. an mF1 of 65.4% and 72.3% on \(R_{1}\) and \(R_{2}\), respectively). In addition, the standard deviations of the mF1 and OA for this variant are five to six times larger than those of all other variants, mainly due to higher deviations in the individual F1-scores for Sealed area, Vegetation and Water (cf. Table 6). These large standard deviations indicate a certain instability of the model. When convolutions are used in the spatial dimensions (variant \(MTS^{c}_{te(d)}\)), the mF1 improves by 1.9% and 3.7% on \(R_{1}\) and \(R_{2}\), respectively. The largest improvements are achieved for the classes Sealed area (+4%), Vegetation (+3%) and Water (+5.6%). A similar behaviour is observed for variant \(MTS^{a}_{te(d)}\text{-}ts\), which uses the temporal weighting module in the skip connections and self-attention for spatial feature computation. Its mF1 improves by 1.7% and 1.3% on \(R_{1}\) and \(R_{2}\), respectively, compared to variant \(MTS^{a}_{te(d)}\). This is slightly less than the improvement for variant \(MTS^{c}_{te(d)}\), but the difference is very small and not statistically significant. In both cases, the standard deviations clearly decrease to values between 0.1% and 0.5%, indicating a higher stability of the two variants. Combining the use of convolutions and the \(TS\) module, which is done in variant \(MTS^{c}_{te(d)}\text{-}ts\), leads to a minor improvement in the mF1, which, however, is again not statistically significant (+0.1% mF1 on both datasets). In summary, the accuracies improve and the results are clearly more stable, as indicated by the standard deviations, when either convolutions are used in the spatial dimensions or the \(TS\) module is integrated. This corroborates our assumption that convolutions are well suited for spatial feature extraction, but our results also indicate that there is no clear advantage of using them; adding the \(TS\) module to the variant using spatial self-attention has a positive effect of a similar magnitude.

Among the first four variants shown in Table 5, variant \(MTS^{c}_{te(d)}\text{-}ts\) achieves the best mF1 (67.4% and 76.1% on \(R_{1}\) and \(R_{2}\), respectively). When additionally using the new temporal position encoding based on the doy and year of acquisition (variant \(MTS^{c}_{te(y)}\text{-}ts\) in Tables 5 and 6), we again observe a small increase in the mF1 of 0.1–0.2%. This variant also achieves the best F1-scores for three out of seven classes. Nevertheless, these improvements are not statistically significant. The competitive performance of variant \(MTS^{c}\text{-}ts\), which does not use any \(TE\), shows that for our application the additional information about the acquisition date does not have a significant impact on the results.

Analysing the number of parameters and the inference times in Table 5, we observe that both increase when using convolutions instead of self-attention for spatial feature extraction, whereas all other modifications only have a minor impact. Using convolutions increases the inference time from 26 ms to 36 ms, whereas, for instance, using the \(TS\) module has only a small impact (from 26 ms to 27 ms). A variant based on spatial self-attention that also uses the \(TS\) module (\(MTS^{a}_{te(d)}\text{-}ts\)) might thus be a good choice if inference time is a critical factor. Nevertheless, we consider variant \(MTS^{c}_{te(y)}\text{-}ts\), which combines all of our developments, to be representative for our method. It achieves the best performance on \(R_{1}\) with 67.6% mF1 and the best F1-scores for three out of seven classes, while also being relatively stable according to the standard deviations of the quality indices. Thus, we use this variant as the basis for the experiments described in Sects. 5.5 and 5.6.

To summarise, the \(TS\) module and the convolution-based module improve the results on both test datasets, even if the improvement is not significant. Using either of these modules leads to more stable results, as expressed by the standard deviations of the quality metrics. Using convolutions results in a larger memory footprint of the model and in larger inference times. The combination of the two modules achieves the overall best performance, even though the results are at a similar level for all variants except the one that uses spatial self-attention and no \(TS\) module.

5.4.2 Qualitative results

Some examples of output label maps generated with model variant \(MTS^{c}_{te(y)}\text{-}ts\) are shown in Fig. 8. In this figure, six of the twelve timesteps that are combined in one input timeseries are shown. The upper row shows the Sentinel‑2 images that were used for classification, from February to July. The appearance of the different classes clearly changes due to different illumination conditions, e.g. caused by clouds and their shadows, but also due to the seasonal variations that are especially visible for Agriculture and Vegetation. The second row shows the output generated for each of the input timesteps, and the third row shows the corresponding part of the test dataset \(R_{2}\) in comparison to the corrected reference (\(R_{1}\)).

Fig. 8
figure 8

Qualitative results for an example area included in \(R_{1}\) with model variant \(MTS^{c}_{te(y)}\text{-}ts\). The colours correspond to: red – stl., grey – sld., yellow – agr., light green – veg., dark green – for., blue – wat., brown – bar. \(R_{2}\) and \(R_{1}\) are used for evaluation for all processed images that were acquired within a period of three months.

The output maps show relatively constant predictions over the whole timeseries; there are only minor differences, e.g. at object borders. Class boundaries are softer than those in the reference; predictions at these boundaries are also less reliable, as the probabilities of several classes are close to each other. In all experiments, the results for the class Sealed area were worse than those for the other classes, caused by the fine structures and some missing labels, as discussed in Sect. 5.3. The highway that is visible in the images is an example in which this class is predicted correctly. This is probably due to the larger width of a highway compared to rural roads and to the access and exit streets, which together form a cluster of pixels belonging to Sealed area. At the top of the images, a part of the forest (black circle) is not included in the database, but is visible in all satellite images. This area is classified correctly as Forest in all output maps, which improves the accuracy on \(R_{1}\). A positive effect of using timeseries can be seen in the predictions corresponding to images with some clouds: the output maps are not affected at all, which indicates that the temporal dependencies help the model to use images from neighbouring timesteps to classify the occluded areas.

5.5 Parameter studies

In this section we analyse the impact of several hyper-parameters based on the model \(MTS^{c}_{te(y)}\text{-}ts\). The parameters we investigate are the number \(St\) of stages, the number \(C_{in}\) of input feature maps to the first stage and the patch size \(P\). The resultant quality indices, the numbers of parameters of the models, and the mean inference times for a single image patch are shown in Table 7. For all variants we only indicate the changed hyper-parameter compared to \(MTS^{c}_{te(y)}\text{-}ts\) from the previous sections.

First, we investigate the number of stages used in the model architecture. When using the default patch size of \(P=4\), one token in stage 4 contains the information of 32 \(\times\) 32 pixels of the input image, which corresponds to 320 m \(\times\) 320 m in object space at the GSD of 10 m. This raises the question whether features at this resolution are meaningful for the classification task. We conduct two experiments with variants having four stages (\(MTS^{c}_{te(y)}\text{-}ts\text{-}S_{4}\)) and two stages (\(MTS^{c}_{te(y)}\text{-}ts\text{-}S_{2}\)), respectively; note that \(MTS^{c}_{te(y)}\text{-}ts\) uses three stages. The results in the first three rows of Table 7 show that the number of stages has only a small influence on the quality of the results. There are small variations, e.g. variant \(S_{4}\) achieves a lower mF1 on \(R_{1}\) (−0.6%) but a slightly better mF1 on \(R_{2}\) (+0.3%) than \(MTS^{c}_{te(y)}\text{-}ts\). However, all these variations are within the range of the standard deviations, so that there is no significant difference. In contrast, the number of stages has a large impact on the number of parameters and on the inference time (last two columns in Table 7). For instance, the inference time decreases by one third if two instead of three stages are used; similarly, the number of parameters is reduced by more than a factor of two. We conclude that the deeper layers do not have a significant impact on the classification performance, probably because of the low spatial resolution in these stages when Sentinel‑2 data are used as input.

Table 7 Evaluation of the results for LC classification using variant \(MTS^{c}_{te(y)}\text{-}ts\) with different parameter settings. The numbers in the first row are identical to those in Table 5. \(St\): number of stages, \(C_{in}\): number of features in the first stage, \(P\): patch size [pixel]. The quality indices are averages over three test runs with standard deviations. \(\#\)p.: number of parameters of the model in millions. t: inference time for a single image patch in \([ms]\).

The second set of experiments, the results of which are reported in the fourth and fifth rows of Table 7, is dedicated to the effects of using a different number \(C_{in}\) of input features. The default, also used in (Liu et al. 2021), is \(C_{in}=96\). We investigate the influence of this parameter by setting it to \(C_{in}=48\) (variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\)) and to \(C_{in}=144\) (variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{144}\)). Similarly to the number of stages, the results for all variants with different values of \(C_{in}\) are at a similar accuracy level. The differences are small and within the range of the standard deviations. The largest differences occur for variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{144}\): the mF1 on \(R_{1}\) decreases by 0.7%, while the mF1 on \(R_{2}\) improves by 0.7%. This might be an indicator that the larger model starts to memorise patterns of the label noise in the training data. Again, the number of parameters and the inference time are affected. For \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{144}\), the inference time increases to 44 ms and the number of trainable parameters increases by more than 60% compared to the default setting; when using a smaller number of features (\(C_{in}=48\)), the memory footprint is reduced to about 60% of the original one, while the computation time remains almost constant. To summarise, using more input features does not improve the performance but increases the memory footprint and the computation times. For our application, using a smaller number of input features than in the original Swin Transformer achieves similar accuracies and reduces the memory footprint.

In the last experiment, we reduce the patch size from \(P=4\) to \(P=2\) for variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\) (last row in Table 7). We expect this variant to achieve better results, especially for finer structures (e.g. streets), because only \(2\times 2\) pixels are merged in the patch merging step. For reasons of computational complexity and because of the competitive results achieved by variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\), we use \(C_{in}=48\); we also had to reduce the minibatch size to one during training. We do not further reduce the number of stages, because a reduction of \(P\) has the same effect as omitting a downsampling layer in terms of the receptive field of a token in the bottleneck layer of the network. Variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\) achieves the overall best results on \(R_{1}\) (67.9% mF1 and 82.4% OA). Nevertheless, the improvement is not significant compared to \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\) (+0.1% mF1) or \(MTS^{c}_{te(y)}\text{-}ts\) (+0.3% mF1). The accuracy for the class Sealed area stays at a constant level, so our expectation that a smaller patch size improves the performance for fine structures like streets is not fulfilled. Compared to \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\), the number of parameters stays the same, but the computation time increases, as more input patches are processed in all layers of the model.

To summarise, the results of these experiments show that smaller models achieve competitive performance while reducing training and inference time. For all experiments, the differences in the accuracy metrics are small and not statistically significant. The variant that uses a small patch size of \(P=2\), \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\), achieves the best performance on \(R_{1}\), which is why we use this variant for the comparison to the baseline methods presented in the next section.

5.6 Comparison to baseline methods

In this section, we compare our method to the baseline methods. As baselines, we use the FCN and \(Swin_{S1}\) models from our previous work (Voelsen et al. 2023) and the Utilise model from Stucker et al. (2023), as introduced in Sect. 5.2.3. We compare these baselines to variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\), the variant that performed best on dataset \(R_{1}\). The resulting evaluation metrics as well as the number of parameters and the mean inference times are shown in Table 8. The numbers for our method (\(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\)) are identical to those in Table 7.

Table 8 Results for LC classification for our method (\(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{48}\text{-}P_{2}\)) and three baseline methods. The quality indices are averages over three test runs with standard deviations. \(\#\)p.: number of parameters of the model in millions. t: inference time for a single image patch in \([ms]\).

The FCN, which does not use any self-attention layers, achieves the lowest accuracies, with an mF1 that is 2.3% lower on \(R_{1}\) and 1.2% lower on \(R_{2}\) than that of our method. These differences are three to four times larger than the standard deviations and show that our method performs better than the FCN. The advantage of the FCN is its reduced inference time while still having a number of parameters comparable to our method. These results highlight the advantage of combining self-attention with convolutions, as done by all other models in this comparison.

\(Swin_{S1}\) is the largest model in the comparison, having nearly 60 million parameters. As the \(STA\) module only computes spatial-temporal attention in the first stage, the computation time is similar to that of our model. On dataset \(R_{1}\), \(Swin_{S1}\) performs slightly better than the FCN, but the difference is not significant. Our method outperforms \(Swin_{S1}\) by 1.8% mF1 on \(R_{1}\), which is a significant improvement considering the standard deviations of at most 0.6%. On \(R_{2}\), \(Swin_{S1}\) achieves the overall best results. The latter might again be due to the model learning patterns of the label noise contained in the training data, an effect we also observed for other models with a larger number of parameters, e.g. variant \(MTS^{c}_{te(y)}\text{-}ts\text{-}C_{144}\) (cf. Sect. 5.5).

The Utilise model from Stucker et al. (2023) achieves slightly better accuracies on \(R_{1}\) than the FCN and \(Swin_{S1}\). Due to the higher standard deviations of Utilise, these improvements are not significant, and our method still performs better by 1.3% in mF1 on \(R_{1}\). The main reason for the higher standard deviations is the unstable classification results for the class Sealed area, which could not be detected at all in one of the three runs of Utilise; this results in standard deviations of the mF1 between 2% and 3%. For the other classes, Utilise achieves a performance that is competitive with our method. The number of parameters of Utilise is drastically smaller than that of all other methods (only 1 million). On the other hand, its inference time is nearly twice as high as that of our method.

These results show that the combination of self-attention and convolutions is well suited for multi-temporal LC classification and achieves significantly better results than the FCN and \(Swin_{S1}\) models. The Utilise model achieves slightly lower accuracies than our method, but these differences are not significant. Due to the characteristics of multi-temporal image data, which result in 4D input data (time, spectral bands, height and width), the GPU memory consumption as well as the training and inference times are still limiting factors. With our light spatio-temporal attention module, we are able to compute spatial-temporal features in all stages of the model while keeping the training and inference times at a similar or lower level compared to the other baselines that use self-attention.

6 Conclusion

In this paper, we have introduced a new model architecture for multi-temporal land cover classification with satellite image time series. It is based on a light spatio-temporal attention module (\(l\text{-}STA_{c}\)) that computes spatial features using convolutions and temporal features using self-attention in separate streams of the module. Furthermore, it uses a temporal weighting module in the skip connections of the encoder-decoder structure, which weights the spatial features based on the temporal features from all timesteps, and a temporal position encoding based on the acquisition doy and year. Our experiments show that the usage of convolutions to compute spatial features and the temporal weighting module both result in higher accuracies and a more stable performance compared to the model that does not use any of these adaptations. The combined usage of both modules and the temporal position encoding based on the doy and year of acquisition results in the overall best performance with slightly better accuracy scores. However, the differences between these variants of our method are not statistically significant. Additional parameter studies showed that models with fewer parameters, e.g. obtained by using fewer stages or fewer input features, achieve competitive performance compared to the much larger variants while reducing training and inference time. A comparison to other baselines showed that our method outperforms a purely convolutional model (FCN) by 2.1% and \(Swin_{S1}\) by 1.8% mF1 on a corrected test dataset, both of which are significant improvements. We also achieve better results than the Utilise model from (Stucker et al. 2023), although this difference is not significant due to the more unstable performance of Utilise. The average inference time of our model is almost half that of Utilise (38 ms for our method, 65 ms for Utilise). We conclude that the combined usage of convolutions and self-attention is well suited for the task of multi-temporal LC classification. Using the \(l\text{-}STA_{c}\) module, which computes spatial and temporal features in separate streams, we achieve a better performance than with the \(STA\) module (\(Swin_{S1}\)), while the computation time stays at a similar level. We attribute this to the fact that the \(l\text{-}STA_{c}\) module is used in all stages. Our adaptations of the temporal position encoding did not result in significant improvements of the results.

Future research will further investigate the performance of the different model variants on other datasets and for other applications in the field of SITS. By using other datasets, e.g. from (Toker et al. 2022), which include more reliable training samples for class changes within the timeseries, the prediction of class transitions can be investigated further and the method can be adapted to the task of change detection. The extension to other classification tasks and to different types of input data, e.g. aerial images with a much lower GSD, is also of interest. Especially for aerial images, we expect that the model needs to be adapted, e.g. by using more stages, as the receptive field in object space changes drastically with the smaller GSD. A combined usage of satellite and aerial image data is another potential area of future research, as satellite data provide images with a high temporal resolution, while the higher spatial resolution of the aerial images can be used to achieve a better classification performance for fine structures. Another interesting aspect is the integration of SITS of varying length. Our approach is based on a fixed number of input images, and the timeseries always begin in January and end in December. An extension regarding the number of images and the period covered by the input timeseries might further improve the classification performance; additionally, such a model would be more flexible regarding the classification of the latest available satellite image. Other aspects to investigate are further adaptations of the model architecture, e.g. replacing the convolutional decoder with self-attention based layers, similar to our approach for the encoder. Additionally, we want to analyse more adaptations regarding model size, as our results show competitive performance for lighter model variants.