1 Introduction

Understanding the evolution of land use (e.g., construction activity and changes of the land cover) is crucial in fields like urban planning, agriculture, natural resource management, anticipating housing market prices, and even autonomous driving and flying. For the latter, visual positioning systems that do not rely on GNSS are a concrete application example (Daedalean 2021). One of the critical components of a purely vision-based positioning system is an up-to-date map of the environment. Especially in emergencies, a map with maximal confidence is crucial to find safe landing sites; hence, regular map updating missions must be conducted, often involving resource-intensive survey flights. A system that is able to anticipate changes could help to prioritize the locations of future survey flights and potentially save unnecessary flights in regions with no change.

Predicting urban transformations is a complex endeavor that requires advanced image understanding. Nonetheless, current research predominantly employs traditional, non-data-driven methods or relies on pixel-wise multi-layer perceptrons (MLPs). In response, we strive to bridge this research gap by introducing a data-centric methodology that enables the effective training of fully convolutional neural networks specifically designed to anticipate urban change.

In the present work, we aim to forecast where and when changes in building footprints will happen. Change forecasting is a binary segmentation problem with labels “change” and “no change”, where the forecasting range (i.e., the time span between the acquisition of the query image and the actual change) is defined implicitly by selecting training samples with a fixed forecasting range. We use satellite images as the primary input data source because they provide global, uniform coverage. The SpaceNet7 dataset provides an adequate compromise between these requirements. The publicly available, annotated subset comprises 60 locations with a total extent of 960 km\(^{2}\) and a ground sampling distance (GSD) of 4 m. The dataset consists of up to 24 image time steps per location, with a temporal resolution of 1 month, which allows us to analyze the model’s performance with respect to different time ranges. We develop a 2-stage training strategy (as shown in Fig. 1) that is centered around a U-Net segmentation backbone (Ronneberger et al. 2015).

Fig. 1 Modular model architectures and general workflow. In Stage 1, we train a feature extractor (a U-Net with ResNet50 encoder) in a Siamese setup to detect the pixel-wise urban changes \(\hat{p}_d\) from two satellite images. In Stage 2, the feature extractor is repurposed for the change forecasting task, which is to forecast the change as \(\hat{p}_f\) and also to predict when it will happen as \(\hat{p}_e\). The “Head”-CNN is not transferred, but trained separately for each stage

  • In Stage 1, we train a change detection network with a Siamese layout based on a U-Net backbone, using as input pairs of satellite images of the same location, acquired at different times.

  • In Stage 2, we keep the backbone of Stage 1 and use it as a feature extractor for change forecasting. Moreover, we slightly adapt the architecture to also produce a time range forecast, in which the model needs to anticipate within which time window the change will occur. We implement this task via a multi-task learning setup, where the model predicts, in addition to the binary change label, an ordinal label that indicates when a change is expected to happen.

2 Related Work

Change detection is a well-explored task in remote sensing. Early approaches relied on hand-crafted workflows to detect changes in satellite images, while later, classical machine learning approaches were used to automatically classify hand-crafted features (Singh 1989; Hussain et al. 2013; Le Saux and Randrianarivo 2013; Metzlaff 2015; Wessels et al. 2016). With the rise of deep learning, researchers started to employ neural networks that learn the features themselves (El Amin et al. 2017; Zhu 2017; Liu et al. 2020). Siamese networks that share the same feature extractor across multiple images turned out to be a suitable inductive bias for several computer vision tasks, including stereo matching (Zagoruyko and Komodakis 2015) and object tracking (Bertinetto et al. 2016). In remote sensing, too, researchers employed Siamese feature extractors (Zhan et al. 2017), often as part of an end-to-end trainable network (Daudt et al. 2018b, a; Arabi et al. 2018). Recent studies have extended this concept to multi-task scenarios (Liu et al. 2019), network architectures based on attention mechanisms (Chen et al. 2020) and multi-scale features (Yang et al. 2021).

There have been a few attempts to predict future land cover with traditional machine learning. Iacono et al. (2012) make the rigid assumption that the land use/land cover (LULC) class is a discrete state that depends solely on its previous state. In this way, they are able to apply Markov chains to model state changes over time. However, this relatively strong assumption may not always be valid, and it limits the use of auxiliary data. Land transformation models (Pijanowski et al. 2002; Tang et al. 2005; Newman et al. 2016; Pijanowski et al. 2020) are established methods for LULC forecasting that allow the inclusion of additional social, political, and environmental drivers and process them via an MLP. Chu et al. (2010) used Markov chains (MC) to forecast land use changes, while later, Nguyen et al. (2020) proposed an approach that employs satellite imagery as a driving variable to forecast LULC changes with MLP Markov neural networks.

To the best of our knowledge, there exists no published research about deep learning models for change forecasting, apart from one notable exception: Rußwurm et al. (2020) employ a recurrent neural network (RNN) to model time series data and forecast low-resolution satellite observations—e.g., the MODIS NDVI—in an autoregressive manner. That method does not directly predict urban changes; rather, it predicts future satellite observations that may, or may not, enable a subsequent change detection.

3 Methodology

The question of which pixels in a satellite image will change is ill-posed: change is only defined w.r.t. a finite time interval, but a single image does not delimit a time interval (contrary to conventional change detection, where the interval is the time between the two acquisition dates). Therefore, we must define a time horizon and pose the simpler and more meaningful question of whether or not a change will occur within a given, fixed time frame. In a machine learning context, this can be accomplished by training only on image pairs with the relevant time window, thus implicitly establishing the forecasting range. The limitation to a fixed forecasting range comes with a disadvantage, namely that one can only use the subset of the overall data that has the appropriate temporal spacing for training. In other words, one ignores much of the possibly available data. To still exploit all samples, we pretrain the backbone on a change detection task (Stage 1), where different time spans can be mixed. Subsequently, we fine-tune the pretrained backbone for the change forecasting task (Stage 2), as shown in Fig. 1 and described in the following subsections.
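
To make this concrete, the following is a minimal sketch of how such a fixed-gap pair selection could look; the data structure and function names are our own illustration, not the released code.

```python
# Illustrative sketch: the forecasting range is fixed implicitly by keeping
# only image pairs with the desired temporal gap. Names are hypothetical.
from itertools import combinations

def make_pairs(acquisitions_by_location, forecasting_range_months=None):
    """Build temporally ordered (earlier, later) image pairs.

    acquisitions_by_location: {location: [(month_index, image_path), ...]},
    with each per-location list sorted by month_index.
    If forecasting_range_months is None, all temporal gaps are allowed
    (Stage 1, change detection); otherwise only pairs with exactly that
    gap are kept (Stage 2, change forecasting).
    """
    pairs = []
    for location, acquisitions in acquisitions_by_location.items():
        for (m1, img1), (m2, img2) in combinations(acquisitions, 2):
            gap = m2 - m1
            if forecasting_range_months is None or gap == forecasting_range_months:
                pairs.append((location, img1, img2, gap))
    return pairs
```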

3.1 Stage 1: Change Detection

In this stage, we train a conventional Siamese network (Daudt et al. 2018a; Arabi et al. 2018; Yang et al. 2021) to detect changes in temporally ordered pairs of satellite images. The network follows the U-Net architecture with a ResNet50 encoder (He et al. 2016), with shared weights in the two branches and an output feature dimension of 16 per branch. The feature maps of the two U-Net branches are concatenated and fed into a classification head with two hidden convolutional layers with \((3\times 3)\) kernels and depth 16, to obtain the final pixel-wise predictions. This stage has two advantages: we can use all the available training pairs, and the pretraining already adapts the network to the satellite image domain, in contrast to the traditional pretraining that is usually performed on ImageNet. As a loss function, we use the binary cross-entropy loss.
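
The following PyTorch sketch illustrates this Siamese layout. The U-Net backbone is abstracted as an arbitrary module that maps an RGB image to 16 feature channels; the final 1\(\times\)1 convolution producing a single logit per pixel, like all names in the snippet, is our assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

class SiameseChangeNet(nn.Module):
    """Siamese change detector: shared backbone, concatenated features, CNN head."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 16):
        super().__init__()
        # Backbone, e.g., a U-Net with ResNet50 encoder and 16 output channels.
        # The same module (and thus the same weights) serves both branches.
        self.backbone = backbone
        # Head: two hidden 3x3 conv layers of depth 16, then one logit per pixel.
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, img_t1, img_t2):
        f1 = self.backbone(img_t1)          # weights shared across both branches
        f2 = self.backbone(img_t2)
        fused = torch.cat([f1, f2], dim=1)  # (B, 32, H, W)
        return self.head(fused)             # pixel-wise change logits

# Training with binary cross-entropy on the logits:
# loss = nn.BCEWithLogitsLoss()(model(img_t1, img_t2), change_mask)
```

In Stage 2, the single-image forecasting model reuses the same backbone, with a new head of identical structure operating on one set of feature maps.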

3.2 Stage 2: Forecasting

For the forecasting tasks, the pretrained U-Net backbone of Stage 1 is employed as the base. A new classification head with the same structure as in the change detection task is trained from scratch to perform binary change forecasting and time range forecasting, as described in Sects. 3.2.1 and 3.2.2.

3.2.1 Change Forecasting

The objective of this task is to predict whether a change will occur within a specific, fixed forecasting range. To this end, our model has a single output per pixel, namely a score \(\hat{p}_f\) that indicates how likely a building footprint change is to occur. To cover different forecasting ranges, we train nine separate classifiers for ranges of 1, 3, 6, 9, 12, 15, 18, 21, and 24 months, using in each case only the corresponding subset of the training data. We minimize the standard binary cross-entropy loss defined as

$$\begin{aligned} \mathcal {L}_{binary} = BCE(\hat{p}_f, y_c), \end{aligned}$$
(1)

where \(y_c\) is the binary “change” / “no change” label that results from comparing the built-up masks of two time stamps.

3.2.2 Time Range Forecasting

The objective of this model is to classify pixels as belonging to one of two categories: “early change” (i.e., changes that occur within 1–12 months) or “late change” (i.e., changes that occur within 13–24 months). We employ a time range forecasting model that has three logit outputs: \(q_e\) for “early”, \(q_l\) for “late”, and an auxiliary output \(q_0\) for the “no change” class. We use the auxiliary output to obtain additional supervision signals, as described below.

When directly applying multi-class cross-entropy to the problem of classifying pixels into the “no change”, “early change”, and “late change” categories, the “no change” class will typically dominate the learning process due to its much higher relative frequency. To address this issue, we split the problem into two sub-problems.

The first sub-problem is a binary decision between an “early change” (within 1–12 months) and a “late change” (within 13–24 months). The predicted score \(\hat{p}_e\) measures the likelihood that a given pixel will undergo an “early change” and is trained with a cross-entropy loss w.r.t. the ground truth (GT) label \(y_e\) (1 for early change, 0 for late change). Note that this loss is only calculated on pixels that exhibit a change as per the GT label.

$$\begin{aligned} \hat{p}_e = \frac{\exp (q_e)}{\exp (q_e) + \exp (q_l)} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{time} = BCE(\hat{p}_e, y_e). \end{aligned}$$
(3)

The second sub-problem is the 24-month version of the change forecasting task described in Sect. 3.2.1, since it aims to classify whether a change will occur within 24 months at all. The predicted score \(\hat{p}_c\) measures the likelihood that a given pixel undergoes a change within the maximum time interval. The loss is again a standard cross-entropy between the prediction and the binary “change” / “no change” label \(y_c\).

$$\begin{aligned} \hat{p}_c = \frac{\exp (q_e+q_l)}{\exp (q_e+q_l) + \exp (q_0)} \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{binary} = BCE(\hat{p}_c, y_c). \end{aligned}$$
(5)

Finally, we merge the two losses with the mixing weight \(\lambda\) to obtain the overall loss for the model. Empirically, \(\lambda \approx 10^3\) is a suitable value.

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{time} + \lambda \mathcal {L}_{binary} \end{aligned}$$
(6)

Splitting the task into two sub-problems, each with its own loss, results in a better-defined optimization problem. The time range forecasting term can only be calculated on a small portion of the available pixels (the ones that indeed exhibit a change as per the GT), but it provides an almost perfect class balance between “early” and “late” changes. The loss provided by the binary change label, on the other hand, is calculated over the complete set of pixels but suffers from class imbalance. Overall, the combination of the two loss terms improves the performance of the resulting model.
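
As a minimal sketch, the combined loss of Eqs. (2)–(6) can be computed from the three per-pixel logits as below. Note that the softmax of Eq. (2) reduces to a sigmoid of the logit difference, and Eq. (4) to a sigmoid of \(q_e + q_l - q_0\); the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def forecasting_loss(q_e, q_l, q_0, y_c, y_e, lam=1e3):
    """Combined loss of Eqs. (2)-(6) from three per-pixel logit maps.

    q_e, q_l, q_0: logits for "early", "late", and "no change".
    y_c: binary change label (all pixels); y_e: early(1)/late(0) label,
    only meaningful on pixels with y_c == 1.
    """
    # Eq. (2): p_e = exp(q_e) / (exp(q_e) + exp(q_l)) = sigmoid(q_e - q_l)
    # Eq. (4): p_c = sigmoid(q_e + q_l - q_0)

    # Eq. (5): binary change loss, evaluated on all pixels.
    loss_binary = F.binary_cross_entropy_with_logits(q_e + q_l - q_0, y_c)

    # Eq. (3): time range loss, restricted to pixels with a GT change.
    mask = y_c > 0.5
    if mask.any():
        loss_time = F.binary_cross_entropy_with_logits((q_e - q_l)[mask], y_e[mask])
    else:
        loss_time = q_e.new_zeros(())

    # Eq. (6): weighted combination; lambda ~ 1e3 was found to work well.
    return loss_time + lam * loss_binary
```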

3.3 Label Imbalance

Change detection, and by extension also change forecasting, typically suffers from severe label imbalance. We thus employ two balancing strategies, in both stages of the training workflow.

First, we oversample change examples. Image patches i are sampled with probability

$$\begin{aligned} p_{i} = \frac{a + N_{i}}{\sum _{k=1}^{M} (a + N_{k})}, \end{aligned}$$
(7)

where \(N_i\) is the number of changed pixels in patch i, M is the number of samples, and a is a distribution smoothing constant, empirically set to \(a \approx 50\).
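
In PyTorch, this sampling scheme maps directly onto a WeightedRandomSampler, which normalizes the given weights internally so that Eq. (7) is realized exactly; the helper below is an illustrative sketch with hypothetical names.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_change_sampler(changed_pixel_counts, a=50.0, num_samples=None):
    """Oversample patches with many changed pixels, following Eq. (7).

    changed_pixel_counts: list with N_i, the number of changed pixels in
    patch i. The smoothing constant a ensures that patches without any
    change (N_i == 0) are still drawn with non-zero probability.
    """
    weights = torch.tensor(changed_pixel_counts, dtype=torch.float) + a
    # WeightedRandomSampler draws index i with probability
    # weights[i] / weights.sum(), i.e., p_i = (a + N_i) / sum_k (a + N_k).
    return WeightedRandomSampler(weights,
                                 num_samples=num_samples or len(weights),
                                 replacement=True)
```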

Second, we noticed that thresholding the scores at 0.5 (respectively at 0.33 for the 3-class time range forecasting) results in a poor precision-recall trade-off with a heavy bias towards the majority class “no change” (we further elaborate on this effect in Sect. 5). To counteract the bias, we determine the threshold in a data-driven manner: separately for every training batch, we find the threshold that maximizes the F1 score. To reduce the effect of stochasticity in the training mini-batches, we compute the final threshold value as a moving average over the last 500 training batches. Empirically, the approximation is very good: the discrepancy between the threshold estimated in this manner and the oracle threshold determined from the test set is vanishingly small.
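
A possible implementation of this moving-average threshold is sketched below; the candidate grid, window size handling, and class names are our own choices under the stated 500-batch window.

```python
from collections import deque
import numpy as np

class AdaptiveThreshold:
    """Track the F1-optimal decision threshold as a moving average over batches."""

    def __init__(self, window=500, candidates=np.linspace(0.01, 0.99, 99)):
        self.history = deque(maxlen=window)  # per-batch optimal thresholds
        self.candidates = candidates

    def update(self, scores, labels):
        """scores, labels: flat numpy arrays for one training batch."""
        best_t, best_f1 = 0.5, -1.0
        for t in self.candidates:
            pred = scores >= t
            tp = np.sum(pred & (labels > 0.5))
            fp = np.sum(pred & (labels <= 0.5))
            fn = np.sum(~pred & (labels > 0.5))
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)  # F1 of the "change" class
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        self.history.append(best_t)
        return self.threshold

    @property
    def threshold(self):
        # Moving average over the stored window of per-batch optima.
        return float(np.mean(self.history)) if self.history else 0.5
```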

3.4 Implementation Details

Our method is implemented in PyTorch (Paszke et al. 2019) and is publicly available (see Footnote 1). To train the model, we use the Adam optimizer (Kingma and Ba 2014) with default parameters and a base learning rate of \(10^{-4}\). We augment the samples by random cropping, small affine transformations, mirroring, and color jittering. We use a batch size of 16, except for the largest forecasting ranges of 21 and 24 months, where we found a batch size of 4 to perform best, likely as a consequence of the small size of the corresponding data subsets.

For all models in Stage 2, we first freeze the backbone and train the head for 5000 batch iterations, then we reduce the learning rate by a factor of 10 to \(10^{-5}\) and train the entire model end-to-end.
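
This two-phase schedule can be sketched as follows, assuming a forecasting model with `backbone` and `head` submodules as in Stage 1; the helper names are hypothetical.

```python
import torch

def run_batches(model, loader, opt, criterion, n_batches, device="cuda"):
    """Train for a fixed number of batch iterations, re-iterating the loader."""
    seen = 0
    while seen < n_batches:
        for img, target in loader:
            if seen == n_batches:
                break
            opt.zero_grad()
            loss = criterion(model(img.to(device)), target.to(device))
            loss.backward()
            opt.step()
            seen += 1

def train_stage2(model, loader, device="cuda"):
    criterion = torch.nn.BCEWithLogitsLoss()

    # Phase 1: freeze the pretrained backbone and train only the new head
    # for 5000 batch iterations at the base learning rate of 1e-4.
    for p in model.backbone.parameters():
        p.requires_grad = False
    head_opt = torch.optim.Adam(model.head.parameters(), lr=1e-4)
    run_batches(model, loader, head_opt, criterion, 5000, device)

    # Phase 2: unfreeze everything and fine-tune end-to-end with the
    # learning rate reduced by a factor of 10.
    for p in model.backbone.parameters():
        p.requires_grad = True
    full_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    # ... continue training with full_opt until convergence ...
```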

4 Data and Experimental Setup

To validate our method, we make use of the SpaceNet7 dataset. It was published by Van Etten et al. (2021) and created for the building tracking competition featured at NeurIPS 2020. The dataset consists of 60 globally distributed labeled locations (see Fig. 2), each containing a series of 24 Planet Labs RGB satellite image mosaics of 4\(\times\)4 km\(^{2}\), with consecutive mosaics acquired one month apart. The ground sampling distance of the images is 4 m, and the total covered area is 960 km\(^{2}\). The dataset also contains a set of manually labeled building footprints, where each image of the time series is labeled individually.

Fig. 2 Spatial distribution of the SpaceNet7 dataset and train/validation/test split

For this work, we derive a dataset that considers image pairs from the same locations but at different time steps and obtain change masks by subtracting the corresponding built-up area masks from each other. We note that, for simplicity, we have omitted cases where buildings have been removed: our goal was a dataset that showcases urban development, whereas permanent building removal is far less common and involves completely different visual cues. Moreover, during visual inspection, we found that many of the apparent destruction labels were actually caused by misalignment errors between the manually digitized footprints at different times, rather than by actual building destruction. Using this approach, we obtain about 16,000 unique image pairs, which we further split into 264,000 non-overlapping pairs of training patches of size 224\(\times\)224 pixels. For the task of change forecasting, however, one implicitly defines the forecasting range by choosing a training sub-dataset with a consistent forecasting range. For example, for the smallest possible range (i.e., one month), the subset of pairs that are one month apart contains 5000 samples, whereas for the largest forecasting range of 24 months, we only obtain 200 samples. Moreover, the dataset exhibits a severe label imbalance, which is quite common for change detection datasets (Daudt et al. 2019). The average fraction of changed pixels across all samples amounts to 0.3%, and only one in seven patches has >0.5% positive labels. Figure 3 summarizes the size and imbalance of the sub-datasets.
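
In code, this label derivation reduces to a mask difference that keeps only newly built-up pixels; the snippet below is an illustrative sketch.

```python
import numpy as np

def change_mask(builtup_t1: np.ndarray, builtup_t2: np.ndarray) -> np.ndarray:
    """Binary change mask from two built-up masks of the same location.

    Only newly built-up pixels count as change; removals (built-up at t1
    but not at t2) are deliberately ignored, since many such labels stem
    from misaligned footprint digitizations rather than actual demolitions.
    """
    return builtup_t2.astype(bool) & ~builtup_t1.astype(bool)
```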

To ensure a representative and balanced evaluation, we split the dataset stratified with respect to the individual continents, resulting in a training set with 42 locations (70%), a validation set for parameter tuning with 6 locations (10%), and a test set for the final metrics with 12 locations (20%). The exact division of the geographical areas is illustrated in Fig. 2.

Fig. 3 Comparison between the number of available samples and the balance of the dataset, depending on the size of the chosen forecasting range

5 Results and Discussion

5.1 Change Forecasting

Fig. 4 Performance of the change detection and change forecasting models in terms of F1 score of the foreground class

In Fig. 4, we plot the F1 score of our method [denoted as “Forecasting (ours)”] w.r.t. the forecasting range and compare it to a baseline pretrained in single-image mode on ImageNet [“Forecasting (ImageNet)”] rather than on the satellite image change detection task. Moreover, we also provide the change detection performance trained from scratch [“Detection (vanilla)”] and pretrained on ImageNet [“Detection (ImageNet)”] as an upper bound for what the forecasting model can be expected to achieve: it would be unreasonable to expect the model to forecast changes from a single image better than it can detect changes when a second image is also available. Note that there is only one change detection model for all ranges, whereas separate change forecasting models are fine-tuned for the different prediction ranges.

Our model exhibits a consistent gain of 2–3 percentage points over the baseline. The advantage is most pronounced for the 1-month range, where our model outperforms the baseline with an F1 score of 8.0% vs. 1.1%. This is a particularly challenging forecasting range for learned models because it suffers from the most severe label imbalance: there are very few positive pixels, as within 1 month only few construction projects proceed to a point where a building is clearly recognizable. It appears that a better initialization of the network weights yields higher robustness against such extreme label distributions, such that the forecasting performance actually matches that of the change detection model. With increasing prediction range, the changes become more substantial, meaning that the detection task gets easier, whereas forecasting in the absence of a second image remains equally hard. Consequently, the change detector benefits from less imbalanced labels and improves significantly as the temporal range grows, while the performance of change forecasting remains at a respectable 10% across most of the range, increasing to 15% for the 24-month range.

Fig. 5 Precision and recall of the foreground class of our change forecasting model w.r.t. the prediction range. The shown numbers correspond to the F1 score of the “ours” model in Fig. 4

Fig. 6 Precision-recall curve for the change forecast model with time horizon 24 months

Fig. 7 Comparison of model prediction for the 24 month prediction range and the corresponding true change, overlaid on the panchromatic input image. Even with a single input image, the model finds many locations of building changes in the 24 months prediction range

Fig. 8 Typical failure cases: a significant amount of false positives are caused by areas that resemble construction sites (orange markers). Understandably, our model struggles when there is no indication of imminent construction at all—producing false negatives (blue marker). Moreover, our model fails to detect small, but distributed changes (yellow markers). Samples are from the 24 months prediction range

In Fig. 5, we show the precision and recall scores that correspond to the F1 scores discussed above. The trade-off between precision and recall for each interval is influenced by the nature of the individual sub-datasets and by the classification threshold. We hypothesize that the observed trend of higher precision for the first 12 months and higher recall for the second 12 months may be attributed to the reduced class imbalance at longer forecasting ranges. Under the assumption that the model focuses to a significant degree on detecting construction sites, it seems reasonable that the recall would gradually increase: construction sites foreshadow urban changes, and with increasing forecasting range these sites are more likely to reach a point where new buildings have been erected, leading to more positive labels and thus higher recall.

Moreover, we provide the precision-recall trade-off curve in Fig. 6 for the classifier fine-tuned to the 24-month prediction range. The curve exhibits a bias toward recall over precision: a significant amount of recall must be relinquished to push precision beyond 20%. This may indicate that in some cases the image evidence is sufficient to anticipate an imminent change, but not to localize it. Intuitively, this makes sense, as grading and earthworks in early stages do reveal the intention to construct, but not the locations of the individual buildings within the plot.

Figure 7 illustrates successful predictions by our model. The model tends to identify the rough location of future changes correctly, mostly on the basis of detecting construction sites. The shape of the model predictions generally does not align exactly with the ground truth. This is not surprising given the inherent ambiguity of the task—from the early earthworks and preparations it is not possible to determine the precise outlines of the future buildings. Moreover, CNNs by construction tend to produce blurry outputs in the presence of uncertainty. Fortunately, knowing the location and the approximate extent of the land cover change is sufficient for many downstream tasks.

In Fig. 8 we display typical failure cases, which further help to understand which visual cues the model relies on for its predictions. It is apparent that the model anticipates new buildings at early-stage construction sites, but it seems to also have acquired a rudimentary understanding of urban development and sprawl, as it tends to predict the construction of new buildings in cluttered or empty wastelands that lie in the vicinity of existing buildings.

5.2 Time Range Forecasting

To isolate the time range prediction from the binary change forecasting, we restrict the following evaluation to pixels that do exhibit a change according to the ground truth labels. In Table 1, we present a direct comparison of our pretraining method with standard ImageNet pretraining in terms of accuracy (Acc) and average F1 score (aF1). The results show that our custom pretraining approach indeed improves performance by 3% in both accuracy and F1 score, further supporting the effectiveness of the proposed methodology.

We present the confusion matrix for time range forecasting in Fig. 9. It shows that our setup makes it possible, in principle, to classify future change events into a group that will happen sooner and another that will only happen later. We note that the “early” class exhibits a precision of 60.0%, while the precision of the “late” class amounts to 68.2%, suggesting that later changes might be easier to detect than earlier ones. Table 2 presents a comparison of our approach to models specifically trained for time range forecasting and change forecasting, respectively. For the time range forecasting, we use the same evaluation procedure as before, i.e., restricting the evaluation to pixels that do exhibit a change according to the ground truth labels. Moreover, we report the accuracy and average F1 score over the “early” and “late” changes. Additionally, we display the F1 score, precision (Pre), and recall (Rec) for the “change” class of the change forecasting task. The results indicate that our multi-task approach is beneficial for the time range task, but not for binary change forecasting, where it trades recall for precision.

Table 1 Comparison of our pretraining approach against the ImageNet pretraining approach on the time range forecasting task
Fig. 9 The confusion matrix of our approach for time range forecasting with classes early changes (1–12 months) and late changes (13–24 months), which equates to an accuracy of 64.0%. The precision for the “early” class amounts to 60.0%, while the one for the “late” class amounts to 68.2%

Additionally, we present an empirical examination of the mixing weight \(\lambda\) in Table 3. The analysis shows that the model performs best at \(\lambda = 1000\), but breaks down for values <10 and >1000.

Table 2 Combining both loss functions boosts the time range forecasting task, while it deteriorates the performance in the change forecasting task
Table 3 Analysis of the mixing weight parameter \(\lambda\)

6 Conclusion

The main contribution of this work is a model to forecast where new buildings are likely to appear in the near future. Our goal has been to present a contribution to this little-explored topic in the light of modern deep learning technology. Besides setting a first baseline for change forecasting with deep convolutional networks, we have designed a 2-stage transfer learning procedure that employs change detection from paired images as a proxy task for learning features tailored to the analysis of high-resolution satellite images of urban and periurban regions. We have shown that such a pretraining improves change forecasting across a range of time horizons and that it is particularly helpful for a short horizon of 1 month, where the imbalance between unchanged and changed areas is particularly extreme. Moreover, we have also shown that it is possible, to some degree, to forecast how far into the future a change is going to happen.

Clearly, the presented approach does not perfectly solve the problem, mainly because forecasting a future construction event from a single image is an ill-posed and very challenging problem. While there obviously are visual cues for future construction activity, none of these cues are guaranteed or unambiguous. For instance, earthworks may point at future construction, but they can also take place for other reasons, e.g., landscaping. Furthermore, if a new estate will be constructed, there may already be signs like access roads or earthworks 2 years before, but there could also still be grassland or even agricultural fields. Besides these conceptual limits, there are also technical challenges, like the difficulty of obtaining a large enough foreground set for comparatively infrequent events such as new construction, and the associated imbalance of the available labels.

We consider our work as a first attempt and hope that it may encourage further research and development toward more powerful and sophisticated forecasting methods. While we have tried to pioneer the use of deep learning for this type of forecasting, our standard convolutional design only scratches the surface of what is possible. Especially if multiple pre-event images are available, it would seem natural to explicitly model the temporal evolution of urban development—for instance with recurrent or attention-based architectures—to better exploit the temporal characteristics of the data. One way to overcome the scarcity of data may also be to synthetically introduce changes in images if one manages to bridge the domain gap between real and synthetic examples. Going for longer-term forecasts significantly beyond the next 2 years is a formidable challenge in terms of both data and methods, but would open up opportunities for a whole new set of applications such as long-term infrastructure planning tasks.