Introduction

Historical maps contain abundant geospatial information, such as roads, vegetation, water systems, and settlements. Before the development of remote sensing technology, historical maps were the primary means of recording geographical features over time. Roads, as important components of transportation infrastructure, effectively describe the distribution of human activity areas. Extracting roads from historical maps is therefore highly important for studying the evolution of road networks (Zhang et al. 2021), traffic accessibility (Daniel et al. 2022), geospatial registration (Wu et al. 2022a), and other topics. With the increasing size, quantity, and complexity of scanned historical maps, manual road extraction is very time-consuming (Chiang et al. 2013), and it is necessary to exploit map element characteristics (such as colour and shape) to recognize roads automatically. However, owing to variations in map quality and preservation, historical maps may suffer from damage, fading, or deformation, leaving road representations missing or unclear. In addition, historical maps usually adopt cartographic techniques and road symbol styles that differ from those of modern maps, which increases the difficulty of interpretation and makes the automatic extraction of geographical features from historical maps challenging.

Unlike photographic and painted images, maps have unique cartographic designs (Jia et al. 2024), and traditional image processing methods are difficult to apply directly to historical map images. Chiang et al. (2005) and Chiang and Knoblock (2009) introduced colour segmentation for road extraction from historical maps, preserving the informational integrity of the original roads and facilitating subsequent analysis and processing. However, the segmentation results depend heavily on the colour space and are susceptible to interference when road colours resemble those of other elements. Building on this, Chiang et al. (2011) employed line segment tracking to identify the geometric shapes of roads from historical maps, improving the automation and accuracy of road extraction; nevertheless, this technique still requires partial manual intervention and lacks an effective strategy for identifying dashed lines. Callier et al. (2012) also used linear feature detection to identify roads in historical maps, combining it with a region-growing algorithm to address missed dashed lines. Methods of this type are sensitive to noise and irregular textures, so omissions or false identifications may occur when image quality is poor. Moreover, straight line segments cannot fully represent curved roads, requiring additional curve detection and identification computations.

In recent years, CNN-based deep learning methods have been widely applied in image processing and have achieved remarkable results in extracting road information from various data sources (Li et al. 2022; Dai et al. 2023; Liu et al. 2023). Increasingly, researchers have begun applying deep learning methods to tasks related to historical maps (Wu et al. 2022b; Zhao et al. 2022; Martinez et al. 2023; Chen et al. 2024). However, deep learning requires a large quantity of labelled data for training, which poses significant challenges for using CNN models on historical maps. Therefore, improving the ability to extract geographic features from historical maps with limited labelled data has become an important research direction (Chiang et al. 2020).

To bridge the gap mentioned above, we propose an attention-based generative adversarial learning approach for road extraction from historical maps (AU3-GAN). Centred on generative adversarial networks (GAN) (Goodfellow et al. 2014), our method comprises a generator and a discriminator. The generator produces road samples based on the input historical map data, and through the “game” played between the generator and the discriminator, it becomes difficult for the discriminator to distinguish whether these samples originate from the real data distribution. During this process, we enhance UNet3+ (Huang et al. 2020) with an attention gate mechanism (Schlemper et al. 2019) and employ it as the generator, aiming to strengthen the sample generation capability and mitigate the effects of boundary blurring, thus improving the accuracy of road extraction from historical maps. Experimental results demonstrate that AU3-GAN outperforms previous CNN-based methods in road extraction precision from historical maps under limited training samples, providing valuable insights for road feature extraction from various map types.

The remainder of this paper is organized as follows. “Related Works” presents an overview of the research advancements related to the content of this paper. “Methods” introduces the method we designed. “Analysis of Experiments” discusses the experimental results and provides an analysis. In “Summary”, we summarize our research findings and future work plans.

Related Works

Road Extraction Using CNN

Recently, deep learning techniques, represented by CNN, have been successfully applied to extract road information from various data sources, continuously improving extraction accuracy. Wang et al. (2022) proposed a network structure based on dual decoders and introduced an attention mechanism to extract roads of different scales from high-resolution remote sensing images. Lu et al. (2022) designed a multitask road-extraction framework that simultaneously handles three typical tasks: road segmentation, centreline extraction, and edge detection. Wang et al. (2023) developed a framework for road semantic segmentation, geometric information extraction, and digital modelling based on light detection and ranging (LiDAR) data, improving the road semantic segmentation network to better distinguish road surfaces from other infrastructure. Additionally, Yang et al. (2022) combined aerial images with trajectory data to exploit multimodal knowledge for road extraction, and their method has been well validated in industrial applications.

Historical maps, as important data sources, have attracted increasing attention from researchers applying deep learning to related tasks. Can et al. (2021) tested the performance of UNet and ResNet for road extraction from historical maps; the experimental results showed that UNet is better suited to processing historical maps, and they noted that extraction accuracy could be further improved through data augmentation. Jiao et al. (2022, 2024) analysed the characteristics of black layer symbols in the Swiss Siegfried map and used UNet combined with contemporary vector data to generate training data for road extraction from historical maps; the results were better than those obtained with models trained solely on real data. Ekim et al. (2021) and Avcı et al. (2022) used UNet++ to extract roads from historical maps of Turkish regions during the Second World War, with the latter introducing an attention mechanism to achieve higher extraction accuracy than the former. These studies demonstrate the feasibility of U-shaped networks for automatically extracting roads from historical maps. However, in terms of full-scale feature fusion, both UNet and UNet++ lag behind UNet3+. Notably, UNet3+ not only maintains excellent performance but also reduces the number of network parameters, thereby enhancing computational efficiency and reducing the risk of overfitting (Huang et al. 2020).

Combining Data: GAN and Historical Maps

One limitation of deep CNNs is the high cost of producing training data, which poses a significant challenge when using CNN models on historical maps. One way to alleviate this issue is to fuse existing data to automatically generate training data. For instance, Li (2019) used open-source OpenStreetMap (OSM) data as input to a GAN model to generate realistic historical map images from existing geographical datasets. Christophe et al. (2022) converted remote sensing images into historical maps through map image style transfer, enabling feature sharing between different data types. Furthermore, Andrade and Fernandes (2022) focused on generating remote sensing images from historical maps as a data source. However, these studies only visually transform multisource data and historical maps without using the transformed results for further analysis.

Methods

Figure 1 illustrates the overall workflow of the proposed AU3-GAN for training and evaluating road extraction from historical maps. The process comprises three steps: (1) local Laplacian filtering (LLF)-based preprocessing to enhance image quality, (2) GAN optimization using the training samples and application of the optimized generator to extract roads from historical maps, and (3) quantification of the performance of the proposed method using common metrics.

Fig. 1 Overall workflow of AU3-GAN

Data Preprocessing

We used LLF to preprocess the original historical map images. Applying the Laplacian operator locally around each pixel allows edge detection, and the image is then filtered based on this edge information (Aubry et al. 2014). The method's high sensitivity to image edges effectively preserves and enhances edge information, resulting in clearer and sharper edges, and it also increases the contrast between bright and dark areas, reinforcing image features. Because the entire process operates locally on pixels, it adapts well to local variations in the image, reduces global computation, and improves processing efficiency. Figure 2 compares the preprocessed historical map image with the original image.

Fig. 2 Comparison of the data preprocessing results: a original historical map image and b after LLF preprocessing
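The full local Laplacian filter of Aubry et al. (2014) is not reproduced here; as a rough illustration of the edge-enhancement idea described above, the Python sketch below applies a plain Laplacian-based sharpening with OpenCV. The file names and the `strength` parameter are illustrative assumptions, not part of the paper.

```python
import cv2
import numpy as np

def laplacian_enhance(image_bgr, strength=0.6):
    """Rough edge-enhancement stand-in: subtract a scaled Laplacian response
    from the image so that edges appear sharper and local contrast increases."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    lap = cv2.Laplacian(gray, cv2.CV_32F, ksize=3)       # edge response (float32)
    lap_bgr = cv2.cvtColor(lap, cv2.COLOR_GRAY2BGR)      # replicate to 3 channels
    enhanced = image_bgr.astype(np.float32) - strength * lap_bgr
    return np.clip(enhanced, 0, 255).astype(np.uint8)

# Illustrative usage (the file names are hypothetical)
patch = cv2.imread("map_patch.png")
cv2.imwrite("map_patch_enhanced.png", laplacian_enhance(patch))
```

A full LLF implementation additionally adapts the amount of enhancement to the local intensity range, which is what makes it robust around strong edges.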

AU3-GAN Framework for Road Extraction

Figure 3 shows that the GAN architecture we employed consists of two subnetworks: the generator and the discriminator. In the first stage, we fix the discriminator and train the generator. With a stable discriminator, the generator continuously produces segmented images (“pseudo images”), and the discriminator judges whether each image was created by the generator. Because the generator performs weakly in the early training stages, the pseudo images are easily identified by the discriminator; as training progresses, the generator's performance improves continually. In the second stage, we train the discriminator with a fixed generator, aiming to improve its ability to accurately identify pseudo images by feeding both pseudo images and labelled images into the discriminator. By iteratively repeating these two stages, the performance of both the generator and the discriminator gradually improves, ultimately yielding a GAN capable of effectively extracting road features.

Fig. 3 General architecture of the GAN
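As an illustration of the alternating two-stage optimization described above, the following sketch shows one possible training skeleton in PyTorch notation (the paper's implementation used PaddlePaddle; the `g_step`/`d_step` callables, the optimizers, and the data loader are assumptions of this sketch, and the concrete losses are given in the next subsection).

```python
def train_gan(generator, discriminator, loader, opt_g, opt_d, g_step, d_step, epochs):
    """Alternating two-stage GAN optimization: each iteration first updates the
    generator with the discriminator frozen, then updates the discriminator with
    the generator held fixed."""
    for _ in range(epochs):
        for maps, labels in loader:
            # Stage 1: generator update; discriminator weights are held fixed.
            for p in discriminator.parameters():
                p.requires_grad_(False)
            g_step(generator, discriminator, opt_g, maps, labels)

            # Stage 2: discriminator update; generator outputs are detached inside
            # d_step so that no gradients flow back into the generator.
            for p in discriminator.parameters():
                p.requires_grad_(True)
            d_step(generator, discriminator, opt_d, maps, labels)
```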

We denote the generator as G, the discriminator as D, the sample data as x, and the corresponding label data as y. The GAN objective is a minimax game: the generator seeks to minimize the adversarial loss (\({\text{min}}_{G}\)), while the discriminator seeks to maximize it (\({\text{max}}_{D}\)). The GAN objective function for road extraction from historical maps can be written as Eq. (1):

$${L}_{\text{GAN}}\left(G,D\right)={E}_{x,y}\left[\text{log}D\left(x,y\right)\right]+{E}_{x}\left[\text{log}\left(1-D\left(x,G\left(x\right)\right)\right)\right]$$
(1)

Furthermore, we employ binary cross-entropy as the loss function, which measures the accuracy of the model’s predictions. A lower loss indicates that the model’s predictions are closer to the true labels, as shown in Eq. (2):

$${L}_{S}\left(G\right)={E}_{x,y}\left[-y\bullet \text{log}G\left(x\right)-\left(1-y\right)\bullet \text{log}\left(1-G\left(x\right)\right)\right]$$
(2)

The final objective function for road extraction from historical maps is formulated by combining Eqs. (1) and (2), as shown in Eq. (3):

$${G}^{*}={\text{argmin}}_{G}{\text{max}}_{D}{L}_{\text{GAN}}\left(G,D\right)+\lambda {L}_{S}\left(G\right)$$
(3)
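For concreteness, a minimal sketch of how Eqs. (1)–(3) translate into per-batch losses is given below, in PyTorch notation for illustration. The discriminator signature `D(x, y)` taking both the map image and a segmentation mask, and sigmoid outputs for both networks, are assumptions of this sketch.

```python
import torch

bce = torch.nn.BCELoss()  # both networks are assumed to end with a sigmoid

def generator_loss(discriminator, maps, fake, labels, lam=0.4):
    """Generator objective: adversarial term of Eq. (1) plus lambda times Eq. (2)."""
    d_fake = discriminator(maps, fake)               # D(x, G(x))
    adv = bce(d_fake, torch.ones_like(d_fake))       # push D(x, G(x)) towards "real"
    seg = bce(fake, labels)                          # binary cross-entropy L_S(G), Eq. (2)
    return adv + lam * seg                           # weighted sum as in Eq. (3)

def discriminator_loss(discriminator, maps, fake, labels):
    """Discriminator objective: maximize Eq. (1), i.e. score labels as real, G(x) as fake."""
    d_real = discriminator(maps, labels)             # D(x, y)
    d_fake = discriminator(maps, fake.detach())      # D(x, G(x)) with G held fixed
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
```

These two functions correspond to the `g_step`/`d_step` callables in the training skeleton above; the generator term uses the common non-saturating form (maximizing log D(x, G(x)) rather than minimizing log(1 − D(x, G(x)))).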

Figure 4 illustrates the detailed architecture of AU3-GAN. The generator produces road samples based on the input historical map data, aiming to prevent the discriminator from distinguishing whether these samples come from the real data distribution, whereas the discriminator aims to accurately determine whether a road sample originates from genuine labelled data or was generated by the generator network. The design of the generator is crucial for accurately extracting roads from historical maps. Drawing on current research in this field, we choose the lightweight and efficient UNet3+ as the underlying architecture. Through full-scale skip connections, UNet3+ combines high-level and low-level semantics from feature maps of different scales, enabling the network to capture richer contextual information and improve segmentation accuracy. Additionally, UNet3+ employs a deep supervision strategy, learning hierarchical representations from multiscale aggregated feature maps; this provides direct supervision to each decoder stage, facilitating the learning of better feature representations, accelerating convergence, and enhancing the generalization capability of the model.

Fig. 4 Detailed structures of generative and discriminative networks comprising the AU3-GAN
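To make the full-scale skip connections concrete, the following simplified PyTorch module sketches how a single UNet3+-style decoder stage can aggregate features from all scales. The channel widths, batch normalization, and bilinear resampling are assumptions of this sketch rather than the exact configuration shown in Fig. 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleDecoderBlock(nn.Module):
    """Simplified UNet3+-style decoder stage: features from every encoder scale (and
    from deeper decoder stages) are resized to this stage's resolution, reduced to a
    common channel width, concatenated, and fused by a convolution."""
    def __init__(self, in_channels_list, mid_channels=64):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=3, padding=1) for c in in_channels_list]
        )
        fused = mid_channels * len(in_channels_list)
        self.fuse = nn.Sequential(
            nn.Conv2d(fused, fused, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused),
            nn.ReLU(inplace=True),
        )

    def forward(self, features, target_size):
        # Resample every incoming feature map to this stage's spatial size,
        # project it to mid_channels, then concatenate and fuse.
        resized = [
            conv(F.interpolate(f, size=target_size, mode="bilinear", align_corners=False))
            for conv, f in zip(self.reduce, features)
        ]
        return self.fuse(torch.cat(resized, dim=1))
```

For example, with a four-level encoder producing feature maps of 64, 128, 256, and 512 channels, a decoder stage at a 64 × 64 resolution could be built as `FullScaleDecoderBlock([64, 128, 256, 512])` and called with those four feature maps and `target_size=(64, 64)`.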

Considering the characteristics of full-scale skip connections, we introduce an attention gate mechanism with a low computational burden into UNet3+ to focus on target elements and suppress irrelevant background, as shown in Fig. 5. We apply 2D convolutions to both the feature maps \({f}_{x}\) and the gating signals \({g}_{x}\) in the skip connections and add the results to generate an attention map. The spatial attention matrix is then obtained through a ReLU activation, a 2D convolution, and a sigmoid gating function. Finally, we multiply the feature maps by the attention matrix, assigning a coefficient to each pixel, thereby highlighting the focal information in the full-scale feature maps and improving the accuracy and stability of road identification.

Fig. 5 Structure of the attention module
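The attention gate described above can be written compactly as follows (a PyTorch sketch; 1 × 1 convolutions and equal spatial sizes for \({f}_{x}\) and \({g}_{x}\) are assumptions of this illustration).

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate in the spirit of Schlemper et al. (2019): the skip feature f_x is
    re-weighted by per-pixel coefficients computed from f_x and the gating signal g_x."""
    def __init__(self, f_channels, g_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(f_channels, inter_channels, kernel_size=1)  # conv on feature map
        self.phi = nn.Conv2d(g_channels, inter_channels, kernel_size=1)    # conv on gating signal
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)             # collapse to one map
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_x, g_x):
        # f_x and g_x are assumed to share the same spatial size here;
        # otherwise g_x must first be resampled to match f_x.
        attn = self.sigmoid(self.psi(self.relu(self.theta(f_x) + self.phi(g_x))))
        return f_x * attn   # per-pixel coefficient applied to the skip features
```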

Simultaneously, we design a fully convolutional network as the discriminator. It takes both the original map image and a segmented image, either generated by the generator or taken from the labels, as input and outputs a probability map of size \(H\times W\times 1\), in which each pixel represents the probability of being a real label, as shown in Fig. 4.
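A minimal sketch of such a fully convolutional discriminator is shown below; the number of layers, channel widths, and LeakyReLU activations are illustrative assumptions, while the input concatenation and the \(H\times W\times 1\) sigmoid output follow the description above.

```python
import torch
import torch.nn as nn

class FCNDiscriminator(nn.Module):
    """Fully convolutional discriminator sketch: the map image and a segmentation
    (generated or labelled) are concatenated and mapped to an H x W x 1 probability map."""
    def __init__(self, map_channels=3, seg_channels=1, base=64):
        super().__init__()
        layers, c_in = [], map_channels + seg_channels
        for c_out in (base, base * 2, base * 4):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            c_in = c_out
        layers += [nn.Conv2d(c_in, 1, kernel_size=3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, map_image, segmentation):
        # Per-pixel probability that the segmentation is a real label for this map image.
        return self.net(torch.cat([map_image, segmentation], dim=1))
```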

Performance Evaluation Metrics

Four metrics were used to evaluate the accuracy of the proposed method for road extraction from historical maps: precision, recall, F1 score, and mean intersection over union (MIoU). These metrics can be calculated from the number of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN), as shown in Eqs. (4)–(7):

$$\text{Precision}=\frac{\text{TP}}{\text{TP }+\text{ FP}}$$
(4)
$$\text{Recall}=\frac{\text{TP}}{\text{TP }+\text{ FN}}$$
(5)
$$F1=\frac{2 \times \text{ Precision }\times \text{ Recall}}{\text{Precision }+\text{ Recall}}$$
(6)
$$\text{MIoU}=\frac{1}{k}\sum_{i=0}^{k-1}\frac{{\text{TP}}_{i}}{{\text{TP}}_{i}+{\text{FP}}_{i}+{\text{FN}}_{i}}$$
(7)
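The metrics can be computed directly from per-class confusion counts, for example as in the NumPy sketch below. How precision, recall, and F1 are averaged over the road classes is not specified in the paper; macro-averaging is assumed here.

```python
import numpy as np

def evaluate(pred, label, num_classes):
    """Precision, recall, F1, and MIoU from per-class TP/FP/FN counts (Eqs. 4-7).
    `pred` and `label` are integer class maps of identical shape."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for c in range(num_classes):
        tp[c] = np.sum((pred == c) & (label == c))
        fp[c] = np.sum((pred == c) & (label != c))
        fn[c] = np.sum((pred != c) & (label == c))
    eps = 1e-9                                                  # guard against empty classes
    precision = np.mean(tp / (tp + fp + eps))                   # Eq. (4), macro-averaged
    recall = np.mean(tp / (tp + fn + eps))                      # Eq. (5), macro-averaged
    f1 = 2 * precision * recall / (precision + recall + eps)    # Eq. (6)
    miou = np.mean(tp / (tp + fp + fn + eps))                   # Eq. (7)
    return precision, recall, f1, miou
```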

Analysis of Experiments

Data

We used the Third Military Mapping Survey of Austria-Hungary (Can et al. 2021), which accurately documents the terrain of most of Central and Eastern Europe between 1884 and 1918, as our experimental data. This survey, produced for military applications, places particular emphasis on road delineation. The dataset contains 4676 map patches of 256 × 256 pixels with corresponding digitized road labels, covering seven road classes and one background class (denoted in black); Fig. 6 shows an example together with the colour and proportion of each road category. We randomly split the data into training, validation, and test sets at a ratio of 7:2:1, obtaining 3273 training, 935 validation, and 468 test patches, respectively.

Fig. 6 Dataset example, including the colour and proportional quantity of each road category
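A simple way to reproduce the 7:2:1 split is sketched below; with the 4676 patches of this dataset it yields 3273, 935, and 468 samples, matching the counts above (the random seed is an arbitrary choice).

```python
import random

def split_dataset(samples, seed=42):
    """Randomly split a list of samples into training, validation, and test subsets
    at a 7:2:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```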

Configuration

The test platform was a 64-bit Ubuntu 18.04 operating system equipped with eight GeForce RTX™ 2080 Ti GPUs (11 GB of VRAM each). The PaddlePaddle v2.3 framework was used to implement the algorithm. The batch size was set to 8 according to the characteristics of the GPUs, the initial learning rate was 0.0125, and stochastic gradient descent (SGD) was used as the optimizer with a momentum of 0.9 and a weight decay of 4 × 10⁻⁵; training was run for 10,000 iterations. We tested various values of λ in Eq. (3), with the results illustrated in Fig. 7, and ultimately set λ to 0.4.

Fig. 7 The impact of hyperparameter settings on experimental results
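For reference, the stated training configuration can be expressed as follows (shown in PyTorch for illustration; the original experiments were implemented in PaddlePaddle v2.3).

```python
import torch

def build_optimizer(model):
    """SGD configuration reported above: initial learning rate 0.0125,
    momentum 0.9, weight decay 4e-5."""
    return torch.optim.SGD(
        model.parameters(),
        lr=0.0125,
        momentum=0.9,
        weight_decay=4e-5,
    )

# Batch size 8, 10,000 training iterations, and lambda = 0.4 in Eq. (3) complete the setup.
```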

Results and Discussion

Figure 8 visually compares the extraction results for different road types using UNet, UNet3+, and AU3-GAN. While all three methods recognize road features from historical maps, UNet exhibits significant limitations in handling complex backgrounds and fine boundaries because it integrates only deep-level features and therefore cannot exploit information across all scales. Although UNet3+ addresses some of these limitations by using full-scale skip connections and comprehensive deep supervision to fuse multiscale information, it remains sensitive to occlusions and disconnections, missing roads obscured by annotations or other map elements (indicated by the white solid boxes in Fig. 8). Additionally, UNet3+ yields numerous false positives (highlighted by the white dashed boxes in Fig. 8), which significantly affects its accuracy, and its extraction results in some road connection areas (i.e., road intersections) lack connectivity. In contrast to the U-shaped networks, AU3-GAN better preserves road information by focusing on important features and regions through its attention mechanism, which mitigates the impact of noise and outliers, enhances model performance, and leads to superior road extraction accuracy.

Fig. 8 Visualization of road extraction results from historical maps: a original historical map image, b labelled image, c UNet results, d UNet3+ results, and e AU3-GAN results

Table 1 presents the average training time and the average training time per epoch for UNet, UNet3+, and AU3-GAN under identical experimental conditions. As UNet3+ introduces additional modules and connections to the base UNet architecture, the increased network complexity necessitates more computational resources and time during the training process. AU3-GAN, which utilizes UNet3+ as a component of its generator, possesses a network structure with greater complexity and a higher number of parameters compared to UNet3+, thus requiring more time to complete the same training tasks. However, it is worth noting that training speed is not the sole metric for evaluating network performance. While AU3-GAN exhibits a slightly slower training speed, it offers superior segmentation accuracy.

Table 1 Comparison of training time of different methods

As shown in Fig. 9, a quantitative analysis using the evaluation metrics presented in the “Performance Evaluation Metrics” section demonstrates the advantages of AU3-GAN over UNet and UNet3+. Additionally, we compared the three methods using the MIoU metric across different numbers of training samples, as shown in Fig. 10. The results indicate that, under equal conditions, AU3-GAN achieves a higher MIoU with a smaller quantity of training data, whereas the other two methods require considerably more training samples to reach the same level. Furthermore, the relatively flat trend line of AU3-GAN in Fig. 10 indicates that it remains stable as the number of samples increases, in contrast to the significant fluctuations observed in the other two methods (especially UNet). Despite these advantages, AU3-GAN can still be affected by complex backgrounds and by confusion with visually similar elements such as contours, administrative boundaries, and water body boundaries.

Fig. 9 Comparative statistics of the evaluation metrics for road extraction methods from historical maps

Fig. 10 MIoU versus the number of training samples

Ablation Analysis

To further analyse the roles of the generator architecture and the attention module in AU3-GAN for road extraction from historical maps, we designed and implemented two additional experiments: (1) keeping UNet3+ as the core of the generator's feature extraction module, we removed the attention module from the original generator (AN-GAN), and (2) keeping the attention mechanism unchanged, we replaced the generator's feature extraction module with a ResNet-based (He et al. 2016) encoder–decoder (AR-GAN) consisting of four encoders and three decoders, as shown in Fig. 11. The quantitative evaluation results of the ablation experiments are presented in Table 2. The performance of AN-GAN without the attention module decreased to some extent, and all evaluation metrics were lower than those of AU3-GAN. AR-GAN, which cannot capture local information as effectively, also failed to achieve good extraction accuracy, suggesting that the strength of the generator's feature extraction capability largely determines the performance of the GAN architecture. Additionally, the attention mechanism allows the feature extraction network to focus better on target elements and to reduce interference from irrelevant elements.

Fig. 11 Structure of the encoder–decoder network

Table 2 Quantitative comparison of the evaluation metrics for the ablation experiments

Summary

Historical maps, as valuable spatiotemporal data resources, have become increasingly important for research on the automatic extraction of geographical features as the quantity of available data grows. In this paper, we propose a deep learning method for extracting roads from historical maps. Compared with existing research, our method offers the following innovations. (1) We analysed the shortcomings of existing deep learning methods for road extraction from historical maps, focusing on key issues such as data volume and precision, and designed targeted methodologies to address these challenges. (2) We apply a GAN to road extraction from historical maps, enabling the generation of sample data from limited samples; as training and optimization progress, the generated results become increasingly similar to real data. (3) We introduce an attention mechanism into UNet3+ to enhance its feature extraction capability and perform a quantitative performance analysis. The experimental results demonstrate that the proposed method achieves excellent extraction accuracy, providing a reference for extracting other types of geographical features.

In the future, we will continue to delve deeply into both data processing and methodological innovation, committed to enhancing the quality and breadth of historical map research. Our core objective is to integrate advanced cartographic knowledge and technical expertise to meticulously preprocess historical maps. During this process, we will focus on eliminating irrelevant features that may interfere with data analysis, thereby significantly improving data quality and ensuring the accuracy and reliability of our research outcomes. Concurrently, to validate the universality and adaptability of our approach, we plan to apply this methodology to a diverse array of historical maps, encompassing various styles. This includes, but is not limited to, maps from different cultural backgrounds, historical periods, and cartographic techniques. Through this cross-style and cross-temporal application, we aim to comprehensively test the effectiveness of our methods and tailor them to the specific characteristics of different types of maps, optimizing and refining our approach accordingly.