1 Introduction

In Taiwan, earthquakes are a common occurrence, with approximately three events each year exceeding magnitude 6.0. These significant seismic events have a profound impact on both the economy and public safety. A key aspect of seismic engineering is to evaluate earthquake risks and implement preventive strategies. This involves analyzing building characteristics such as structure, age, and height to determine their resilience to earthquakes. Consequently, compiling comprehensive building data is vital for effective earthquake risk assessment.

However, acquiring structural details of buildings poses a greater challenge compared to other types of building information. Structural components include the foundation, walls, columns, beams, and trusses, all bearing the load of the building. Manually annotating these details is an expert-intensive and time-consuming task. While governments often provide this data, the level of detail and completeness can vary, especially in developing countries, and may not always keep pace with new construction. Therefore, creating a system for swiftly identifying building structures is critical.

Remote sensing offers a rapid and broad approach to gathering building data, but it falls short in identifying specific structural information. Acevedo et al. (2017) combined satellite imagery with Google Street View to manually identify building heights and types, and to analyze roof shapes for wider data collection. Remote sensing is adept at capturing regional attributes such as building density but still depends on manual processes for identifying the specific structural details of individual buildings (Huo 2019).

With the rapid evolution of computer technology, machine learning has become increasingly prevalent in image recognition, extending to the identification of building structures. Techniques like SVM, used by researchers such as Pittore and Wieland (2012) in conjunction with satellite images, enable detailed recognition at the pixel level. However, to further refine and automate the identification of building structures, additional imagery detailing the exterior of buildings is necessary. Traditional machine learning approaches, when applied directly to architectural images, face numerous challenges, including varying angles and lighting conditions (Chen et al. 2017; Zhang 2018; Bilal and Hanif 2019).

Convolutional Neural Networks (CNNs) have become a cornerstone in building recognition, leveraging their capability to extract complex, high-dimensional features (Shi 2021). With the advancement of deep learning, CNNs have found applications in areas like object detection, semantic segmentation (Wei 2016), and image classification (Ezat et al. 2020). Since 2017, CNNs have also been integrated into building recognition models. For example, Yibo Liu and his team developed a deep learning-based framework for hierarchical building detection, using CNNs to identify buildings from remote sensing data (Liu 2018). Similarly, Kang and colleagues employed CNNs with OpenStreetMap data to identify eight building types in North America (Kang 2018). However, CNNs encounter unique challenges in building recognition. They often overemphasize background features in building images, where buildings constitute only about 10% of the image, diminishing the model’s accuracy. The architectural diversity and density in certain regions further impact CNNs’ efficacy.

CNN-based object detection has proven effective in identifying building structures, efficiently recognizing multiple buildings within a single image and swiftly predicting a range of building structures. This approach is especially beneficial for images containing numerous buildings. However, challenges persist, such as variations in foreground-background ratios, difficulties in detecting small targets, and occlusions.

In response to the incomplete building structure data in Taiwan, we propose using Google Street View and object detection technologies for rapid, automated structure recognition. The high density and variety of buildings in Taiwanese cities, which often lead to images containing multiple, variably sized, and occluded buildings, pose a significant challenge. To overcome these hurdles, we introduce the YOLOX-CS model. This model incorporates the Convolutional Block Attention Module (CBAM) (Woo 2018) to better detect smaller structures and uses Illustration Enhancement for data augmentation, improving the recognition of obstructed buildings.

Fig. 1 Example data from the Taipei City Historical Usage License Summary

2 Datasets

Google Street View offers comprehensive street-level imagery across the globe, complemented by an API (Google 2021) that allows developers to craft custom applications. This capability is pivotal for us to employ object detection models on Street View images, aiming to swiftly compile a structural map of buildings in Taiwan to facilitate earthquake risk assessment.

To enhance the accuracy of our model in discerning various building structures, it’s crucial to amass a diverse collection of building images for training purposes. Utilizing the Google Street View API, we acquire these images based on specific latitude and longitude coordinates. The subsequent phase involves meticulous manual filtering and annotation to categorize the structural types of the buildings. The coordinates and structural data for this endeavor are sourced from official records like “Taipei City Historical Usage License Summary” (Taipei City Government Open Data Platform 2020) and “Taichung City Buildings_WGS84” (Taichung City Government Open Data Platform 2019) data, provided by the government.

2.1 Taipei City Historical Usage License Summary

The “Taipei City Historical Usage License Summary,” curated by the Taipei City Construction Management Office, encompasses building data of Taipei City spanning from 1949 to 2019. This dataset, available in XML format, includes 24 fields and undergoes annual updates (Fig. 1). However, we had to discard some older records that no longer align with Taiwan’s current address system, ultimately retaining 60,387 valid data entries.

From these valid records, two critical fields were extracted: the structure of the building and its address. The structural data assists in further annotation tasks, while the address information is converted into geographical coordinates using the geolocation services provided by Taiwan Geospatial One Stop. During this conversion, we encountered several challenges: 7,520 entries could not be precisely geolocated (multiple possible coordinates), 6,347 entries lacked a definitive location, and 17,350 entries had duplicated locations. After filtering out these discrepancies, 29,170 entries remained viable for use.
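The address-to-coordinate conversion can be scripted as a simple batch loop. The sketch below is illustrative only: the endpoint URL, parameter names, and response fields are hypothetical placeholders rather than the actual Taiwan Geospatial One Stop interface, and it drops records that return zero or more than one candidate coordinate, mirroring the screening described above.

```python
import requests

# Hypothetical geocoding endpoint; the real TGOS API and its parameters differ.
GEOCODE_URL = "https://example.org/geocode"

def geocode_addresses(records, api_key):
    """Convert address strings to (lat, lng); drop ambiguous or missing hits."""
    located = []
    for rec in records:
        resp = requests.get(
            GEOCODE_URL,
            params={"address": rec["address"], "key": api_key},
            timeout=10,
        )
        candidates = resp.json().get("candidates", [])
        if len(candidates) != 1:   # skip unlocatable or ambiguous addresses
            continue
        rec["lat"] = candidates[0]["lat"]
        rec["lng"] = candidates[0]["lng"]
        located.append(rec)
    return located
```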

A closer analysis of these remaining entries revealed a highly skewed distribution of building types. Reinforced concrete structures dominate, comprising 82.4% of the dataset, followed by strengthened brick constructions at 11.9%, with other types constituting less than 2%. This imbalance necessitates the acquisition of additional data to enhance the diversity and balance of the dataset.

2.2 Taichung City Buildings_WGS84

The “Taichung City Buildings_WGS84” dataset, compiled by the Urban Development Bureau of Taichung City Government, offers comprehensive details on the region’s buildings and is updated annually. Presented in Shapefile format, each entry in this dataset includes 13 fields (Fig. 2). The description field, formatted in HTML, provides the essential building structure information we require. Additionally, the geometry field in GeoJSON MultiPolygon format delineates the polygonal shapes of the buildings. Given the straightforward nature of building shapes, we opted to use the centroids of these polygons to represent their geographical coordinates. The dataset encompasses a total of 300,183 records.
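Extracting a representative coordinate from each footprint is a short operation with a standard geometry library. The sketch below assumes the geometry field has already been parsed into a GeoJSON-style dictionary; shapely is used here purely for illustration and is not prescribed by the original workflow.

```python
from shapely.geometry import shape

def building_centroid(geojson_geometry):
    """Return (longitude, latitude) of the centroid of a (Multi)Polygon geometry."""
    geom = shape(geojson_geometry)   # accepts GeoJSON-style dicts
    c = geom.centroid
    return c.x, c.y                  # WGS84: x = longitude, y = latitude

# Example: a tiny rectangular footprint expressed as a MultiPolygon
footprint = {
    "type": "MultiPolygon",
    "coordinates": [[[[120.68, 24.14], [120.681, 24.14],
                      [120.681, 24.141], [120.68, 24.141], [120.68, 24.14]]]],
}
print(building_centroid(footprint))  # approximately (120.6805, 24.1405)
```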

This dataset showcases a varied array of building structures: reinforced concrete accounts for 64.9%, steel frame constructions for 22.1%, and brick constructions for 10.6%. After filtering out entries without specified building structures, we utilized the geometry and structure data to generate a distribution map of Taichung City’s building structures (Fig. 3).
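A distribution map like Fig. 3 can be produced by scattering the building centroids and colouring them by structure type. The minimal sketch below assumes each filtered record carries its centroid coordinates and a structure label; the field names are illustrative.

```python
import matplotlib.pyplot as plt

def plot_structure_map(records):
    """Scatter building centroids, coloured by structure type (illustrative field names)."""
    structure_types = sorted({r["structure"] for r in records})
    for t in structure_types:
        xs = [r["lng"] for r in records if r["structure"] == t]
        ys = [r["lat"] for r in records if r["structure"] == t]
        plt.scatter(xs, ys, s=1, label=t)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.legend(markerscale=10)
    plt.title("Taichung City building structure distribution")
    plt.show()
```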

Fig. 2 Example data from Taichung City Buildings_WGS84: (a) original Shapefile format; (b) example of data in the description field in HTML format

Fig. 3 Taichung City building structure distribution map. The X-axis represents longitude, and the Y-axis represents latitude

2.3 Image retrieval and annotation

Upon analyzing the two datasets, we identified that four building structures—Reinforced Concrete (RC), Steel Frame Reinforced Concrete (SRC), Brick Building (BB), and Steel Frame (SB)—are prevalent in Taiwan and represent common construction types. We thus targeted these structures for our model training. Representative examples of these structures are illustrated in Fig. 4.

We then utilized latitude and longitude data to acquire Street View images. Considering the costs associated with the Google Street View API, we employed a systematic approach to randomly extract data from these four structural types to minimize expenses and image requests. Our image capture settings included parameters like size = 640 x 640, field of view (fov) = 120, and pitch = 30, which we validated as optimal for building recognition. The photo orientation was determined based on the closest street to the target building, focusing on capturing the building itself. We discarded images that either did not feature the target building or where background structures predominated, resulting in a dataset of 6,394 records. The breakdown of these records is as follows: 1763 for Reinforced Concrete, 295 for Steel Frame Reinforced Concrete, 1392 for Brick Building, and 2944 for Steel Frame.
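Image retrieval with the parameters above can be scripted against the Street View Static API, whose documented parameters include size, location, heading, pitch, and fov. The sketch below is a minimal illustration: the heading value and API key handling are placeholders, since in our workflow the heading is derived from the nearest street segment rather than fixed.

```python
import requests

STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"

def fetch_street_view(lat, lng, heading, api_key, out_path):
    """Download one 640x640 Street View image aimed at the target building."""
    params = {
        "size": "640x640",        # image resolution used in this study
        "location": f"{lat},{lng}",
        "fov": 120,               # wide field of view, validated for buildings
        "pitch": 30,              # tilt upward to capture facades
        "heading": heading,       # derived from the nearest street in practice
        "key": api_key,
    }
    resp = requests.get(STREETVIEW_URL, params=params, timeout=10)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```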

In the final step, we handpicked images depicting the target buildings and annotated them with their corresponding building structure types, thereby finalizing our dataset for training the model.

Fig. 4 Representative images of the four building structure types in the dataset: (a) reinforced concrete; (b) steel frame reinforced concrete; (c) brick building; (d) steel building

3 YOLOX-CS

YOLO (You Only Look Once) (Redmon et al. 2016), a widely popular object detection model in recent years, has been acclaimed for its rapid prediction speed and high accuracy, making significant strides in various fields. Unlike pipelines built on region classification, YOLO frames detection as a single regression problem solved by one neural network, which improves training efficiency and the ability to detect a diverse range of objects. Notably, YOLO’s inference times on the order of milliseconds make it ideal for real-time image recognition, vastly broadening its application scope.

From its inception, YOLO has evolved through multiple iterations. Versions from YOLOv4 onward have shown outstanding performance in both speed and accuracy, with YOLOX (Ge 2021) particularly excelling. YOLOX amalgamates elements from YOLOv3 (Redmon and Farhadi 2018), YOLOv4 (Bochkovskiy et al. 2020), and YOLOv5 (Jocher et al. 2021), and incorporates techniques such as an anchor-free framework and Simplified Optimal Transport Assignment (SimOTA), optimizing the model’s effectiveness. Catering to different requirements, YOLOX offers a range of models (YOLOX-S, YOLOX-M, YOLOX-L, etc.) based on YOLOv5. For our research, we selected YOLOX-S as our principal training model for its excellent balance between speed and accuracy.

However, challenges arise due to the varied angles of street view images and the dense building structures in Taiwan. Target buildings in these images are frequently obscured by elements like trees and streetlights, resulting in incomplete and inconsistent building outlines. Moreover, the presence of multiple buildings in a single image, some occupying smaller portions with less prominent features, poses a risk of being overlooked by the model, thus affecting its accuracy.

To boost the model’s proficiency in detecting smaller targets, we introduced YOLOX-CS. This innovation integrates the CBAM module into the YOLOX-S network structure, preserving its performance while enhancing its capability to discern smaller objects. Figure 5 illustrates the network architecture of YOLOX-CS.

Fig. 5 YOLOX-CS network structure, with the CBAM module embedded in the original YOLOX-S

3.1 YOLOX-S

YOLOX-S, built upon YOLOv5-S, incorporates a standard network structure with three primary components: Cross Stage Partial Darknet (CSPDarknet), a Feature Pyramid Network (FPN), and YOLOHead. CSPDarknet serves as the primary feature extraction network in YOLOX-S, utilizing residual convolutions to boost accuracy effectively. This network also integrates the Cross Stage Partial Network (CSPNet) and Focus structures, along with a Spatial Pyramid Pooling (SPP) network, to broaden its receptive field. The FPN enhances feature extraction by fusing effective feature maps from the backbone, enriching the representation of multi-scale features. YOLOHead, serving as YOLOX-S’s detection head, employs a Decoupled Head architecture and introduces an anchor-free approach, improving both convergence speed and accuracy. The SimOTA method is employed to ensure the best possible prediction outcomes.
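To make the Focus structure mentioned above concrete, the sketch below shows the standard slice-and-concatenate operation used in YOLOv5/YOLOX-style backbones. It is a generic PyTorch illustration rather than the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-offset sub-images, concatenate them on the
    channel axis, then convolve. Halves spatial size while quadrupling input channels."""
    def __init__(self, in_channels, out_channels, ksize=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * 4, out_channels, ksize, stride=1,
                      padding=ksize // 2, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),   # YOLOX uses SiLU activations
        )

    def forward(self, x):
        patches = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)

# e.g. a 640x640 RGB image becomes a 320x320 feature map with 32 channels:
# out = Focus(3, 32)(torch.randn(1, 3, 640, 640))
```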

Compared to YOLOv5-S, YOLOX-S introduces several key enhancements:

1. Activation Function: YOLOX-S adopts the Sigmoid Linear Unit (SiLU) activation function in its backbone and neck, slightly slowing inference speed but providing enhanced non-linear fitting capabilities.

2. Decoupled Head: YOLOX-S’s YOLOHead uses a decoupled structure, separating the classification (Cls), bounding box localization (Reg), and foreground-background differentiation (Obj) branches, which are processed independently before being concatenated. This design expedites model convergence and boosts overall performance (a minimal sketch of this idea follows the list).

3. Loss Function: YOLOHead has three distinct branches, each with its specific loss function: Binary Cross Entropy (BCE) for the Cls and Obj branches and Intersection over Union (IoU) loss for the Reg branch. The final loss function is as follows:

    $$\begin{aligned} L = \frac{{L_{\text {cls}} + \text {reg}_{\text {weight}} \cdot L_{\text {reg}} + L_{\text {obj}}}}{{N_{\text {pos}}}} \end{aligned}$$
    (1)

    The aggregate loss scales the regression term by reg_weight and is averaged over the number of positive samples N_pos.

4. Anchor-free: After the Decoupled Head, YOLOX-S generates feature vectors that replace the original feature maps, markedly reducing the number of parameters. It also incorporates scale information from the original feature maps via downsampling.

5. SimOTA: This technique assigns labels to potential positive samples and pinpoints the predicted boxes closest to the label boxes. By transforming label assignment into an optimal transport problem, SimOTA enhances the detection algorithm’s inference speed and training efficiency without compromising accuracy, and it does so without requiring extra parameters.
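As referenced in item 2 above, the sketch below illustrates the decoupled-head idea: a shared stem followed by separate Cls, Reg, and Obj branches whose outputs are concatenated at the end. It is a simplified, single-scale PyTorch illustration under our own naming, not the authors’ exact YOLOX-S head.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Simplified decoupled head: classification, regression, and objectness
    branches share a common stem and are concatenated only at the output."""
    def __init__(self, in_channels, num_classes, width=128):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, kernel_size=1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(inplace=True),
            nn.Conv2d(width, num_classes, 1),      # class scores per cell
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(inplace=True),
        )
        self.reg_out = nn.Conv2d(width, 4, 1)      # box offsets (anchor-free)
        self.obj_out = nn.Conv2d(width, 1, 1)      # foreground/background score

    def forward(self, x):
        feat = self.stem(x)
        cls = self.cls_branch(feat)
        reg_feat = self.reg_branch(feat)
        reg = self.reg_out(reg_feat)
        obj = self.obj_out(reg_feat)
        return torch.cat([reg, obj, cls], dim=1)   # (B, 4 + 1 + num_classes, H, W)
```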

3.2 Convolutional block attention module

CBAM, a versatile and lightweight module, integrates seamlessly into various CNN architectures and effectively enhances a model’s focus on significant regions within the feature maps (Du et al. 2021; Ding and Zhang 2021). In our research, we incorporated CBAM into the YOLOX-S framework to bolster its capacity for detecting smaller architectural targets. Figure 6 depicts the structural layout of the CBAM module.

Fig. 6 Convolutional Block Attention Module (CBAM) structure

At its core, CBAM is composed of two distinct modules: the channel attention module and the spatial attention module. The channel attention module is designed to pinpoint and accentuate meaningful features, whereas the spatial attention module focuses on identifying the specific locations of these significant features within the feature map. The computation of the feature maps in Fig. 6 is expressed by formula (2). Given an input feature map \(F \in R^{C \times H \times W}\), CBAM sequentially infers a 1D channel attention map \(M_{c} \in R^{C \times 1 \times 1}\) and a 2D spatial attention map \(M_{s} \in R^{1 \times H \times W}\), where \(\otimes\) denotes element-wise multiplication.

$$\begin{aligned} \begin{gathered} {F}'= M_{c}(F)\otimes {F} \\ {F}''= M_{s}({F}')\otimes {F}' \end{gathered} \end{aligned}$$
(2)
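A compact PyTorch rendering of Eq. (2) is given below: channel attention from pooled descriptors passed through a shared MLP, followed by spatial attention computed from channel-wise pooled maps. This is a generic CBAM sketch following Woo et al. (2018), not the exact block used in YOLOX-CS.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Eq. (2)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: convolution over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # M_c(F): (B, C, 1, 1) channel attention map
        avg_c = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        max_c = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg_c + max_c)                # F' = M_c(F) ⊗ F
        # M_s(F'): (B, 1, H, W) spatial attention map
        avg_s = torch.mean(f, dim=1, keepdim=True)
        max_s = torch.amax(f, dim=1, keepdim=True)
        f = f * torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return f                                            # F'' = M_s(F') ⊗ F'
```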
Fig. 7 Example of the Illustration Enhancement method

4 Illustration enhancement

To boost the model’s proficiency in detecting buildings partially concealed by obstructions, we introduce the “Illustration Enhancement” data augmentation technique. This method adds obstructions to some training images through post-processing, synthesizing images of buildings obscured by these obstructions and thus improving the model’s ability to identify occluded objects. Our analysis of the YOLOX-S model’s recognition abilities revealed its shortcomings in identifying buildings obscured by trees; consequently, we use Illustration Enhancement to improve the detection of buildings masked by foliage. To assess its effectiveness in isolation, we apply no additional data augmentation to images that have already been processed with Illustration Enhancement.

Tree images sourced from textures.com (Texture 2022) were used as synthetic obstructions. These were then digitally composited with building images using Photoshop CS6. To ensure a balanced representation of obscured buildings across the four building structure types, we randomly selected images from each type for this compositing process. Figure 7 illustrates the before-and-after effects of applying Illustration Enhancement.
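Although the compositing in this study was performed manually in Photoshop CS6, the same effect can be scripted. The sketch below, assuming a tree cut-out with a transparent (RGBA) background, pastes it over a building photo at a chosen position and scale using Pillow; the file names and default placement are placeholders.

```python
from PIL import Image

def composite_occlusion(building_path, tree_path, out_path,
                        position=(120, 260), scale=0.6):
    """Paste a tree cut-out (RGBA) over a building photo to simulate occlusion."""
    building = Image.open(building_path).convert("RGB")
    tree = Image.open(tree_path).convert("RGBA")

    # Resize the obstruction relative to the street-view frame
    w, h = tree.size
    tree = tree.resize((int(w * scale), int(h * scale)))

    # The alpha channel of the tree image acts as the paste mask
    building.paste(tree, position, mask=tree)
    building.save(out_path)

# composite_occlusion("rc_building.jpg", "tree_cutout.png", "rc_building_occluded.jpg")
```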

5 Results

5.1 Training

Initially, we set out to evaluate the impact of our “Illustration Enhancement” data augmentation method. This involved applying different data augmentation techniques, namely Illustration Enhancement, horizontal flipping, random noise, and Gaussian blur, to the original dataset. We then assessed the efficacy of each method in terms of its ability to improve building structure recognition. Following this, our focus shifted to assessing the proposed YOLOX-CS model. This evaluation entailed training the YOLOv4, YOLOX-S, and YOLOX-CS models with both the original dataset (without Illustration Enhancement) and the dataset processed with Illustration Enhancement.

The metric chosen for evaluation was mAP (mean Average Precision), a standard measure for object detection models, calculated as the mean of the Average Precision (AP) over all classes. AP is computed with 11-point interpolation over the precision/recall curve, where \(P_{interp}(r)\) is the maximum precision observed at any recall greater than or equal to \(r\):

$$\begin{aligned} AP=\frac{1}{11} \sum _{r\in \left\{ 0,0.1,\ldots ,1 \right\} } P_{\text {interp}}(r) \end{aligned}$$
(3)
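For reference, the 11-point interpolation in Eq. (3) can be computed as below from matched precision/recall pairs for one class; this is a generic implementation of the metric, not the evaluation code used in the study.

```python
def eleven_point_ap(recalls, precisions):
    """Average interpolated precision at recall = 0.0, 0.1, ..., 1.0.

    P_interp(r) is the maximum precision over all points whose recall >= r.
    `recalls` and `precisions` are parallel lists from the detector's PR curve.
    """
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0

# Example on a toy PR curve:
# eleven_point_ap([0.2, 0.5, 0.8], [1.0, 0.8, 0.6])  ->  about 0.65
```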

All model training sessions were conducted on an NVIDIA GeForce RTX 2080 Ti graphics card with 11 GB of memory. For a detailed overview of our training methodologies, refer to Table 1.

Table 1 Detailed training strategies of the models

5.2 Comparison

The results in Table 2 reveal the distinct advantage of Illustration Enhancement in building structure identification compared to other common data augmentation techniques. From Table 3, it is evident that embedding CBAM in YOLOX-CS led to a marginal decrease in precision for the BB and SRC types, yet overall, YOLOX-CS outperformed YOLOX-S. This advantage is visually apparent in Fig. 8, where YOLOX-CS detected many small-scale architectural targets that eluded YOLOX-S. Significantly, as depicted in Fig. 9, all three models demonstrated improved mAP after being trained on the dataset augmented with Illustration Enhancement.

Table 2 Training results of YOLOX-CS using each data augmentation
Table 3 The precision of the three models
Fig. 8 Improved recognition of small objects after embedding the CBAM module: (a) buildings detected by YOLOX-S; (b) buildings detected by YOLOX-CS

Fig. 9 Enhanced recognition of buildings obstructed by trees after using Illustration Enhancement. From top to bottom, the building structures are RC, BB, and SRC: (a) YOLOX-CS prediction results on the original dataset; (b) YOLOX-CS prediction results on the Illustration Enhancement dataset

6 Discussion

The first part of this section provides a comprehensive analysis of the experimental outcomes associated with the Illustration Enhancement data augmentation technique and the YOLOX-CS object detection model, emphasizing the improvements these methods bring to the identification of architectural structures. The latter part addresses the inherent limitations of our automated approach to architectural structure recognition, which could affect the model’s performance and its applicability in real-world scenarios.

6.1 Analysis of empirical results

Primarily, the recognizability of buildings in images is often hindered by obstructions from other objects, a challenge more pronounced in the case of smaller structures. The Illustration Enhancement approach effectively mitigates this issue. In comparison with alternative data augmentation strategies, Illustration Enhancement demonstrates superior performance in recognizing obscured buildings. Although there’s a slight decrement in recognizing unobstructed buildings, the overall accuracy rate sees a notable improvement. Integrating Illustration Enhancement with other augmentation methods could potentially further amplify the model’s capabilities. The marked enhancement in recognition performance, as evidenced by training three distinct models on the Illustration Enhancement dataset, underscores its efficacy for Taiwanese architectural datasets.

In the specific context of small-scale building recognition, the incorporation of CBAM into YOLOX-S led to an improvement in the model's mean Average Precision (mAP), albeit with a decline in precision for certain categories (refer to Table 3). Further analysis revealed that YOLOX-CS identified numerous unmarked smaller buildings, contributing to a relative precision drop compared to YOLOv4. This issue likely stems from the inclusion of unmarked structures in the dataset generated from street view imagery. To address this, more stringent dataset processing is needed to minimize errors in marking and omissions, particularly in underrepresented categories.

Strategies for further enhancing model performance include expanding the architecture of the backbone network and tailoring models specifically for architectural structure recognition. Beyond architectural imagery, incorporating additional attributes such as building height and age could provide supplementary insights for the model’s predictions. The overarching aim is to develop a rapid, automated system for architectural structure recognition, enabling efficient assessment of building-related risks prior to seismic events.

6.2 Limitations and improvement

This study encounters certain constraints and outlines avenues for future enhancements:

1. Image Dependence: Our methodology, tailored for Google Street View imagery, requires specific standards for image resolution, field of view (fov), and angle of elevation. To achieve comparable results with this model, images must align closely with those from Taiwan’s Google Street View, including aspects such as exposure and shooting direction.

2. Challenges in Image Selection: In sourcing images through the Google Street View API, issues such as absence of the target building, incomplete structures, or excessive background buildings were common. This necessitated manual image selection, a process not in line with the automation ethos. Future enhancements might include capturing multiple images of a target building from varying horizontal angles, thereby raising the success rate of recognition.

3. Incorporating Additional Factors: While the study successfully established an architectural structure map, a comprehensive seismic risk assessment requires factoring in additional elements, such as building height and unauthorized extensions. Subsequent research could explore integration with remote sensing techniques to swiftly construct a detailed seismic risk assessment map that considers a broader range of influential factors.

7 Conclusion

The relentless evolution of object detection technology has unlocked new avenues for swiftly identifying architectural structures. However, current object detection models grapple with certain challenges, notably in processing small and occluded targets. In response to these challenges, our study introduces the Illustration Enhancement data augmentation method and the YOLOX-CS model. The former significantly bolsters the model’s capacity to identify occluded targets, while the latter not only retains superior performance but also enhances the detection of small-scale targets. The experimental findings indicate that both approaches exhibit commendable performance on the Taiwanese building dataset.

A key contribution of our research lies in the development and validation of novel methods tailored for architectural structure recognition, confirming the efficacy of both Illustration Enhancement and YOLOX-CS in boosting recognition precision. These advancements pave the way for the creation of a fast and accurate system for architectural structure identification, aimed at enhancing the assessment of potential seismic risks. Furthermore, we envisage the applicability of these methodologies in other regions with architectural styles akin to those in Taiwan, thus broadening their scope of practical deployment.