1 Introduction

Deterioration inevitably accumulates over the service life of bridges exposed to harsh environments, and bridge failures can cause considerable loss of life and property. Monitoring bridge condition and detecting damage are therefore essential to ensure serviceability and safety. Traditionally, visual inspection conducted by experienced inspectors has been the main method adopted for this task (Xu and Xia 2012). Nevertheless, visual inspection is labor-intensive, time-consuming, and subjective, and it cannot reflect changes in the actual structural condition in a timely manner (Sun et al 2020). Therefore, structural health monitoring (SHM) systems have been developed and installed on some bridges with the aim of detecting structural damage or degradation in time (Housner et al 1997).

SHM builds on sensing and communication technologies, and recent advances in both make it possible to acquire monitoring data at unprecedented speed and volume. Analyzing the accumulated monitoring data has naturally become a priority of SHM research. The methods developed for analyzing monitoring data fall into two categories: model-based methods and data-driven methods (Sun et al 2020). The former update key parameters of a finite-element model (FEM) of the undamaged bridge against the measurement data, and differences between the model predictions and the measurements indicate the existence of damage (Xiao et al 2015; Zhu et al 2015). However, this is a hard task because of the simplifying assumptions made when modeling bridge structures and the uncertainties in material and geometric properties (Sun et al 2020). Data-driven methods regard the task as a statistical pattern recognition problem (Farrar and Worden 2012) and have been applied extensively, but their complexity and computational requirements are generally of polynomial order in the data size (Sun et al 2020). Moreover, computer vision (CV) technologies are also used to detect local damage, such as cracks, spalling, delamination, and rust, and to extract global information, such as displacement, acceleration, and loads, from images or videos captured by cameras. CV technologies, however, are usually affected by disturbances caused by light, distortion, weather, and occlusion in the outdoor environment.

The emergence of machine learning (ML) provides a possible solution to the problems mentioned above. As a branch of artificial intelligence, ML aims to develop trainable algorithms that learn from data and make predictions on that basis (Pan et al 2017; Pan et al 2018). The artificial neural network (ANN) is a classical ML method and has been applied to civil engineering since 1989 (Adeli and Yeh 1989). However, using ML requires knowledge and experience in designing features for a specific SHM task, which may not be practical as the monitored systems become more complex (Zhao et al 2019). In recent years, along with significant improvements in network architectures and computing capacity, deep learning (DL), which aims to automatically extract features from raw data via stacked blocks of deep neural network (DNN) layers (Cha et al 2018; Mosalam et al 2019), has drawn researchers’ attention and has been successfully applied in various areas, including CV, natural language processing, and audio recognition. Each layer in a DL model learns a new representation of the data, so DL works as an end-to-end system that needs no human intervention in feature design, which makes DL-based SHM widely applicable with minimal knowledge of the specific features (Azimi et al 2020). Some DL models, such as the fully connected neural network (FCN), long short-term memory network (LSTM), convolutional neural network (CNN), autoencoder (AE), deep belief network (DBN), deep Boltzmann machine, and generative adversarial network (GAN), have already shown their reliability in analyzing vibration data. Efforts to improve the robustness and generalization of CV techniques using DL have also achieved good results, which accelerates the development of vision-based SHM.

Although there are several reviews of recent advances in SHM (Ahmed et al 2020; Azimi et al 2020; Jeong et al 2020; Sun et al 2020; Bao and Li 2021; Dong and Catbas 2021; Pal et al 2021; Sofi et al 2022; Zhang et al 2022a), this article focuses on the application of DL in bridge health monitoring over the last 4 years and tries to identify promising directions after summarizing current challenges and trends. The remainder of this paper is organized as follows: Sections 2 and 3 summarize general studies of DL in vibration-based and vision-based SHM, respectively. In Section 4, SHM systems in which DL has been successfully applied in practice are listed and discussed. Finally, the paper ends with a summary of current limitations and some directions that deserve attention and further research.

2 Vibration-based structural health monitoring

Vibration-based SHM has been investigated since the 1990s because of its advantages, including the global coverage of structural topology and the natural availability of vibration signals (Li et al 2015). All methods in this area are founded on the premise that any damage occurring in a structure will change its vibration signals (Xu 2018), which makes it feasible to identify and locate damage by monitoring structural vibration signals.

For data-driven methods in this area, damage identification can be treated as a statistical pattern recognition problem (Farrar and Worden 2012). Characteristic indexes are selected and extracted from the measured data, and the state of the structure is then assigned to the scenario whose index values are closest. These methods have been applied extensively, but some shortcomings should not be ignored. For instance, because of the computational burden of big data, traditional methods usually consider small datasets that are assumed to be sampled from a particular distribution (Sun et al 2020). Another drawback is that their complexity and memory requirements are generally of polynomial order in the data size (James et al 2011).

As an end-to-end approach, DL provides a powerful solution to these challenges and has gained much attention from the engineering community because of its excellent capability in extracting features from raw data. This section reviews the application of DL in data-driven SHM methods from the perspectives of data preprocessing, damage classification, and data novelty detection. Studies that exploit spatial information, which is essential but has not been emphasized before, are summarized at the end of this section.

2.1 Data preprocessing

2.1.1 Anomaly identification

At present, the various sensors in an SHM system are the main source of vibration data. However, sensor failure, transmission interruption, and similar faults inevitably corrupt the data, which seriously affects the results of data analysis. To detect structural damage and assess structural condition correctly, identifying and eliminating anomalies in the monitored signals is the first task.

Anomaly identification for vibration signals can be treated as time series classification, for which the effectiveness of the one-dimensional (1D) CNN has been demonstrated by some researchers (Jian et al 2021; Zhang and Lei 2021). Converting the time signals into images is an effective way to leverage two-dimensional (2D) CNNs for this task. Plotting the signal curves directly is the most frequently adopted method, while other feature engineering approaches, such as spectrogram analysis and the probability density function, have also been employed to enhance the network’s robustness (Jian et al 2021; Shajihan et al 2022). For example, Tang et al (2019) visualized time series data in the time and frequency domains and stacked the two plots into a single dual-channel image, which was then fed into a 2D CNN to classify data anomalies (see Fig. 1).

Fig. 1 The anomaly detection method proposed by Tang et al (2019)
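The idea of pairing a time-domain view with a frequency-domain view of the same signal can be sketched numerically. The snippet below is only an illustration under simplifying assumptions: it builds two normalized 1D channels (the raw waveform and its magnitude spectrum) rather than rendering plot images as in the cited work, and the function names `magnitude_spectrum`, `min_max_scale`, and `dual_channel` are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Naive DFT magnitude for the non-negative frequency bins."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        spectrum.append(abs(acc) / n)
    return spectrum


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dual_channel(signal):
    """Return [time channel, frequency channel] for one signal segment."""
    return [min_max_scale(signal), min_max_scale(magnitude_spectrum(signal))]


# Synthetic example: a sine wave with 4 cycles in 64 samples.
n = 64
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
time_ch, freq_ch = dual_channel(signal)
peak_bin = max(range(len(freq_ch)), key=freq_ch.__getitem__)
```

In a real pipeline the two channels would be rendered or resampled to a common image size before being stacked; that step is omitted here.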

It is noteworthy that imbalanced time series are common in practical engineering, as normal data always far outnumber abnormal data. It is difficult for a CNN to reach high classification accuracy in class-imbalanced situations since it is based on the class-balance hypothesis (Yin and Gai 2015). To overcome this problem, Liu et al (2022) developed a GAN- and CNN-based data anomaly detection framework, which includes three modules: (i) a three-channel input based on visualization, fast Fourier transform (FFT), and Gramian angular field of the time series signals; (ii) a GAN trained to extract features from normal samples; (iii) a CNN employed to distinguish the types of anomalies. Adopting the focal loss function is another way to mitigate the classification bias induced by class imbalance (Du et al 2022).
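To make the role of the focal loss concrete, the sketch below compares it with plain cross-entropy for a single sample. It uses the commonly cited form FL = -alpha * (1 - p)^gamma * log(p), where p is the predicted probability of the true class; the values gamma = 2 and alpha = 1 are illustrative assumptions, not the settings of the cited study.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one sample, given the probability of its true class."""
    p = min(max(p_true, 1e-12), 1.0)
    return -math.log(p)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled by (1 - p)^gamma, so easy samples count less."""
    p = min(max(p_true, 1e-12), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy, hard = 0.95, 0.30  # well-classified vs. poorly classified sample
ratio_easy = focal_loss(easy) / cross_entropy(easy)  # equals (1 - 0.95)^2
ratio_hard = focal_loss(hard) / cross_entropy(hard)  # equals (1 - 0.30)^2
```

With these settings the easy sample keeps only a small fraction of its cross-entropy loss while the hard sample keeps about half, which is how the loss shifts attention away from the abundant, easily classified normal data.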

Although satisfactory performance has been achieved in classifying anomaly patterns, differentiating sensor faults from structural damage had not been considered until Li et al (2021a) proposed an isolation strategy. In this strategy, a fully connected stateful LSTM network, a modification of the LSTM obtained by adding fully connected layers, was used to predict the acceleration signals of the selected sensor, and the residual between the predicted and measured values was regarded as an anomaly index. An anomaly occurring on all sensors of a substructure indicates the existence of structural damage; otherwise, a fault in one of the sensors is inferred.
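The decision rule at the end of this strategy can be written down in a few lines. The following is a minimal sketch of that rule only: the prediction network is not modeled, the per-sensor anomaly indexes are taken as given, and the threshold value and the function name `isolate` are assumptions made for illustration.

```python
def isolate(residuals, threshold):
    """Classify a substructure from per-sensor anomaly indexes.

    residuals: dict mapping sensor id to its anomaly index
               (for example, the RMS of the prediction residual).
    Returns (status, flagged_sensors).
    """
    flagged = {s for s, r in residuals.items() if r > threshold}
    if not flagged:
        return "normal", flagged
    if flagged == set(residuals):
        # Every sensor of the substructure deviates: attribute it to the structure.
        return "structural damage", flagged
    # Only some sensors deviate: attribute it to the sensors themselves.
    return "sensor fault", flagged


status_fault = isolate({"A1": 0.9, "A2": 0.1, "A3": 0.2}, threshold=0.5)
status_damage = isolate({"A1": 0.9, "A2": 0.8, "A3": 0.7}, threshold=0.5)
status_normal = isolate({"A1": 0.1, "A2": 0.1, "A3": 0.2}, threshold=0.5)
```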

2.1.2 Missing data recovery

Sensor failure and communication errors can lead to data loss, and anomalous signals are often treated as missing data as well; too much missing data, however, has an unfavorable influence on SHM results. Correlation analysis between different sensors, built with the partial least squares method (Lu et al 2017), nonparametric copulas (Chen et al 2019), and so on, is a frequently used way to recover missing data. The potential of DL in mining the relationships among inputs and making predictions has also led to its wide adoption for this task.

The strain data measured before the occurrence of data loss was converted into a grayscale image and then used to train a CNN so that the net could recover the strain responses of the failed sensors according to the remaining sensors’ data (Byung et al 2020). Fan et al (2019) proposed a novel CNN architecture with bottleneck architecture and skip connection to construct the nonlinear relationships between the incomplete signal and the complete true signal and proved its outstanding capability for data recovery, even when the signals have severe data loss ratios up to 90%. Liu et al (2020) verified the accuracy of LSTM in recovering temperature data and demonstrated that incorporating more intact sensor data and selecting the sensor data highly correlating with the missing data as the input would further improve the recovery accuracy. Li et al (2021b) proposed a “divide and conquer” strategy for this mission. The core concept of the strategy was the prediction of the subsequences of the measured data, which were decomposed by empirical mode decomposition (EMD), rather than directly predicting the time series, as the decomposition could assist in the modeling of the irregular periodic changes of the measured signals using LSTM.
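The observation that inputs highly correlated with the missing channel improve recovery suggests a simple pre-selection step. The sketch below ranks candidate sensors by the absolute Pearson correlation with the target channel over a period when all sensors were intact. It is an assumption-laden illustration: the sensor names and readings are invented, and ranking by absolute correlation is one possible selection rule, not necessarily the one used in the cited study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_inputs(target, candidates, top_k):
    """Return the top_k candidate names by absolute correlation with target."""
    scores = {name: abs(pearson(target, series))
              for name, series in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical temperature readings recorded while all sensors were intact.
target = [20.1, 21.4, 23.0, 24.2, 22.8, 21.0]
candidates = {
    "T2": [19.8, 21.0, 22.7, 24.0, 22.5, 20.7],  # tracks the target
    "T3": [5.0, 5.1, 4.9, 5.0, 5.2, 5.0],        # nearly flat
    "T4": [30.0, 28.9, 27.1, 26.0, 27.3, 29.2],  # moves opposite to the target
}
selected = rank_inputs(target, candidates, top_k=2)
```

Using the absolute value keeps strongly anti-correlated sensors, which are just as informative for a learned predictor as positively correlated ones.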

Beyond the spatial correlations among sensors, Jeong et al (2019) also considered temporal correlations and developed a bidirectional recurrent neural network that uses spatiotemporal correlations to recover missing data. Jiang et al (2021a) likewise used a GAN to compute the missing data directly from the remaining observed data while accounting for the spatial-temporal relationship.

2.2 Damage scenario classification

Simulating damage scenarios composed of different locations and discrete levels in an FEM or on experimental structures is currently the most frequent way to generate training samples, on which various DL models have been trained to classify signals from a bridge in an unknown state. Ibrahim et al (2020) investigated the impact of noise on the performance of several ML algorithms in classifying structural damage severity from acceleration data and demonstrated that CNN could resist noise better than the K-nearest neighbor, support vector machine, and traditional high-pass filter noise cancellation methods.

Integrating vibration signals from multiple sensors into a multi-channel time sequence and segmenting it with a sliding window is one way to leverage 2D CNNs in this area (Khodabandehlou et al 2019; Lee et al 2020a). Teng et al (2020) combined the acceleration signals collected by 13 accelerometers and fed them into a 2D CNN to conduct structural damage identification (see Fig. 2). The CNN trained on finite element analysis data reached 94% accuracy for damage in the numerical model and 90% for damage in the real steel frame.

Fig. 2 The time series integration scheme adopted by Teng et al (2020)
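The sliding-window segmentation of a multi-channel sequence can be sketched as follows. The window length, stride, and toy data are assumptions chosen for readability; each resulting window is a sensors-by-time matrix that could serve as one 2D input sample.

```python
def sliding_windows(channels, window, stride):
    """Cut equally long per-sensor sequences into overlapping 2D samples.

    channels: list of sequences, one per sensor.
    Returns a list of windows, each a list of per-sensor slices
    (a sensors x window matrix).
    """
    length = len(channels[0])
    if any(len(c) != length for c in channels):
        raise ValueError("all channels must have the same length")
    return [[c[start:start + window] for c in channels]
            for start in range(0, length - window + 1, stride)]


# Toy data: 3 sensors, 10 time steps; value = sensor * 100 + time index.
channels = [[s * 100 + t for t in range(10)] for s in range(3)]
windows = sliding_windows(channels, window=4, stride=2)
```

A stride smaller than the window length produces overlapping samples, which enlarges the training set at the cost of correlated examples.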

Encoding time series into images through various algorithms, such as the wavelet transform (WT) (Mangalathu and Jeon 2020), continuous wavelet transform (Chen et al 2021), FFT (He et al 2021a), and Fourier amplitude spectra (Duan et al 2019), is another way to employ 2D CNNs. Mantawy and Mantawy (2022) encoded time-series data, including accelerations, drift ratios, and both, into images using three approaches: the Gramian angular summation field, the Gramian angular difference field, and the Markov transition field (MTF) (see Fig. 3). The comparison showed that the CNN trained on MTF-encoded images reached 100% accuracy in the training phase and more than 94% in the testing phase.

Fig. 3 Time series encoding scheme adopted by Mantawy and Mantawy (2022)
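As an illustration of one of these encodings, the Gramian angular summation field (GASF) can be computed in a few lines. The sketch follows the usual definition (rescale the series to [-1, 1], map each value to an angle via arccos, and fill the image with cos(phi_i + phi_j)); it omits the downsampling step that is normally applied to long signals, and the function name `gasf` is hypothetical.

```python
import math


def gasf(series):
    """Gramian angular summation field of a 1D series (list of floats)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every value.
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    # Pixel (i, j) encodes the summed angle of time steps i and j.
    return [[math.cos(a + b) for b in phi] for a in phi]


image = gasf([0.0, 1.0, 2.0, 3.0, 4.0])
```

The diagonal of the image satisfies cos(2 * phi_i) = 2 * x_i^2 - 1 for the rescaled value x_i, so the original series can be recovered up to sign, which is one reason this encoding is popular for feeding time series to image classifiers.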

Considering that the overall dynamic vibration of a structure may be insensitive to local damage, He et al (2021b) employed the wavelet packet transform to extract more sensitive damage features from structural acceleration signals and used recurrence analysis to characterize the periodicity, non-stationarity, and chaos of the signals, the result of which can be visualized as a recurrence graph. The recurrence graph was then fed into a CNN to classify the structural damage conditions. Compared with traditional methods, the proposed method showed excellent accuracy in identifying the location and degree of minor damage.
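A recurrence graph of the kind described above can be illustrated with a basic thresholded recurrence matrix built from a time-delay embedding. This is a generic sketch of recurrence analysis, not the cited pipeline: the wavelet packet step is omitted, and the embedding dimension, delay, and threshold `eps` are illustrative assumptions.

```python
import math


def recurrence_matrix(series, dim, delay, eps):
    """Binary recurrence matrix from a time-delay embedding.

    Entry (i, j) is 1 when embedded states i and j lie within eps of each other.
    """
    n = len(series) - (dim - 1) * delay
    states = [[series[i + k * delay] for k in range(dim)] for i in range(n)]

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    return [[1 if dist(a, b) <= eps else 0 for b in states] for a in states]


# A sine wave with a period of 8 samples: states recur every 8 steps.
series = [math.sin(2 * math.pi * t / 8) for t in range(32)]
R = recurrence_matrix(series, dim=2, delay=2, eps=0.05)
```

For this periodic signal the matrix shows diagonal lines spaced 8 samples apart; irregular or broken line patterns are what a CNN would pick up as signs of non-stationary or chaotic behavior.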

Most of the methods mentioned above rely on the stationarity assumption, which fails in practice since non-stationary ambient excitations are inevitable. Li et al (2021c) proposed a new recurrence plot, named the un-threshold assembled recurrence distance matrix, to reveal the intrinsic dynamic characteristics of the structure (see Fig. 4). Unlike the traditional single-label model, which regards each combination of damage location and level as one target class, they developed a multi-label CNN to decouple the identification of damage location from that of damage level. Every sub-branch of the network was trained on an independent dataset to evaluate the damage level at one location, and the damage location was then identified by fusing the information from all sub-branches.

Fig. 4 Flowchart of structural damage identification using the multi-label CNN model (Li et al 2021c)

Inspired by the excellent performance of 2D CNNs shown above, 1D CNNs have also been employed to detect tiny local changes in structural stiffness and mass from the acceleration records of a single sensor, with very good results (Zhang et al 2019a; Sharma and Sen 2020). For example, Teng et al (2021) trained seven 1D CNNs, each on the acceleration signals collected by its corresponding sensor, and fused their classification results at the decision level to obtain the integrated detection result. Compared with data-level fusion, in which all acceleration signals were integrated into a multi-channel time sequence, decision-level fusion improved the classification accuracy by 10% and 16–30% for the numerical and experimental models, respectively.
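Decision-level fusion can take several forms, and the text above does not specify which rule was used. As one plausible sketch, the snippet below averages the class-probability vectors produced by independent per-sensor classifiers and picks the class with the highest mean; the probabilities are invented for illustration.

```python
def fuse_decisions(prob_per_model):
    """Average per-model class probabilities and return (label, mean_probs)."""
    n_models = len(prob_per_model)
    n_classes = len(prob_per_model[0])
    mean = [sum(p[c] for p in prob_per_model) / n_models
            for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean


# Hypothetical outputs of three per-sensor classifiers over three damage classes.
probs = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],  # this sensor alone would vote for class 1
    [0.60, 0.30, 0.10],
]
label, mean = fuse_decisions(probs)
```

Averaging lets confident sensors outweigh an uncertain dissenting one, which is the basic intuition behind fusing at the decision level rather than concatenating raw signals.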

Besides CNN, other DL models, such as the deep residual network (Alazzawi and Wang 2022), ANN (Hormozabad and Soto 2021), recurrent neural network (RNN) (Jena and Parhi 2020), and stacked autoencoder (SAE) (Silva et al 2021), have also been used for this purpose. Sony et al (2022) used a sequence of windowed samples extracted from acceleration responses to train an LSTM for damage scenario classification (see Fig. 5). The experimental results demonstrated that the method outperformed the 1D CNN on the Z24 bridge. Xiao et al (2021) optimized a deep autoencoder (DAE) using gray relational analysis to extract high-level features from raw signals, on which a Softmax classifier was trained for the classification. Considering the difficulties in optimizing the weights of deep neural networks, Pathirage et al (2018) developed a framework with two components for this task. The first component reduces the dimensionality of the vibration signals while preserving the necessary information, and the second learns the relationship between the features and the damage. Rastin et al (2021a) presented a two-stage method, in which a deep convolutional GAN trained on intact-state data was first used to detect the existence of damage, which could be quantified by the discriminator’s output. The detected damage was then localized by a conditional GAN trained on labeled data from damaged states.

Fig. 5 The sequence of windowed samples extracted for the training of LSTM (Sony et al 2022)

Other structural modal information, such as natural frequencies and mode shapes (Pathirage et al 2019; Wang et al 2021), is also sensitive to damage. Yang and Huang (2021) introduced the flexibility curvature index, which does not require information about the intact structure, as the input of a CNN to realize damage identification. Nguyen et al (2020) trained a CNN on images of the damage index from the gapped smoothing method to classify the damage location in a numerical beam.

To solve the problem of poor anti-noise ability faced by traditional methods, Guo et al (2020) developed a damage identification method based on DBN. After three restricted Boltzmann machines were pre-trained using the damage index, modal curvature difference, a Softmax classifier and a neural network were employed to identify the damage location and degree, respectively. The experimental results showed that DBN had strong anti-noise ability, compared with backpropagation neural networks.

Fig. 6 Flowchart of the novelty detection strategy proposed by Mousavi and Gandomi (2021b)

Compared with the number of degrees of freedom of a structure, the number of sensors in an SHM system is often limited or even insufficient. The continuous deflection of a bridge measured by a fiber-optic gyroscope, which can cover the whole structure, was therefore used, and a 1D CNN was employed to analyze it for damage classification by Li et al (2020) and Li and Sun (2020). Distributed optical fiber sensors based on Brillouin optical time-domain analysis can readily measure strain distributions along the whole surface of a structure, but their low signal-to-noise ratio limits their application in crack detection. Song et al (2020) thus employed an SAE to extract features from the raw data and trained a Softmax classifier to decide whether micro-cracks exist.

2.3 Novelty detection and quantification

Despite the success of DL models in damage scenario classification, the lack of training data restricts their application in practice. Preparing sufficient training data is laborious and uneconomical in the laboratory and impossible in real engineering. Generally, only normal vibration data can be obtained from new structures, since damage cannot be inflicted on structures in commercial operation, which encourages the application of unsupervised learning. In this kind of method, DL models are trained to reconstruct the vibration signals. Because only signals from intact structures are available for training, the reconstructed signals deviate from the measurements when damage exists, which means that the reconstruction error is a sensitive feature indicating the existence of damage.
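The reconstruction-error logic can be sketched independently of the network that produces the reconstruction. In the snippet below the model is left out entirely: reconstruction errors on healthy data are assumed to be available, and the alarm threshold is set to their mean plus three standard deviations, which is a common heuristic chosen here for illustration rather than a rule taken from the cited studies.

```python
import statistics


def mse(measured, reconstructed):
    """Mean squared error between a measured and a reconstructed signal."""
    return sum((a - b) ** 2 for a, b in zip(measured, reconstructed)) / len(measured)


def fit_threshold(healthy_errors, k=3.0):
    """Alarm threshold: mean + k * standard deviation of healthy-state errors."""
    return statistics.mean(healthy_errors) + k * statistics.pstdev(healthy_errors)


def is_novel(measured, reconstructed, threshold):
    """Flag a segment whose reconstruction error exceeds the threshold."""
    return mse(measured, reconstructed) > threshold


# Hypothetical reconstruction errors observed on intact-structure data.
healthy_errors = [0.010, 0.012, 0.009, 0.011, 0.013, 0.010]
threshold = fit_threshold(healthy_errors)

good_fit = is_novel([1.0, 2.0], [1.0, 2.0], threshold)  # perfect reconstruction
poor_fit = is_novel([1.0, 2.0], [1.5, 2.5], threshold)  # large deviation
```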

After seasonal patterns were removed by variational mode decomposition (VMD) algorithm, Mousavi and Gandomi (2021a) used the natural frequency and corresponding Johansen cointegration residuals of a structure to train RNN, and the prediction errors for new measurements were regarded as the index of damages (see Figs. 6 and 7). They also trained a bidirectional LSTM with healthy structure signals denoised by VMD and their Mahalanobis distances calculated by minimum covariance determinant for this mission (Mousavi and Gandomi 2021b). The method required only a couple of low structural natural frequencies. Therefore, it is recommended for cases when the measurements from the environmental and operational variations are not available.

Fig. 7 Prediction results and errors in the numerical example conducted by Mousavi and Gandomi (2021b)

Apart from reconstruction errors, Lee et al (2021) trained a one-class CNN to detect novelty in acceleration data that had been transformed into images through WT. It was found that the method could only detect damage corresponding to a stiffness reduction of at least 15%. Based on the essential features extracted from the acceleration history by a variational autoencoder, Ma et al (2020) adopted the Euclidean distance between the features of the first segment and those of the other segments as the damage index, whose curve can be inspected for a sudden change caused by damage.
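A distance-based damage index of this kind is straightforward to compute once segment features are available. The sketch below assumes the feature vectors have already been extracted (the encoder is not modeled) and locates the largest jump between consecutive index values; the jump detector `largest_step` is an illustrative helper, not part of the cited method.

```python
def distance_index(features):
    """Euclidean distance of every segment's features to those of the first segment."""
    ref = features[0]
    return [sum((a - b) ** 2 for a, b in zip(ref, f)) ** 0.5 for f in features]


def largest_step(index):
    """Position of the segment where the index rises the most."""
    steps = [index[i + 1] - index[i] for i in range(len(index) - 1)]
    return max(range(len(steps)), key=steps.__getitem__) + 1


# Hypothetical 2D latent features for five consecutive signal segments.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 4.0], [3.1, 4.0]]
index = distance_index(features)
change_at = largest_step(index)
```

In this toy example the index stays near zero for the first three segments and jumps at the fourth, which is the kind of sudden change the damage index curve is meant to reveal.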

To quantify damage, Rastin et al (2021b) trained a convolutional autoencoder using the multi-channel signals acquired from a healthy structure to extract sensitive features and calculated the distance between the features and the reference vectors, but a threshold for the distance needs to be specified according to engineers’ experience. Silva et al (2019) trained an AE to eliminate the influence of environmental factors, and then the structure damage was quantified by calculating the residual between its inputs and outputs.

2.4 The function of spatial information

The methods mentioned above use either spatial relations (e.g., using CNN) or temporal relations (e.g., using LSTM) alone, whereas combining the two may improve the damage identification accuracy significantly. Yang et al (2020) combined a CNN and a gated recurrent unit (GRU) to model both spatial and temporal relations for damage detection and demonstrated the resulting improvement. The CNN was used to model the spatial relations and the short-term temporal dependencies among sensors, while its output features were fed into the GRU to learn the long-term temporal dependencies jointly. Fu et al (2021) used an FCN to fuse the features extracted by a CNN and an LSTM for bridge damage scenario classification (see Fig. 8). The combined model, named CNN-LSTM, reached 94% accuracy for damage localization and an average relative identification error of only 8.0% for damage severity identification, both better than CNN alone. Dang et al (2021) combined underlying features extracted from measured acceleration signals by an autoregression model, discrete wavelet transform, and EMD, and fed them into a proposed hybrid DL framework, named 1D CNN-LSTM, for damage identification. Through three case studies, they demonstrated that the framework achieved accuracy as high as that of a 2D CNN but with lower time and memory complexity. Zhang et al (2022b) leveraged an LSTM-FCN to recognize damaged cables, taking the time series of cable forces and their ratios between cable pairs under intact conditions as the input and the identity number of each cable as the corresponding label.

Fig. 8 The CNN-LSTM-based damage identification proposed by Fu et al (2021)

Graph neural network (GNN) provides another approach to model spatial correlations among sensors. Li et al (2021d) developed a spatiotemporal graph convolutional network to analyze spatiotemporal correlations among cable forces, in which the spatial dependency of the sensors was represented as a directed graph with cable dynamometers as vertices. The learnable adjacency matrix was used to capture the spatial dependency of the locally connected vertices and a 1D CNN was operated along the time axis to capture the temporal dependency. Son et al (2021) mapped cable tension to graph vertices and the connection relationship between sensors to its edges, and trained a GNN framework, the message passing neural network, to localize the damaged cables and estimate their area.
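The core operation shared by these graph-based models is the exchange of information between connected vertices. The sketch below shows one round of a heavily simplified message passing step on an undirected sensor graph: each vertex averages its neighbors' values and blends the result with its own. Learned message and update functions, directed edges, and the learnable adjacency matrix used in the cited works are all replaced here by fixed arithmetic, and the cable names and tension values are invented.

```python
def message_passing_step(features, edges, self_weight=0.5):
    """One round of mean-aggregation message passing on an undirected graph.

    features: dict vertex -> scalar feature (e.g. a normalized cable tension).
    edges: list of (vertex, vertex) pairs.
    """
    neighbors = {v: [] for v in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = {}
    for v, h in features.items():
        if neighbors[v]:
            message = sum(features[u] for u in neighbors[v]) / len(neighbors[v])
        else:
            message = h  # isolated vertex keeps its own value
        updated[v] = self_weight * h + (1.0 - self_weight) * message
    return updated


# Hypothetical chain of four cables; C3 carries an unusually low tension.
tension = {"C1": 1.00, "C2": 0.98, "C3": 0.60, "C4": 1.02}
edges = [("C1", "C2"), ("C2", "C3"), ("C3", "C4")]
h1 = message_passing_step(tension, edges)
```

After one round the low value at C3 has spread to its neighbors C2 and C4, which shows how a vertex representation comes to encode the state of nearby sensors.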

3 Vision-based structural health monitoring

Vibration-based methods rely on dynamic responses measured by contact sensors, such as accelerometers, strain gauges, and fiber optic sensors, which are expensive to install and maintain. The emergence of non-contact sensors, including digital and high-speed cameras, unmanned ground vehicles, and mobile sensors, which are more cost-effective and easier to deploy, provides another promising solution for the SHM of bridges and has attracted much attention in recent years. Unlike contact sensors, non-contact sensors yield images or videos that require advanced image processing techniques to interpret. Traditional image processing methods rely on various edge or boundary detection techniques, such as the Sobel edge detector, morphological detector, and template matching, to extract features from the images. However, these methods often result in ill-posed problems due to disturbances created by environmental conditions, including light, distortion, weather, shade, and occlusion in the outdoor environment (Yao et al 2014).

CV aided by DL helps researchers and engineers overcome the challenges due to their reduced sensitivity to external disturbances and excellent capability in feature extraction. Dong and Catbas (2021) presented a general overview of CV-based SHM at the local level (SHM-LL) and global level (SHM-GL). The former includes applications such as crack, rust, and loose bolt detection or quantification, while the latter means displacement measurement, structural behavior analysis, load monitoring, and damage identification. The relation between SHM-LL and SHM-GL is bidirectional: (i) the process of understanding the input-output structural behavior, which is one of the tasks of SHM-GL, can benefit from the condition assessment from SHM-LL; and (ii) the global condition evaluation and damage detection from SHM-GL can assist the SHM-LL to understand how localized conditions and damage affect the complete system (Dong and Catbas 2021). This section will review the recent applications of DL in CV-based SHM from the two perspectives.

3.1 SHM-LL

3.1.1 Image classification

Identifying whether defects exist in the image and classifying images according to the defects they contain are effective ways of detecting surface damage. Benefiting from DL’s excellent performance in image classification, many researchers pay attention to its application in SHM and have obtained impressive achievements.

Quqa et al (2022) trained a CNN to classify images of the welding joints of a long-span steel bridge as damaged or undamaged. Ebenezer et al (2021) developed an ensemble of three CNN models (a custom CNN, Xception, and AlexNet) with a majority voting scheme to improve the classification accuracy for concrete deterioration in bridges, and a validation accuracy of 87.1% was achieved. Transfer learning is an effective way to accelerate the training of DL models and improve their accuracy even with less training data. Several pre-trained networks, including VGG-16 (Perez et al 2019), Inception v3 (Zhu et al 2020), and GoogLeNet, have been used for this purpose (Holm et al 2020; Chen 2021; Savino and Tondolo 2021). Savino and Tondolo (2021) fine-tuned eight pre-trained CNNs, namely AlexNet, SqueezeNet, ShuffleNet, ResNet-18, GoogLeNet, ResNet-50, MobileNet-v2, and NASNet-mobile, to classify concrete surface damage, and GoogLeNet reached the highest accuracy, 94%. The emergence of the attention mechanism has further improved the performance of DL models, and some new methods integrating it have been proposed. For example, a convolution-based multi-damage recognition neural network that combines a CNN with an attention network and hybrid pooling layers was developed by Shin et al (2020) to classify five damage types, and an accuracy of 98.9% was achieved. Cui et al (2021a) proposed a geometric attention regulation method, in which the bearing location marked by a bounding box works as an attention mechanism that indicates the important part of the input image. The experiments showed that the method could enhance the CNN’s performance effectively.
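The majority voting scheme used to combine an ensemble of classifiers can be sketched as follows. The labels and votes are invented, and the tie-breaking behavior (the label encountered first among the tied ones wins) is a property of this sketch, not something stated in the cited work.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the ensemble members' predictions."""
    counts = Counter(labels)
    # most_common keeps first-encountered order among equally frequent labels.
    return counts.most_common(1)[0][0]


# Hypothetical predictions of three models for three images.
votes_per_image = [
    ["crack", "crack", "intact"],
    ["intact", "spalling", "intact"],
    ["spalling", "spalling", "spalling"],
]
fused = [majority_vote(votes) for votes in votes_per_image]
```

With an odd number of members and two classes a tie cannot occur; with more classes, a probability-averaging rule is a common alternative when ties matter.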

Most of the existing methods perform well in detecting surface defects from optical images, but there is still a lack of systems that can identify subsurface damage, such as concealed cracks (particularly bottom-up cracks) and debonding between paint and steel surfaces. To address this gap, Ali and Cha (2019) fed thermal images into a deep inception neural network to detect subsurface damage in a steel truss bridge, including corrosion and debonding between paint and steel surface (see Fig. 9).

Fig. 9 The subsurface damage detection method proposed by Ali and Cha (2019)

Despite the advantages CNN shows in image classification, environmental effects still hinder its application in practice. To further improve the accuracy, Qiao et al (2021) designed a new algorithm, called EMA-DenseNet, by adding the expected maximum attention (EMA) module to a DenseNet. In addition, a new loss function that considers the connectivity of pixels was designed to reduce break points in the predicted fractures. The experiments showed that the mean pixel accuracy, mean intersection over union, precision, and frames per second of the network reached 87.42%, 92.59%, 81.97%, and 25.4, respectively.
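For readers unfamiliar with pixel-level metrics, the sketch below computes mean pixel accuracy and mean intersection over union from a confusion matrix using widely used textbook definitions. These definitions are an assumption: individual papers sometimes define or average the metrics differently, so the snippet is not a reproduction of the evaluation behind the figures quoted above, and the confusion matrix is invented.

```python
def segmentation_metrics(conf):
    """Mean pixel accuracy and mean IoU from a confusion matrix.

    conf[i][j]: number of pixels of true class i predicted as class j.
    Assumes every class appears at least once in the ground truth.
    """
    n = len(conf)
    acc, iou = [], []
    for c in range(n):
        tp = conf[c][c]
        actual = sum(conf[c])                           # pixels truly of class c
        predicted = sum(conf[r][c] for r in range(n))   # pixels predicted as class c
        acc.append(tp / actual)
        iou.append(tp / (actual + predicted - tp))
    return sum(acc) / n, sum(iou) / n


# Hypothetical two-class case: background (class 0) and crack (class 1).
conf = [[90, 10],
        [5, 45]]
mpa, miou = segmentation_metrics(conf)
```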

Another problem CNN faces is that its receptive field is generally so small that many stacked layers are needed to cover the whole image. Compared with CNN, the transformer has great flexibility in modeling global context and introduces less inductive bias, but its self-attention mechanism brings a heavy computational cost. To address this issue in classifying defects of reinforced concrete bridges, Wang and Su (2022) proposed a hybrid network by inserting a transformer into the CNN backbone, followed by a multilayer perceptron that generates the final classification results. Experimental results showed F1 scores of 0.949, 0.896, 0.776, 0.844, 0.745, and 0.899 for the six damage types, respectively, which are higher than those of four other networks: EfficientNet B1, RegNetX-800MF, MobileNet V3, and ReXNet.
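Per-class F1 scores such as those quoted above combine precision and recall into a single number. The following sketch shows the standard computation from true positive, false positive, and false negative counts; the damage types and counts are invented for illustration.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts for two damage types.
counts = {"crack": (80, 10, 20), "spalling": (45, 15, 5)}
scores = {name: f1_score(*c) for name, c in counts.items()}
macro_f1 = sum(scores.values()) / len(scores)
```

Reporting one score per damage type, as in the study above, exposes classes that a single overall accuracy figure would hide.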

3.1.2 Object detection

Unlike image classification, object detection techniques can identify several types of damage contained in the same image. Region-CNN (R-CNN) and you only look once (YOLO) are the models most commonly adopted for this purpose.

Deng et al (2020a, 2021) applied Faster R-CNN and YOLO v2 to label cracks and handwriting contained in raw images, respectively, and the comparative study showed that YOLO v2 performs better in terms of both accuracy and inference speed. Cui et al (2021b) trained YOLO v3 to identify wind erosion areas on the concrete surface, and an accuracy of 96.32% was achieved. Zhang et al (2020) transferred YOLO v3 with fully pre-trained weights from a geometrically similar dataset to detect four types of concrete damages (i.e. crack, pop-out, spalling, and exposed rebar), and proved that it outperforms the original YOLO v3 and Faster R-CNN with ResNet-101. Mondal et al (2020) compared the performance of four Faster R-CNN models, including Inception v2, ResNet-50, ResNet-101, and Inception-ResNet-v2, in detecting four different damage types and found that Inception-ResNet-v2 significantly outperforms the other networks in the mission.
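Accuracy comparisons between detectors generally depend on how a predicted box is matched to an annotated one, and intersection over union (IoU) is the usual matching criterion. The studies above are not described at that level of detail, so the snippet below is only a general sketch of the IoU computation with invented box coordinates.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)
annotated = (30, 30, 70, 70)
iou = box_iou(predicted, annotated)
```

A detection is typically counted as correct when its IoU with an annotated box of the same class exceeds a chosen threshold, often 0.5.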

3.1.3 Semantic segmentation

Semantic segmentation, which labels each pixel of the image with one of the pre-defined labels, enables researchers to mark the damage location and shape more precisely. Ye et al (2019) demonstrated the superiority of DL-based methods in crack segmentation by comparing the performance of an FCN called Ci-Net with that of traditional edge detection algorithms. Rubio et al (2019) used an FCN to segment delamination and rebar exposure from bridge inspection images, but the method could not accurately detect small damaged areas.

To increase the segmentation accuracy for cracks in images with complicated backgrounds, non-uniform illumination, irregular shapes, and interference, various modifications of standard networks have been explored. Lee et al (2020b) introduced a crack-like kernel, which is rectangular rather than square, to SegNet so that it could extract features representing cracks more precisely. Miao et al (2019) inserted a combined squeeze-and-excitation (SE) and ResNet block into a U-Net to improve its performance in segmenting spalls and cracks. Jiang et al (2021a) proposed HDCB-Net, a network with a hybrid dilated convolutional block (HDCB), to expand the receptive field of the convolution kernel and to avoid the gridding effect generated by dilated convolution. Furthermore, a two-stage strategy was proposed to realize fast crack detection: in the first stage, YOLO v4 was employed to filter out images without cracks and generate coarse region proposals, from which HDCB-Net then detected pixel-level cracks in the second stage.

The digital images acquired through unmanned aerial vehicles (UAVs) often suffer from motion blur, which may degrade crack detectability. Bae et al (2021) proposed an end-to-end deep super-resolution crack network to resolve this problem. In the first stage, a super-resolution image was generated from the corresponding raw image using a CNN with residual groups and upscaling layers; in the second stage, this image was segmented by a CNN-based DAE. The validation test on concrete bridges demonstrated that a 24% improvement in detection accuracy was achieved compared with the crack detection results using raw digital images.

The significant imbalance between background and crack pixels tends to produce networks that classify background pixels well but identify cracks poorly. To improve the robustness of a fully convolutional encoder-decoder neural network against this imbalance, Sajedi and Liang (2019) investigated three optimization strategies: uniform weights with maximum a-posteriori probabilities (UW-MAP), median frequency weights with MAP (MFW-MAP), and uniform weights with maximum likelihood (UW-ML). They found that the UW-ML strategy achieved the best results among them. To overcome the same challenge, Han et al (2020) designed a crack segmentation network combining U-Net with a ternary classifier, which significantly reduced the false positive rate. Deng et al (2020b) adopted a weight-balanced intersection over union (IoU) loss function rather than cross-entropy loss or focal loss in training the link atrous spatial pyramid pooling (ASPP) network, in which a modified ASPP module was introduced to LinkNet for segmenting tiny damages.
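
To make the effect of class weighting concrete, the following minimal sketch computes median-frequency weights for a toy binary mask and applies them in a weighted cross-entropy. The function names, the toy mask, and the simplified frequency definition (pixel share over a single mask) are assumptions made for illustration; this is not the formulation used by Sajedi and Liang (2019).

```python
import math
from statistics import median

def median_frequency_weights(mask, num_classes=2):
    """Weight each class by median(freq) / freq(class) (simplified, assumed form)."""
    flat = [label for row in mask for label in row]
    freqs = [flat.count(c) / len(flat) for c in range(num_classes)]
    med = median(freqs)
    return [med / f if f > 0 else 0.0 for f in freqs]

def weighted_cross_entropy(mask, probs, weights):
    """Mean weighted negative log-likelihood; probs[i][j] is P(pixel is crack)."""
    total, n = 0.0, 0
    for row_m, row_p in zip(mask, probs):
        for label, p_crack in zip(row_m, row_p):
            p = p_crack if label == 1 else 1.0 - p_crack
            total += -weights[label] * math.log(max(p, 1e-12))
            n += 1
    return total / n

# Toy 4x5 mask: 18 background pixels (0) and 2 crack pixels (1).
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
weights = median_frequency_weights(mask)

# A predictor that assigns P(crack) = 0.1 everywhere misses both crack pixels.
probs = [[0.1] * 5 for _ in range(4)]
loss_uniform = weighted_cross_entropy(mask, probs, [1.0, 1.0])
loss_weighted = weighted_cross_entropy(mask, probs, weights)
```

With these weights the rare crack class counts roughly nine times as much as the background, so a predictor that ignores cracks is penalized far more heavily than under uniform weighting.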

3.1.4 Damage quantification

After damage is detected, quantifying its severity becomes another important task for evaluating a structure’s condition correctly. For cracks, width and length are the most typical parameters. The segmentation results display cracks clearly and have thus become the basis of most crack quantification methods. After the binary maps of cracks were obtained by a dual-scale CNN, Ni et al (2019) proposed a crack width estimation method based on the Zernike moment operator, but it appears to perform poorly for cracks narrower than 2 pixels and under adverse conditions (e.g., dark lighting), and it is also time-consuming. Yang et al (2021) employed a CNN combined with U-Net to extract crack pixels and their midline. The non-uniform width along the crack was extracted according to the proposed crack-width direction identification method, and pixel calibration experiments were then conducted to establish a nonlinear mapping model among pixel size, shooting distance, and focal length, based on which the actual width of the cracks could be obtained. The verification experiments showed that the recognition precision reached 0.01 mm.

Counting the proportion of pixels belonging to defects among all pixels is a workable way to quantify damage such as corrosion. Wang et al (2020) proposed a standardized structural health evaluation method and used it to quantify the damage in photos of a steel box girder; the photos were synthesized into panoramas by image stitching, and a U-Net was employed to segment the defects in them. For bolt-loosening quantification, a Hough line transform-based image processing algorithm was designed to estimate the bolt angles from the bolt images cropped by R-CNN (Huynh et al 2019). Huynh (2021) designed an autonomous vision-based bolt-looseness detection method with a Faster R-CNN-based bolt detector, an automatic distortion corrector, an adaptive bolt-angle estimator, and a bolt-looseness classifier. The method was then applied to a realistic joint of the Dragon Bridge in Danang, Vietnam.
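
The pixel-proportion measure mentioned at the start of this paragraph can be sketched in a few lines. The mask layout, the label convention (1 for damaged pixels), and the function name are hypothetical and serve only as an illustration.

```python
def damage_ratio(mask, damage_label=1):
    """Fraction of pixels in a segmentation mask that carry the damage label."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    damaged = sum(1 for row in mask for label in row if label == damage_label)
    return damaged / total

# Hypothetical 3x4 segmentation output: 4 of the 12 pixels are labeled as damaged.
mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
ratio = damage_ratio(mask)
```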

3.2 SHM-GL

3.2.1 Vibration monitoring

Apart from detecting visible damage, vision-based methods are also an efficient way to provide vibration signals for identifying invisible damage. Deng et al (2020c) developed an intelligent non-contact remote sensing method in which a uniaxial automatic cruise acquisition device was designed to collect image sequences from the bridge surface before they were fed into a three-dimensional (3D) CNN to identify the envelope spectrum of the holographic deformation. Then, the deflection curvature difference was used to identify changes in damage location and degree. Their experiments demonstrated that the holographic deformation is more sensitive for damage identification than measurements from a limited number of measuring points.

Furthermore, cable force estimation for urban bridges from drone-captured video has been realized by Zhang et al (2021). First, a pre-trained FCN was adopted to identify bridge cables and extract their displacement. Then, EMD was employed to extract the cable vibration signals and eliminate the effect of drone motion. Finally, the natural frequencies of the cables were obtained by performing Fourier analysis on the extracted cable vibration and were then used for cable force estimation.
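
The last two steps of this pipeline, frequency identification and force estimation, can be sketched as follows. The sketch assumes the taut-string relation T = 4 m L^2 (f_n / n)^2, which is a common approximation; the source does not state which relation Zhang et al (2021) used. The cable length, mass per unit length, and synthetic displacement signal are hypothetical, and the drone-motion removal step is omitted.

```python
import cmath
import math

def dominant_frequency(signal, fs):
    """Return the frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the static offset
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs / n

def taut_string_force(freq_hz, mode, length_m, mass_per_m):
    """Taut-string estimate T = 4 m L^2 (f_n / n)^2, in newtons (assumed model)."""
    return 4.0 * mass_per_m * length_m ** 2 * (freq_hz / mode) ** 2

# Synthetic cable displacement: a 1.2 Hz sine sampled at 50 Hz for 10 s.
fs = 50.0
signal = [2.0 * math.sin(2 * math.pi * 1.2 * i / fs) for i in range(500)]
f1 = dominant_frequency(signal, fs)

# Hypothetical cable: 100 m long, 50 kg/m, first mode.
force_n = taut_string_force(f1, mode=1, length_m=100.0, mass_per_m=50.0)
```

The naive DFT is used only to keep the sketch dependency-free; a fast Fourier transform from a numerical library would normally be preferred for long records.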

In traditional vision-based vibration measurement methods, template matching and corner detection algorithms are usually used to track and locate the target, but they are sensitive to image quality, which is often poor due to insufficient illumination or fog. Xu et al (2021) thus proposed a distraction-free displacement measurement approach by integrating a DL-based Siamese tracker with correlation-based template matching. The DL-based Siamese tracker applied deep feature representations and learned similarity measures for image matching, and also considered adaptive template updates over time. The method was then implemented on a short-span footbridge and a long-span road bridge, where its potential to handle challenging scenarios, including illumination changes, background variations, and shade effects, was demonstrated. Shao et al (2021) combined the MagicPoint network and the SuperGlue network to achieve target-free full-field 3D vibration displacement measurement and demonstrated that the combination is accurate compared with traditional sensors while being more cost-effective. Furthermore, they (Shao et al 2022) employed a phase-based video motion magnification algorithm to achieve higher accuracy for tiny vibrations at the submillimeter level.

3.2.2 Component identification

After various damages are detected, the rating of a structure needs to be provided by a comprehensive assessment in which the importance of different components is considered (Zhu et al 2010). This requires spatially relating the identified damages to structural elements. However, inspection images, especially those captured by aerial inspection platforms, usually contain complex scenes in which structural elements mix with a cluttered background. Extracting structural elements from complex images and sorting them is thus meaningful for SHM.

With a small dataset labeled by inspectors, Karim et al (2021) transferred a Mask R-CNN to segment multi-class bridge components from the videos captured by a UAV. False negatives were recovered by temporal coherence analysis, and a semi-supervised self-training method was developed to engage experienced inspectors in refining the network. The model’s performance reached 91.8% precision, 93.6% recall, and 92.7% F1-score.
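
As a quick consistency check on these figures, the F1-score is the harmonic mean of precision and recall. The short sketch below (the function name is chosen here for illustration) recomputes it from the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall as reported for Karim et al (2021).
f1 = f1_score(0.918, 0.936)
f1_percent = round(100 * f1, 1)  # 92.7, matching the reported F1-score
```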

Point clouds in 3D space can also provide sufficient information for this purpose. Kim et al (2020) extracted a high-resolution set of point clouds from a full-scale bridge by subspace partition and employed PointNet to classify the points in each subspace. Kim and Kim (2020) compared the performance of three DL models, PointNet, PointCNN, and dynamic graph CNN (DGCNN), in classifying a point cloud of bridge components and found that the mean intersection over union (mIoU) of DGCNN was 86.85, which is higher than that of the others (see Fig. 10).

Fig. 10 Identification results of point clouds in the research of Kim and Kim (2020)
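
The metric used in the point-cloud comparison above is commonly known as mean intersection over union (mIoU). A minimal sketch of its computation for per-point class labels is given below; the label lists and the three classes are toy values, not data from Kim and Kim (2020).

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average, over classes that occur, of |intersection| / |union| per class."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes absent from both label sets
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy per-point labels for eight points and three hypothetical component classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
miou = mean_iou(y_true, y_pred, num_classes=3)
```

Here the per-class IoU values are 2/4, 2/3, and 2/3, so the mean is 11/18, or about 0.61.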

3.2.3 External load

Moving vehicles are one of the main sources of live loads on bridges, and gathering their information is essential for SHM. Bridge weigh-in-motion, which exploits bridge components (e.g., decks, girders, and vertical stiffeners) as weighing scales, is the most frequently adopted solution for this purpose, and DL brings efficient solutions for some of its drawbacks.

Zhang et al (2019b) proposed a novel methodology for this task, in which a Faster R-CNN transferred from ImageNet was employed to detect different types of vehicles frame by frame. A multiple-object tracking algorithm then tracked vehicles across frames and generated an information sequence containing each vehicle’s coordinates, type, lane number, and frame number. Next, an image calibration method based on moving standard vehicles was developed to calculate the vehicle length and speed. After these parameters were acquired, the spatiotemporal information could be obtained from the vehicle location under the hypothesis of constant speed (see Fig. 11).

Fig. 11 The framework for obtaining the spatiotemporal information of vehicles by Zhang et al (2019b)
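
Under the constant-speed hypothesis, a vehicle's position at any instant follows from a straight line fitted to its tracked positions. The sketch below illustrates this idea with a simple least-squares fit; the frame rate, the positions, and the function name are hypothetical, and the sketch does not reproduce the calibration procedure of Zhang et al (2019b).

```python
def fit_constant_speed(times_s, positions_m):
    """Least-squares line x(t) = x0 + v * t through tracked positions."""
    n = len(times_s)
    t_mean = sum(times_s) / n
    x_mean = sum(positions_m) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in zip(times_s, positions_m))
    den = sum((t - t_mean) ** 2 for t in times_s)
    speed = num / den
    return speed, x_mean - speed * t_mean

# Hypothetical track: positions along the lane (m) in four frames at 25 fps.
times_s = [0.00, 0.04, 0.08, 0.12]
positions_m = [0.0, 0.8, 1.6, 2.4]
speed, x0 = fit_constant_speed(times_s, positions_m)

# Extrapolated position one second after the first frame.
position_at_1s = x0 + speed * 1.0
```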

However, the weight of vehicles cannot be obtained using the method proposed by Zhang et al (2019b). Jian et al (2019) combined CV with influence line theory to acquire the time-spatial distribution of vehicle loads on bridges. YOLO v3 was used to identify vehicle positions, types, and axle numbers. Then, vehicle weight was calculated by combining the strain influence line calibrated by field tests with the strain time-history. However, since only three scenarios of vehicle distribution were taken into consideration, the method may face obstacles in complicated traffic scenarios. To overcome this problem, a least squares-based identification method that can utilize the redundant strain data measured by a network of strain sensors was proposed to distinguish complicated traffic modes and reduce the recognition errors by solving the overdetermined inverse influence equations (Pathirage et al 2019).
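
The idea of solving overdetermined influence equations in the least-squares sense can be illustrated with a small example. In the sketch below each strain reading is modeled as a weighted sum of influence-line ordinates, one per axle; the ordinates, the weights, and the solver name are hypothetical, and the formulation is a generic one rather than that of the cited method.

```python
def least_squares(A, b):
    """Solve min ||A w - b||^2 via the normal equations (A^T A) w = A^T b."""
    m, n = len(A), len(A[0])
    # Augmented normal-equation matrix [A^T A | A^T b].
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         + [sum(A[k][i] * b[k] for k in range(m))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical influence ordinates (microstrain per kN): 4 readings, 2 axles.
A = [
    [0.5, 0.1],
    [0.3, 0.4],
    [0.2, 0.6],
    [0.1, 0.2],
]
# Strain readings generated from assumed axle weights of 60 kN and 110 kN.
b = [41.0, 62.0, 78.0, 28.0]
axle_weights = least_squares(A, b)
```

Because there are more readings than unknowns, measurement noise in individual sensors is averaged out rather than propagated directly into the weight estimates.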

An approach for obtaining the spatiotemporal information of vehicles on bridges based on 3D bounding box reconstruction was also proposed by Zhu et al (2021), in which CNN and YOLO were used to detect vehicles and obtain their 2D bounding boxes. A 3D bounding box reconstruction method based on the relationship between 2D and 3D bounding boxes was then developed to obtain the size and position of vehicles, and the spatiotemporal information of each vehicle was finally obtained by using a multiple-object tracking algorithm.

4 Application of DL in real bridges

The capability of DL encourages the exploration of various approaches that can overcome the challenges in traditional SHM, but most of them have been verified only in simulations or laboratories. It cannot be denied that more details, such as the platform used to collect images and programs with a user interface, need to be taken into consideration to promote the application of these methods in practice (Xu 2018). This section summarizes some efforts devoted to dealing with these details, together with DL-based systems that have been applied to actual structures.

A framework for autonomous bridge inspection using a UAV was proposed and applied to the Pahtajokk Bridge by Mirzazade et al (2021). Planning the most efficient flight path that could cover the damaged field with the minimum number of images was the first step. Then, three CNN models, SegNet, Inception v3, and U-Net, were trained to conduct bridge component detection, damage area recognition, and crack segmentation, respectively. The third step was to generate a dense point cloud for the damaged areas via intelligent hierarchical dense structure from motion and align it to the overall point cloud for the construction of the digital model of the bridge. Finally, damages were quantified based on the global coordinates of the detected damages.

Kruachottikul et al (2021) described a DL-based visual defect inspection system for reinforced concrete bridges, which consisted of four components. A mobile phone that could take photos was the first part. The second part identified images with defects via a modified ResNet-50, and the defects were classified using another modified ResNet-50 in the third part. Finally, damage severity was quantified by an ANN in the last part. The system’s accuracies for defect detection, classification, and severity prediction were 90.4%, 81%, and 78%, respectively, and the system has been accepted by Thailand’s Department of Highways for practical use.

Jang et al (2021) developed a ring-type climbing robot system composed of multiple cameras, a climbing robot, and a control computer. The raw images captured under close-up scanning conditions were processed through feature control-based image stitching, DL-based semantic segmentation, and Euclidean distance transform-based crack quantification algorithms, based on which a digital crack map of the target bridge pier could be established. The tests conducted on the Jang-Duck bridge in South Korea revealed that the method successfully evaluated cracks of the bridge pier with a precision of 90.92% and a recall of 97.47%.

Considering how difficult it is for inspectors to approach some parts of bridges, such as the bottom of decks, He et al (2022) proposed a smart unmanned surface vessel (USV) system for damage detection (see Fig. 12). A novel anchor-free network, CenWholeNet, which focuses on center points and holistic information, was proposed, and a parallel attention module was introduced into the model. For the platform, a USV system that operates without global positioning system (GPS) navigation and supports real-time transmission of lidar and video information was designed.

Fig. 12 The system developed by He et al (2022)

Vehicle-assisted monitoring is a promising alternative for rapid and low-cost bridge health monitoring compared with instrumentation installed on bridges. Sarwar and Cantero (2021) developed an indirect bridge monitoring system, in which a DAE was trained by the vertical acceleration responses of a fleet of vehicles passing over a healthy bridge. Then, the Kullback-Leibler divergence between the measured and the reconstructed signals was used for damage detection and severity quantification.
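
One way to turn the comparison of measured and reconstructed signals into a scalar damage indicator is sketched below: both signals are binned into amplitude histograms and the Kullback-Leibler divergence between the resulting distributions is computed. The binning scheme, the smoothing constant, and the toy signals are assumptions for illustration; the source does not specify how Sarwar and Cantero (2021) estimated the distributions.

```python
import math

def histogram(signal, edges):
    """Count samples per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for x in signal:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms, with additive smoothing eps."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl

edges = [-1.5, -0.75, -0.25, 0.25, 0.75, 1.5]
measured = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.0, -0.5, 0.5]
# Toy reconstructions: a close one (healthy case) and a flattened one (damaged case).
recon_close = [-0.9, -0.6, 0.1, 0.4, 0.9, -0.1, -0.4, 0.6]
recon_poor = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.0, -0.1, 0.1]

p = histogram(measured, edges)
kl_close = kl_divergence(p, histogram(recon_close, edges))
kl_poor = kl_divergence(p, histogram(recon_poor, edges))
```

A reconstruction whose amplitude distribution matches the measurement yields a divergence of zero, while a poorly reconstructed signal yields a clearly larger value that can be thresholded or tracked over time.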

Mobile devices such as smartphones can serve not only as a sensing platform but also as a computing platform for on-site damage detection. However, due to the limited computing resources of mobile devices, the size of the DNN needs to be reduced. Ye et al (2022) developed a pruned crack recognition network by reducing the DNN size via pruning and designed a DL-based crack detection program for smartphones. To conduct crack detection on Internet of Things (IoT) devices in real time, Kim et al (2021) proposed OleNet by fine-tuning the hyperparameters of LeNet-5. Compared with other pretrained DL models, including VGG16, Inception, and ResNet, OleNet achieved the highest accuracy, 99.8%, with the least computation. Shrestha and Dang (2020) developed a program integrated with a CNN to realize accurate and real-time bridge vibration classification from the multi-channel time-series signals acquired by the built-in accelerometers of smartphones.
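
To illustrate how pruning reduces network size, the sketch below applies magnitude pruning, one common scheme in which the weights with the smallest absolute values are set to zero. The source does not specify which pruning method Ye et al (2022) used, so the scheme, the function name, and the toy weight matrix are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the given fraction of weights with smallest magnitude.

    Ties at the threshold are all pruned, so slightly more may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [list(row) for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

# Toy 2x3 layer; pruning half of the weights keeps the three largest magnitudes.
layer = [
    [0.8, -0.05, 0.3],
    [-0.01, 0.6, -0.2],
]
pruned = magnitude_prune(layer, sparsity=0.5)
```

In practice the zeroed weights are stored in a sparse format or removed structurally, and the network is usually fine-tuned afterwards to recover accuracy.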

5 Conclusions

In this paper, the applications of DL models in SHM, particularly in damage detection of bridges, have been summarized systematically. These applications, in laboratories as well as on real bridges, demonstrate the excellent capability of DL models in addressing obstacles in the traditional SHM methods for bridges. Each of the DL models promotes the realization of a more intelligent SHM. However, it cannot be denied that drawbacks exist in every method. Some of the challenges are listed as follows:

  1. Most of the current studies consider only one type of monitoring data in damage detection. If this type of monitoring data is abnormal, the damage detection will fail no matter how good the damage detection method is.

  2. Although several attempts have been made to realize the targets by unsupervised learning, most of the applications still rely on pre-defined damage scenarios and training data, which demand considerable engineering experience and labor.

  3. The conditions of laboratories, where the majority of methods were validated, are idealized. The robustness of DL models needs to be further enhanced to combat environmental interference in practice, such as the vibration induced by external loads and motion blur when UAVs are employed.

  4. The weak connection between the two levels of vision-based SHM results in difficulties in comprehensive condition assessment, for which visible defects and invisible damages need to be considered at the same time.

Considering the limitations listed above and recent achievements in DL, the following directions are promising and worth further investigation:

  1. Fusing multiple types of information collected by the SHM system: With advances in multiple types of sensors, the SHM system can provide multiple types of structural information. Fusing and leveraging these types of information in structural condition assessment via DL methods is a promising way to enhance the methods’ practicality.

  2. Building larger training databases collected from the real world: Training DL models with data containing actual interference is an efficient path to improving their robustness, and the availability of advanced sensors and UAVs nowadays makes it possible to build larger databases consisting of real samples.

  3. Utilization of mobile and IoT devices: Mobile devices, such as smartphones, can serve not only as a sensing platform with various built-in sensors, including magnetometer, gyroscope, accelerometer, and GPS, but also as a computing platform. Leveraging them by deploying lightweight DL models makes on-site damage detection possible. In addition, IoT devices, which emerge with the innovation in data transmission and cloud-based computation, provide an efficient way to obtain and integrate different types of structural data, which will promote low-cost and automated SHM.

  4. Digital twin: To make a reliable assessment reflecting the true condition of structural elements, an ensemble of multi-scale DL models is needed to interpret and integrate the data from both the local level and the global level of SHM. A digital twin, which tries to replicate a physical entity in the digital world (Lin et al 2021), provides a powerful platform for this task, in which various damages can be reconstructed and evaluated at the same time. Integrating SHM and the digital twin may be a promising way to realize smart civil structures and even smart cities.