Introduction

Bridges are essential pieces of transportation infrastructure built over bodies of water, valleys, or other obstacles, and concrete bridges are extremely prevalent [1]. As the number of concrete bridges grows, so does the workload for health inspection and maintenance [2]. Concrete cracks frequently arise during construction and operation because of internal and external factors such as temperature, shrinkage, or foundation deformation [3]. If cracks are allowed to expand, they can lead to failure of the concrete structure and cause safety accidents, and an accident on a bridge is often a major one. Most current bridge damage detection methods [4] rely on visually inspecting the structure or performing non-destructive tests on the bridge itself. The inspection procedures now in place depend heavily on the expertise of qualified staff [5], especially for high bridges spanning mountainous terrain, where it is challenging for staff to work at height in bad weather. Therefore, it is of great significance to propose a reasonable and effective method to detect bridge surface cracks.

Methods for detecting cracks are plentiful, and they are not restricted to cracks on bridge surfaces. Priewald et al. [6] proposed a method that uses magnetic flux leakage measurements to reconstruct two-dimensional profiles of arbitrarily shaped defects in steel plates. Vincitorio et al. [7] presented a lens-free Fourier digital holographic interferometry technique for optical crack detection, which applied non-uniform thermal loads to the object under study and effectively improved the accuracy of crack detection on rough surfaces. Kumar et al. [8] traded off crack localization against the smoothness of wavelet coefficients by choosing the optimal wavelet analysis scale, and obtained high-resolution beam deflection measurements from photographs, which makes the method suitable for localizing multiple cracks. All of the above methods rely on physical properties to detect cracks, and they can only be applied under specific conditions.

The digital image-based approach has been widely employed in crack detection [9]. The main idea is to exploit the grayscale difference between crack pixels and the bridge surface background. Detecting bridge cracks and segmenting them from the bridge surface therefore requires substantial feature engineering [10]. Crack detection via image processing can be challenging because cracks are often irregular in shape and size and because the acquired images contain other noise, such as varying illumination, shadows, flaws, and concrete spalling. Consequently, a variety of image-based crack detection techniques have been proposed, including camera-based image processing [11], IR-based image processing [12], ultrasonic image processing [13], and laser image processing [14]. The listed image processing methods [12,13,14] achieved satisfactory performance on self-built datasets. However, the practical applicability of these methods remains unclear [15], as they depend on the illumination conditions, resolution, and noise level of the images.

In recent years, the rapid rise of deep learning has driven major advances in computer vision. Image classification [16], object detection [17], and semantic segmentation [18] have all achieved breakthroughs. Image classification determines what objects are present in an image. Object detection determines the location and class of each object that appears. Semantic segmentation identifies the category to which each pixel belongs. Applying object detection to cracks on bridge surfaces reduces the safety hazards of working at height [19]. Real-time bridge crack detection maintains the accuracy of crack identification while improving detection efficiency, significantly reducing time and labor costs.

Before deep learning was adopted for object detection, traditional object detection algorithms were divided into three stages: region selection, feature extraction, and feature classification. First, candidate object locations in the image are selected using a sliding window algorithm. Then, features are extracted at each candidate location using manually designed extractors. Finally, the extracted features are classified using algorithms such as SVM [20] or AdaBoost [21]. The process of detecting surface cracks on bridges using a traditional object detection algorithm is shown in Fig. 1. Deep neural networks with a large number of parameters can extract more valuable features, which has prompted the application of deep learning to object detection. Faster R-CNN [22] introduced the idea of anchors, which pushed object detection to its first peak. The YOLO algorithm [17] achieved one-stage detection without anchors. The feature pyramid network [23] provided a superior feature extraction network using feature pyramids, and Mask R-CNN achieved instance segmentation while improving detection performance. In 2020, YOLO v4 improved upon previous versions and strengthened detection performance. To detect cracks on bridge surfaces with such an algorithm, the cracks are treated as the objects and are located in the image with bounding boxes.

Fig. 1 Detection of bridge surface cracks using traditional object detection algorithms

The most effective crack detection algorithms based on object detection are Faster R-CNN and YOLO. Faster R-CNN is a classical two-stage algorithm: the first stage finds locations where objects occur and produces proposal boxes, and the second stage classifies the proposal boxes and refines their locations. YOLO is a one-stage algorithm that combines the two steps into one, predicting the location and category of objects in a single stage. In general, one-stage algorithms provide faster detection at the expense of some accuracy. Several researchers have employed object detection algorithms to detect cracks. Suh et al. [24] introduced a multi-type crack detection method based on Faster R-CNN, which improved the original structure of Faster R-CNN and used seven images of various structures for validation, enabling real-time detection and localization of multiple forms of cracks. Jiang et al. [25] used the SSDLite algorithm for crack detection. Mandal et al. [26] designed a system based on YOLO v2 for pavement detection. Nonetheless, the accuracy of these methods is still unsatisfactory and needs to be improved.

The primary objective of this paper is to detect bridge surface cracks using advanced deep learning object detection algorithms. First, a bridge surface crack dataset was constructed for training and validation, and YOLO v4 [27] was applied to detect cracks in the surface of concrete bridges. Because YOLO v4 has a complex structure and numerous parameters, several lightweight networks were substituted for its feature extraction network. The bridge crack detection methods based on lightweight YOLO v4 were tested, and the model evaluation indexes were recorded. Moreover, an improved YOLO v4 bridge crack detection method was proposed to balance the detection speed against the loss of detection accuracy caused by the introduction of lightweight networks. Lastly, we evaluated and tested the proposed method and discussed the detection details.

Methodology

Dataset of bridge surface cracks

Extracting high-level, abstract features from raw data is complicated. Han et al. [28] noted that a typical example of a deep learning model is the multilayer perceptron. One way to understand deep learning is as learning the right representation of the data. Thus, the quality of the images plays a vital role in the final training results of the model. The experimental data were taken from bridges around Guizhou University. About 800 photos of bridge cracks were collected manually using a digital camera, at a resolution of 1664 × 1664 pixels and in JPG format. We collected images of cracks under different weather conditions, including sunny, cloudy, and rainy days. For the feature extraction network, a resolution of 1664 × 1664 pixels is large and would cause excessive memory overhead and strain the hardware. Consequently, image compression software was used to reduce the photographs to 416 × 416 pixels. Some of the collected images of bridge surface cracks are displayed in Fig. 2.

Fig. 2 Partial sample image of bridge surface crack data set

For object detection algorithms, model overfitting is a frequent phenomenon. Although the overfitting of the network can be mitigated by using various techniques, none of them is as effective as increasing the richness of the data. Data augmentation can be attempted for the collected pictures of cracks on the bridge's surface in two ways: geometric transformation and data source expansion. Geometric transformations can enrich the position and scale of objects appearing in the image, thus satisfying the model's translation and scale invariance.

To expand a dataset, the detection object can sometimes be fused with other background images, which increases the richness of the dataset by replacing the object background. The YOLO v4 algorithm draws on the CutMix augmentation strategy [29] and uses Mosaic data augmentation, which stitches four images together. The idea is to read four images at a time, flip, scale, and change the color gamut of each, and combine them into a single image according to a specific ratio. The Mosaic data augmentation method proved unstable for bridge surface cracks, and the training results were only average. The data augmentation used in this paper was therefore to rotate the images so that the cracks were located at different positions.
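When an image is rotated for augmentation, its bounding-box labels have to be rotated with it. The paper does not state which rotation angles were used, so the following is only a minimal sketch that assumes a 90° clockwise rotation and boxes stored as (xmin, ymin, xmax, ymax); the function name rotate_box_90_cw and the sample box are illustrative.

```python
def rotate_box_90_cw(box, img_h):
    """Map a box (xmin, ymin, xmax, ymax) to its position after the image
    is rotated 90 degrees clockwise. A point (x, y) moves to (img_h - y, x),
    and the rotated image has width img_h (assumption: continuous coordinates)."""
    xmin, ymin, xmax, ymax = box
    return (img_h - ymax, xmin, img_h - ymin, xmax)


# Hypothetical crack box in a 416 x 416 image
box = (10, 20, 110, 60)
width, height = 416, 416
for step in range(1, 5):
    box = rotate_box_90_cw(box, height)
    width, height = height, width  # width and height swap after each rotation
    print(f"after {step * 90:3d} degrees: {box}")
```

Four successive rotations return the box to its starting position, which is a convenient sanity check for the coordinate mapping.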

Data labeling is the key to training the detection model, and the quality of labeling determines the accuracy. In this study, the labelimg tool was used to manually create image labels and produce a bridge surface crack dataset in PASCAL VOC format. A total of 4033 images were labeled, as shown in Fig. 3. From the rectangular boxes, the labelimg tool generates XML files that record crack category, size, and location. During labeling, other objects should be kept out of the rectangular boxes, because including them leads to a high false detection rate. Longer cracks are marked with multiple rectangular boxes, with the crack kept as close as possible to the middle of each box. In addition, the dimensions of the rectangular boxes are kept as consistent as possible.
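To make the label format concrete, the sketch below reads one PASCAL VOC annotation with Python's standard xml.etree.ElementTree module. The file name and box coordinates in SAMPLE_XML are invented for illustration, and parse_voc is a hypothetical helper, not part of labelimg.

```python
import xml.etree.ElementTree as ET

# Invented example of the XML that a VOC-style labeling tool writes
SAMPLE_XML = """<annotation>
  <filename>crack_0001.jpg</filename>
  <size><width>416</width><height>416</height><depth>3</depth></size>
  <object>
    <name>crack</name>
    <bndbox><xmin>52</xmin><ymin>140</ymin><xmax>180</xmax><ymax>268</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    labels = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        labels.append((obj.find("name").text, coords))
    return labels


print(parse_voc(SAMPLE_XML))
```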

Fig. 3 Bridge surface crack image mark

Detection based on lightweight convolutional networks

YOLO v4 is an upgraded version of YOLO v3 [30]. Its detection idea is similar to that of YOLO v3, using three feature layers for classification and regression prediction. YOLO v4 has been optimized in data processing, training methods, activation functions, network structure, and other aspects. The network structure of the YOLO v4 algorithm is shown in Fig. 4. It mainly includes the feature extraction network CSPDarknet53, the spatial pyramid pooling network SPP that strengthens feature extraction, the feature fusion network PANet, and the final prediction network YOLO Head. In this paper, YOLO v4 was applied to bridge surface images, where it could accurately detect the cracks and determine their locations.

Fig. 4 Network architecture of bridge crack detection method based on YOLO v4

YOLO v4 relies on convolutional networks for feature extraction, which is typically computationally expensive, so it is hard for the original backbone to run in real time. Therefore, CSPDarknet53 was replaced with several lightweight networks. DenseNet [31], MobileNet v1 [32], MobileNet v2 [33], MobileNet v3 [34], and GhostNet [35] were selected as backbones that could reduce the number of parameters and accelerate the execution of the model. Among the selected lightweight networks, DenseNet establishes dense connections between earlier and later layers to achieve feature reuse in the channel dimension, enabling better bridge crack detection with fewer parameters and less computation. The network structure of DenseNet is shown in Fig. 5. In the Dense Block, each black dot stands for a convolutional layer, black lines represent the flow of data, and the input of each layer is the channel-wise concatenation of the outputs of all the layers that came before it.

Fig. 5 DenseNet network architecture

MobileNet uses depthwise separable convolution as its basic unit instead of standard convolution. Depthwise separable convolution divides the process into two steps, channel-by-channel (depthwise) convolution followed by point-by-point (pointwise) convolution, which reduces redundant computation. Suppose the input feature map of bridge cracks has size X × Y × N, the required output feature map has size X × Y × M, the kernel size is 3 × 3, and the padding is 1. The total computation of standard convolution is shown in Eq. (1).

$$ P1 = X \times Y \times N \times 3 \times 3 \times M. $$
(1)

The channel-by-channel convolution works as follows: first, for a single input channel of size X × Y, a 3 × 3 convolution kernel performs a dot-product summation to produce one output channel of size X × Y. Then, for all N input channels, N 3 × 3 convolution kernels produce an output of size X × Y × N. The total computation of the channel-by-channel convolution followed by a point-by-point 1 × 1 convolution is shown in Eq. (2).

$$ P2 = X \times Y \times N \times 3 \times 3 + X \times Y \times N \times M. $$
(2)

The ratio of the two computation costs, shown in Eq. (3), indicates that the overall computation of depthwise separable convolution is about nine times smaller than that of standard convolution when M is large, so this lightweight convolution significantly reduces the computation of the convolution process.

$$ \frac{P2}{P1} = \frac{1}{M} + \frac{1}{9} \approx \frac{1}{9}. $$
(3)
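Eqs. (1) to (3) can be checked numerically. The short sketch below evaluates both costs for one layer; the layer size (52 × 52 × 128 input, 256 output channels) is an assumed example and does not come from the paper.

```python
def standard_conv_cost(x, y, n, m, k=3):
    """Multiplications for a standard k x k convolution, Eq. (1)."""
    return x * y * n * k * k * m


def separable_conv_cost(x, y, n, m, k=3):
    """Depthwise k x k convolution plus pointwise 1 x 1 convolution, Eq. (2)."""
    return x * y * n * k * k + x * y * n * m


# Assumed layer size, for illustration only
x, y, n, m = 52, 52, 128, 256
p1 = standard_conv_cost(x, y, n, m)
p2 = separable_conv_cost(x, y, n, m)
ratio = p2 / p1

print(f"P1 = {p1:,}  P2 = {p2:,}")
print(f"P2 / P1 = {ratio:.4f}  (1/M + 1/9 = {1 / m + 1 / 9:.4f})")
```

For small M the 1/M term is not negligible, so the saving is somewhat less than a factor of nine in layers with few output channels.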

GhostNet is an efficient neural architecture built on the Ghost module, which generates more feature maps using fewer parameters. The Ghost module can be used as a plug-and-play component to upgrade existing convolutional neural networks. To evaluate the generalization capability of GhostNet, its authors conducted experiments on the COCO dataset, applying GhostNet as the feature extraction network in the Faster R-CNN and RetinaNet detectors. The experiments showed that, on both the one-stage RetinaNet and the two-stage Faster R-CNN frameworks, GhostNet achieved mAP similar to that of MobileNet v2 and MobileNet v3.

Improved bridge surface crack detection method

Anchor optimization

Anchors first appeared in Faster R-CNN. They are essentially a series of prior boxes of varying size and aspect ratio, uniformly distributed over the feature map, and the network uses features to predict the class of each anchor and its offset from the ground-truth object box. Rather than searching manually for the width-height distribution of the labels, a suitable set of anchors can be obtained directly by clustering the labels of the training set. The default anchors of the YOLO v4 object detection algorithm are generic sizes computed by clustering on the COCO dataset. Since the object sizes and aspect ratios in the COCO dataset vary widely, the sizes of these anchors also vary widely. The research object of this paper was a self-built dataset in which the shape of cracks does not differ much between images, so the anchor sizes can be calculated from the actual shapes and sizes of the cracks. We used the K-means clustering algorithm on the self-built bridge crack dataset to obtain suitable initial candidate boxes for bridge crack detection [36]. Since the Euclidean distance leads to certain deviations in the results, the distance shown in Eq. (4) is used for clustering.

$$ D(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}), $$
(4)

where the intersection over union (IOU) is a standard measure of the overlap between two boxes and hence of detection accuracy. In our study, the number of preselected (anchor) boxes was set to nine to balance detection speed and accuracy.
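The clustering with the distance of Eq. (4) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: boxes are reduced to (width, height) pairs aligned at a common corner, centroids are updated with the cluster mean (a common choice that the paper does not specify), and the box sizes are random synthetic values.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """K-means on (w, h) pairs using D = 1 - IOU as the distance, Eq. (4)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            nearest = min(range(k), key=lambda i: 1 - iou_wh(b, centroids[i]))
            clusters[nearest].append(b)
        updated = []
        for members, old in zip(clusters, centroids):
            if members:  # mean update (assumption); an empty cluster keeps its centroid
                updated.append((sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members)))
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


# Synthetic label sizes standing in for the crack dataset
data_rng = random.Random(1)
label_sizes = [(data_rng.uniform(20, 200), data_rng.uniform(20, 200)) for _ in range(200)]
for w, h in kmeans_anchors(label_sizes, k=9):
    print(f"{w:6.1f} x {h:6.1f}")
```

Using 1 - IOU instead of Euclidean distance keeps large boxes from dominating the clustering error, which is the deviation the text refers to.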

Improved network architecture

YOLO is a classic algorithm that first used the idea of regression to perform classification and localization directly with a one-stage network at high speed. Subsequently, YOLO v2 [37] and YOLO v3 improved detection accuracy and speed. They can achieve 3 to 4 times the inference speed of other detection algorithms at the same accuracy, so they are widely used in industry. YOLO v4 made a series of optimizations to YOLO v3 that further improved the mAP and FPS on the COCO dataset.

Because the YOLO v4 object detection algorithm was not developed for bridge surface cracks, our study adapted YOLO v4 to their characteristics. First, the improved YOLO v4 algorithm simplified the structure of the crack feature extraction network and replaced the Mish activation function with the ReLU6 activation function. ReLU6 caps the maximum output at six, which preserves good numerical resolution even on low-precision mobile devices. To balance the speed and precision of the network, the structure of the crack feature extraction network was adjusted to better suit the features of the collected crack data, reducing the number of extracted feature layers from three to two. Second, the original YOLO v4 network uses the SPP structure to enlarge the receptive field; given the characteristics of bridge surface cracks, we removed the SPP structure. Finally, the PANet network was simplified to the basic FPN structure. FPN passes deep semantic information down to the shallow layers to supplement their weaker semantic information, resulting in high-resolution features with strong semantics. Different feature maps were used for regions of interest (RoI) of different sizes, with the RoI of long cracks extracted on the deep feature maps and the RoI of short cracks on the shallow feature maps. The improved network structure is shown in Fig. 6.
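The difference between the two activation functions mentioned above is easy to see on scalar inputs. The sketch below uses the standard definitions, ReLU6(x) = min(max(0, x), 6) and Mish(x) = x · tanh(ln(1 + e^x)); the sample inputs are arbitrary.

```python
import math


def relu6(x):
    """ReLU6: clip the activation to the range [0, 6]."""
    return min(max(0.0, x), 6.0)


def mish(x):
    """Mish: x * tanh(softplus(x)); smooth and unbounded above."""
    return x * math.tanh(math.log1p(math.exp(x)))


for x in (-2.0, 0.5, 3.0, 8.0):
    print(f"x = {x:5.1f}   relu6 = {relu6(x):.3f}   mish = {mish(x):.3f}")
```

ReLU6 needs only a comparison and a clip, and its bounded output range is what makes it friendly to half-precision or quantized inference.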

Fig. 6 Network architecture of proposed bridge crack detection method

Attention mechanism

The attention mechanism was applied to enhance the ability to extract features from bridge surface cracks and to enlarge the global receptive field of the feature extraction network. It effectively reduces the influence of environmental background and occlusion on bridge crack detection and helps avoid false and missed detections. Jie et al. [38] proposed SENet, which first performs a squeeze operation on the feature map obtained by convolution to acquire global features at the channel level. Li et al. [39] proposed SKNet, an improvement on SENet that uses the global features of each input image to weight convolutional kernels of different sizes by importance, making it more flexible than the Inception structure. Wang et al. [40] proposed the efficient channel attention (ECA) module. They contend that the channel attention prediction in SENet has side effects and that capturing all channel dependencies is inefficient and unnecessary. The idea of the ECA module is to remove the fully connected layers of the original SE module and to learn the channel weights directly with a 1D convolution on the features after global average pooling. As shown in Fig. 6, we introduced attention mechanisms on the two effective feature layers extracted from the backbone network and on the results after upsampling.
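The ECA idea described above (global average pooling, a 1D convolution across channels, a sigmoid gate) can be illustrated without a deep learning framework. The following is a plain-Python sketch on nested lists: the kernel weights are fixed example values, whereas in a real ECA module they are learned and the kernel size is chosen from the channel count.

```python
import math


def eca(feature, kernel):
    """Apply ECA-style channel attention.

    feature: list of C channels, each an H x W nested list of floats.
    kernel:  odd-length list of 1D convolution weights (assumed values here).
    Returns (reweighted feature, per-channel attention weights)."""
    channels = len(feature)
    # Squeeze: global average pooling gives one value per channel
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # 1D convolution across neighbouring channels, zero padded at both ends
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
             for i in range(channels)]
    # Sigmoid gate, then rescale every channel by its weight
    weights = [1.0 / (1.0 + math.exp(-v)) for v in mixed]
    scaled = [[[v * w for v in row] for row in ch]
              for ch, w in zip(feature, weights)]
    return scaled, weights


# Toy feature map: 4 channels of size 2 x 2, channel i filled with the value i
toy = [[[float(i)] * 2 for _ in range(2)] for i in range(4)]
_, attn = eca(toy, kernel=[0.25, 0.5, 0.25])
print([round(w, 3) for w in attn])
```

Because the 1D kernel only looks at a few neighbouring channels, the module adds a handful of parameters instead of the two fully connected layers used by SE.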

Model training tricks

Multi-scale detection is one of the most active research topics in object detection today, and handling multiple scales is a major difference between object detection and image classification. Crack classification usually deals with a single scale, whereas bridge crack detection must find cracks of different scales, which requires the model to be robust to scale. Long cracks are easier to detect than short ones because of their large area and rich features. In practice, however, short cracks tend to make up a larger proportion, and a bridge needs maintenance as soon as short cracks appear. The image pyramid in digital image processing scales the input image to multiple sizes, computes a feature map at each scale separately, and performs detection on each. This study used a multi-scale training method in which several input sizes were defined for the bridge crack images. During training, one scale was randomly selected from this set, and the input bridge crack images were resized to that scale and fed into the network.
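The random scale selection can be sketched as follows. The paper does not list the input sizes it used, so the SCALES list (multiples of 32 around 416) is an assumption, and pick_scale and rescale_box are hypothetical helper names; the box labels must be rescaled together with the image.

```python
import random

# Assumed candidate input sizes; the actual set used in the paper is not given
SCALES = [320, 352, 384, 416, 448, 480, 512]


def pick_scale(rng):
    """Randomly choose one input size for the next training batch."""
    return rng.choice(SCALES)


def rescale_box(box, old_size, new_size):
    """Scale a box (xmin, ymin, xmax, ymax) when a square image is resized."""
    factor = new_size / old_size
    return tuple(v * factor for v in box)


rng = random.Random(0)
box = (52, 140, 180, 268)  # hypothetical label in a 416 x 416 image
for batch in range(3):
    size = pick_scale(rng)
    print(f"batch {batch}: input {size} x {size}, box {rescale_box(box, 416, size)}")
```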

Because the dataset is small, the model was initialized from pre-trained weights obtained on a public dataset. Transfer learning was applied to the training of the models based on lightweight convolutional neural networks. Transfer learning prevents the backbone weights from being too random at the start, which improves training. The front end of the feature extraction network was first frozen for 60 epochs of training, with an initial learning rate of 0.001 that was adjusted automatically by the cosine annealing algorithm. In this way, a pre-trained model with strong feature extraction capability performs feature extraction first, and its parameters are then fine-tuned. Additionally, label smoothing was added to the training to prevent the model from becoming overconfident in its predicted labels, which would lead to poor generalization.
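Two of the tricks above have simple closed forms. The sketch below shows a cosine annealing schedule over the 60 frozen epochs with the stated initial learning rate of 0.001, and label smoothing on a one-hot target. The minimum learning rate of 0 and the smoothing factor eps = 0.05 are assumptions, since the paper does not report them.

```python
import math


def cosine_lr(epoch, total_epochs=60, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing: decay from lr_max to lr_min over total_epochs.
    lr_min = 0 is an assumed value."""
    cosine = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


def smooth_labels(one_hot, eps=0.05):
    """Label smoothing: move eps of the probability mass to a uniform prior.
    eps = 0.05 is an assumed value."""
    k = len(one_hot)
    return [y * (1 - eps) + eps / k for y in one_hot]


for epoch in (0, 15, 30, 45, 60):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.6f}")
print(smooth_labels([1.0, 0.0]))
```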

Evaluation and result analysis

Evaluation metrics

For an object detection model, specific rules are needed to evaluate its quality and thus to select the best model. The output of a bridge surface crack detection model is unstructured: the number, location, and size of the output cracks are not known in advance. For bridge surface cracks, detection quality can be judged from how well the prediction box fits the ground truth box, and the intersection over union (IOU) is generally used to quantify the fit. As shown in Fig. 7, the IOU is the ratio of the area of the intersection of the two boxes to the area of their union. A large IOU indicates that the two boxes overlap well. A threshold on the IOU is usually chosen to decide whether a predicted box is correct: when the IOU of the two boxes is greater than the threshold, the prediction is considered a valid detection.
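As a concrete illustration of the IOU computation, the sketch below handles two axis-aligned boxes given as (xmin, ymin, xmax, ymax). The example boxes and the 0.5 threshold are illustrative values, not settings reported in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(prediction, ground_truth)
print(f"IOU = {score:.3f}, valid at threshold 0.5: {score > 0.5}")
```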

Fig. 7 IOU calculation process

Since there are two types of labels in the image, background and object, and the prediction boxes are divided into correct and incorrect, four types of samples are generated in the evaluation: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). To evaluate the performance of the model, we used the generic evaluation metrics precision, recall, and F1, calculated as shown in Eqs. (5), (6), and (7).

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, $$
(5)
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, $$
(6)
$$ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. $$
(7)
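Eqs. (5) to (7) translate directly into code. In the sketch below the TP, FP, and FN counts are made-up numbers for illustration; the last line plugs the precision and recall reported in this paper (93.96% and 90.12%) into Eq. (7) as a consistency check on the reported F1.

```python
def precision(tp, fp):
    """Eq. (5): fraction of predicted cracks that are real cracks."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Eq. (6): fraction of real cracks that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Eq. (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up counts for illustration only
tp, fp, fn = 90, 10, 30
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f1_score(p, r):.3f}")

# Consistency check with the values reported in the paper
print(f"F1 from reported precision and recall: {f1_score(0.9396, 0.9012):.4f}")
```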

For bridge crack data samples, a true positive is a prediction box correctly matched with a label box, with an IOU greater than the set threshold. A false positive is picture background detected as a bridge crack. A false negative is a crack that the model should have detected but did not. A true negative means that the model correctly detects nothing in the background. The discrimination of sample bridge surface cracks is shown in Fig. 8. In addition, to reflect the advantages of the proposed bridge crack detection method, the number of parameters, model size, FLOPs, and FPS of the model are given for reference. The size of the model is obtained by adding 1 MB to the hard disk space occupied by the trained weights file.

Fig. 8 Example of positive and negative sample discrimination

Comparison of experiment and result analysis

To further validate the effectiveness of the proposed method, we compared it with Faster R-CNN, YOLO v3, YOLO v4, YOLO v4-MobileNet v1 (V1), YOLO v4-MobileNet v2 (V2), YOLO v4-MobileNet v3 (V3), YOLO v4-DenseNet (DN), and YOLO v4-GhostNet (GN). All experiments were run on the experimental platform listed in Table 1. After the experiments were completed, we recorded the evaluation metrics for all methods. The experimental results show that the YOLO algorithms significantly outperform the Faster R-CNN algorithm in detection speed and also lead in precision on the self-built dataset of this paper. When a lightweight backbone network is used in YOLO v4, the precision is slightly reduced, but the memory occupied by the model is significantly reduced and the detection speed improves markedly. As shown in Fig. 9, the precision, recall, and F1 of the proposed crack detection method are 93.96%, 90.12%, and 92%, respectively, which are higher than those of the other methods. Further, Table 2 shows that the model takes up only 23.4 MB and that the detection speed can reach 140.2 FPS.

Table 1 Experimental platform

Fig. 9 Evaluation results of crack detection

Table 2 Comparison of nine methods

Real image test

After the model was trained, we tested its detection performance on real images. To make the model broadly applicable, the training images include a lot of noise, such as varying light, wetness, and shadows. The test results are shown in Fig. 10: the proposed method accurately detects bridge surface cracks regardless of noise such as light conditions, shadows, and wet surfaces. In Fig. 10, the red boxes mark the detected cracks, and the object category and score are indicated on each box. Multi-scale training, introduced among the model training tricks, increases the diversity of object scales. Even when the size of the input image differed from that of the training set, the cracks were still accurately detected, indicating that the overall generalization performance of the model can meet practical requirements.

Fig. 10 Image test results of bridge surface crack detection

Model deployment

In realistic application scenarios, visual detection is video-based, particularly for models installed on terminal devices to achieve real-time detection. We deployed the model on an NVIDIA Jetson Xavier NX to verify that the proposed method can run in real time on edge devices with limited computing power. The model could be applied to unmanned aerial vehicles, portable inspection devices, small robots, intelligent robots, smart cameras, and other devices. The specifications of the edge device used for deployment and testing were as follows: NVIDIA Carmel ARMv8.2 CPU and 384-core Volta GPU. After training, the model files were optimized and compiled on the edge device. The model was converted from PyTorch to ONNX and then to TensorRT (TRT), using half-precision computation at conversion time. Finally, the API call was implemented in a C++ program. We simulated real-time detection using video detection. The video was taken with an iPhone 8 Plus camera, with a duration of 16 s and a resolution of 720 × 1280 pixels. The test results are shown in Fig. 11. The trained model detected the crack areas in the video well.

Fig. 11 Video test results

Discussion

Non-maximum suppression and overfitting

Current object detection algorithms often output more than one candidate box for the same real object in order to guarantee recall. Since the redundant candidate boxes affect detection accuracy, the overlapping candidate boxes must be filtered out with non-maximum suppression (NMS) to obtain the best prediction output. The most basic NMS method uses high-scoring boxes to suppress low-scoring boxes that overlap them heavily. However, this may lead to missed detections. As shown in Fig. 12, there are five candidate boxes in the image, and C and D overlap. Since D has a lower score than C, D would be treated as a false positive in the evaluation if it were kept, reducing the model's accuracy. The quality of C is clearly higher than that of D, so the ideal output is C rather than D, and the candidate box D should be suppressed.
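The basic NMS procedure described above can be sketched as a greedy loop. This is the plain IOU-based variant, not the variant evaluated in the paper; the five boxes, their scores, and the 0.5 threshold are invented to mirror the A to E example, with D overlapping C and scoring lower.

```python
def iou(a, b):
    """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Invented candidates A..E; D overlaps C heavily and has a lower score
names = ["A", "B", "C", "D", "E"]
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 60, 10), (42, 0, 62, 10), (80, 0, 90, 10)]
scores = [0.90, 0.80, 0.95, 0.60, 0.70]
print([names[i] for i in nms(boxes, scores)])
```

The missed-detection risk mentioned in the text arises when two distinct cracks lie close together: the lower-scoring one is suppressed even though it is a genuine object.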

Fig. 12 Schematic diagram of the NMS processing

Many variants of non-maximum suppression exist. This study tested the non-maximum suppression techniques described in YOLO v4 during crack image prediction and found that CIOU [41] was more effective. CIOU takes the distance between object and anchor, the overlap rate, the scale, and a penalty term into consideration, which makes box regression more stable and less likely to diverge during training.

Since the birth of deep learning, efforts to address model overfitting have never stopped. For object detection algorithms, model overfitting is a frequent phenomenon. Fig. 13 depicts how the errors on the training and validation sets vary during training. At the beginning of training, the validation error declines together with the training error. However, once the number of training steps exceeds the optimal iteration, the training error keeps decreasing while the validation error gradually increases, and overfitting occurs from that point. The best training iteration is the one at which the validation error reaches its global minimum.

Fig. 13 Overfitting in deep learning

This research used several mainstream methods to limit model overfitting. While training on the bridge surface crack dataset, the model was evaluated on the validation set at regular intervals, and training was stopped when the validation error began to increase. The number of bridge surface crack images we acquired is not very large, so a model with strong fitting capability can easily overfit. For this reason, the proposed method streamlines the model's structure and decreases its complexity. We input a picture with features that resemble a bridge surface crack, and the output was the unchanged original image with no detections, as shown in Fig. 14. It follows that the model does not appear to be considerably overfitted. Other effective techniques, such as data augmentation and regularization, can also be used to prevent overfitting.
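The stopping rule described above is a form of early stopping. The sketch below is one possible formulation: it tracks the best validation loss and stops once the loss has failed to improve for a fixed number of evaluations. The patience value and the loss curve are assumptions; the paper only states that training stopped when the validation error began to rise.

```python
def early_stop(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.
    Training stops once `patience` evaluations pass without improvement
    (patience is an assumed hyperparameter)."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Invented validation curve: improves, then rises as the model overfits
curve = [1.00, 0.80, 0.60, 0.50, 0.55, 0.60, 0.70, 0.75]
best, stopped = early_stop(curve)
print(f"best weights at epoch {best}, training stopped at epoch {stopped}")
```

In practice the weights saved at the best epoch are the ones kept for deployment, not the weights at the stopping epoch.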

Fig. 14 Image for testing

Sample imbalance problems in bridge surface crack detection

Depending on the detection algorithm and the dataset, object detection may suffer from imbalance between positive and negative samples, between hard and easy samples, and between categories. The imbalance between hard and easy samples exists in the detection of bridge surface cracks. Based on how easy they are to learn and how much they overlap with the labels, all samples can be classified into four categories: easy positive samples, hard positive samples, easy negative samples, and hard negative samples, as shown in Fig. 15.

Fig. 15 Schematic diagram of various samples

In the photos we took in the field, most samples are easy samples: either negative samples with no overlap with real cracks or positive samples with significant overlap with real cracks. Although the loss of each individual easy sample is tiny, the sum over all of them is large enough to affect the convergence and precision of the model. Hard samples are boxes in the transition zone between foreground and background that are less readily identified. As shown in Fig. 16, our model does not detect the minute cracks in a complicated background. During network training, the loss of hard samples is relatively high. We want the model to learn from these hard samples, as this portion of the training can enhance the accuracy of crack detection. In the future, we will attempt to introduce Focal Loss [42] with dynamically updated weights to address the issue of sample imbalance.
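To show why Focal Loss helps with easy versus hard samples, the sketch below implements its standard binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), with the commonly used defaults α = 0.25 and γ = 2. This is the fixed-weight version only; the dynamically updated weights mentioned as future work are not modeled, and the probabilities are illustrative.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted crack probability p and label y (1 or 0).
    The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    return -math.log(p if y == 1 else 1.0 - p)


for name, p in (("easy positive", 0.95), ("hard positive", 0.30)):
    print(f"{name}: CE = {cross_entropy(p, 1):.4f}, FL = {focal_loss(p, 1):.4f}")
```

The easy positive contributes almost nothing under focal loss, while the hard positive keeps a much larger share of its cross-entropy loss, so the many easy samples no longer dominate the total.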

Fig. 16 Result of complex sample image testing

Conclusion

Bridge surface crack detection is integral to bridge health monitoring and is of great importance for automation in civil engineering and infrastructure projects. Manual visual inspection of bridges is labor-intensive, and working in difficult terrain and under severe weather is dangerous. To address this, this paper introduced an object detection method to detect cracks on the surface of bridges. The first attempt used the YOLO v4 algorithm as the baseline. Lightweight networks, namely DenseNet, MobileNet, and GhostNet, replaced the original backbone feature extraction network. After training the models with the replaced network structure, the experimental results showed that the YOLO v4 bridge crack detection algorithms based on lightweight networks occupied about five times less memory than the original model, but the accuracy of bridge surface crack detection was slightly reduced. An improved YOLO v4 crack detection algorithm was therefore designed to balance detection accuracy and speed. Before model training, the anchors were optimized, an attention mechanism that performed well across multiple experiments was included, and the network's general structure was modified based on the crack characteristics. After the model training was completed, non-maximum suppression was used in the prediction. In response to the problem of sample imbalance, we outlined what needs to be accomplished in future work. Experimental results showed that the proposed method's precision, recall, and F1 values were 93.96%, 90.12%, and 92%, respectively, higher than those of other advanced methods. The model occupied only 23.4 MB of memory and could reach 140.2 FPS, making it possible to deploy on mobile hardware platforms with limited computational power and to meet the requirements of practical applications.