1 Introduction

Civil infrastructure, including bridges, roads and tunnels, remains vulnerable to deterioration caused by disasters, cyclical loading and harsh environmental factors [1]. Timely detection of damage and maintenance of civil infrastructure are indispensable for ensuring human safety and reducing the costs associated with infrastructural damage [2, 3]. As reported in the literature, structural deficiency, together with aging and the resulting failures, is associated with the damage that ultimately leads to the collapse of a large share of bridges (~ 46%) [4]. Effective detection and maintenance of infrastructure exposed to various types of damage, such as corrosion and cracks, are thus important [5].

One alarming and frequently occurring infrastructural damage is the appearance of cracks. Generally, cracks initiate on the surfaces of the concrete structures mainly due to stress, fatigue, cyclic loading, poor construction, deterioration/corrosion, moisture, temperature effects, shrinkage and the use of incongruous construction materials and strategies [6,7,8]. Various structures including bridges, tunnels, railway tracks, roads, buildings, pavements, aircraft, and automobiles are prone to cracks [9]. Cracks are the earliest signs of degradation that can lead to serious damage if allowed to penetrate or left unmaintained or unrectified [9, 10].

Broadly, cracks can be described in terms of their time of occurrence, width, the component used for construction and overall activeness. Classical feature-based methods, including SIFT, ORB and SURF, require extensive manual supervision and do not allow automated crack detection. Deep learning can reduce this overhead in identifying cracks, thus permitting automatic labelling of whether a crack is active or dormant [11].

Once cracks have developed in a structure, they can either remain dormant or be active. Dormant cracks stay unchanged over time; they include minor cracks, thin cracks, line-like cracks, complex cracks and sealed cracks. Active cracks, on the other hand, progress over time through widening, lengthening or spreading in more directions. Depending on the direction and type of change, active cracks include reflection cracks, transverse cracks and miscellaneous cracks [8]. Cracks in concrete structures can be described as, and can lead to, the partial or complete separation of concrete into distinct parts upon fracture or breakage [12]. Therefore, the early detection of cracks with effective methods is an essential measure for sustaining the structural safety and health of engineered structures. Manual crack inspection is a tedious process that demands extra effort and time. It is also prone to subjective assessment of deterioration and to inadequate observations by crack inspectors [13, 14].

The advances achieved in computer vision and image processing techniques enable the replacement of manual crack detection methods with more effective automated inspection procedures [15]. The application of computer vision-based techniques to image segmentation [10, 16, 17], colour tracking [18], curvilinear structures [19] and crack detection [20,21,22] has been extensively reported in the literature.

Notably, the detection and localization of cracks are very complex: numerous visual patterns are associated with cracks, and it is difficult to devise a single method that can be applied to different cracks [10]. Traditional image processing techniques alone are not sufficient to deal with different scenes or to distinguish cracks under different conditions (e.g., light spots, shadows and edges). The literature indicates that deep learning-based methods can effectively overcome the limitations of traditional computer vision methods in extracting and learning high-quality features [12]. Therefore, an effective combination of computer vision techniques with machine/deep learning approaches is needed to ensure efficient and automatic detection and localization of cracks [23, 24].

The increased use of machine learning approaches such as neural networks in place of traditional vision-based approaches has encouraged the exploration of similar methods for crack detection in concrete. More recently, deep learning methods have been the focus for crack detection; in particular, Convolutional Neural Networks (CNNs) have gained considerable importance and applicability owing to their high performance on many sophisticated computer vision tasks [25, 26], including image classification, image segmentation and object detection [9, 10, 27, 28]. A major limitation of traditional crack detection systems is that the applied method is highly specific to a particular situation or scene [11, 12]. In addition, methods such as FoSA, FFA and CrackTree work considerably well for thinner cracks but fail when applied to wider cracks [13]. Moreover, feature detection fails in varying instances, which leads to feature extraction that does not generalize [13].

The use of CNNs brings powerful hierarchical features, automatic feature learning, suitability for grid-like image topology, differentiation of multiple classes and improved detection of cracks without the need for additional image processing techniques [17]. Deep learning models also improve detection and classification performance by using stacked convolutional layers to exploit image features at different resolutions [29]. The pooling process and the sparsely connected neurons within a CNN require fewer computations than an artificial neural network (ANN) [29]. CNNs are designed to deal with visual data and support capabilities including visual object recognition, object detection and image classification. CNNs are among the most efficient methods used for image recognition [12]. CNNs are more valuable than ANNs for processing visual information, with the latter being better suited to tabular and textual data. CNNs are also faster than ANNs when dealing with and sorting huge data sets.

Recent applications of CNNs in the literature include the automatic detection of cracks in roads, tunnels and gas turbines [13]. However, unlike cases where the material surface is more homogeneous, the detection of concrete surface defects (which occur on inhomogeneous surfaces) requires the deep architecture to be carefully configured. This calls for an extensive data set and for consideration of the variable conditions that lead to surface imperfections (such as stress, cyclic loading, poor construction, deterioration/corrosion, moisture, temperature effects, shrinkage and the use of incongruous construction materials), which is essential for dealing with real-world problems [22].

In this study, we propose a robust CNN-based classifier with 10 convolution layers for detecting cracks in the concrete surface of bridges; CycleGAN is used to improve detection accuracy and avoid data augmentation. The method is robust to factors such as lighting, lighting-induced noise, blur, casting and shadow-based noise, and provides wider adaptability. Unlike traditional approaches, the proposed approach does not require explicit feature extraction and calculation; rather, it learns image features automatically.

The paper is organized as follows. Section 2 presents the methodology, with a brief overview of the case study, the data collection and pre-processing of the images, and the proposed methodology for crack detection. Section 3 elaborates the experimental analysis and the performance metrics used for evaluating the techniques, followed by the results and discussion. Section 4 summarises the key results of the study, the performance of the suggested framework based on a deep hierarchical CNN architecture along with Cycle-GAN for predicting crack segmentation, and the limitations of the study.

2 Materials and Method

In the current study, we propose a robust CNN-based architecture that includes a cycle-consistent generative adversarial network (Cycle-GAN) for detecting cracks on infrastructure such as bridges. Over the past years, Cycle-GAN has seen considerable uptake in deep learning methods [30]. Owing to this broad applicability, crack detection on civil infrastructure can be treated effectively as an image-to-image translation problem. Cycle-GAN allows network training without ground-truth labelling because it can translate crack images into an image set whose patterns resemble ground-truth images [18].

The proposed approach will assist in the robust, efficient and cost-effective inspection of infrastructural health and the maintenance of infrastructural damage. Additionally, the proposed deep neural network framework for automating the crack detection process can be scaled easily to edge devices (e.g., Coral Dev, Jetson Nano). In this study, the proposed CNN-based architecture was applied to the data set obtained from the Bolte Bridge in Melbourne, Australia.

3 Case Study

For the case study, the Bolte Bridge in Melbourne, Australia (Fig. 1a) was selected. The Bolte Bridge is a large twin cantilever road bridge carrying a total of 8 lanes of traffic. It is located on the west side of the central business district (CBD), spanning the Yarra River and Victoria Harbour (Fig. 1b). The total length of the bridge is 490 m and it comprises four spans: two side spans that are 72 m long and two main spans that measure 173 m each. The data was collected by VERIS, a leading provider of spatial data services (Fig. 2a). VERIS provides an integrated approach across the project life cycle, from the planning phase to the final delivery phase. It uses innovative technologies to conduct surveys and damage assessments of infrastructure such as railways, bridges, roads and buildings. Aerial imagery of the Bolte Bridge was captured using unmanned aerial vehicles (UAVs); a DJI M200 UAV was used for surveying the region (Fig. 2b). A machine learning-based algorithm was developed for crack detection. Images would typically be obtained from drones in cases where access is limited (e.g., due to the span of the bridge, the presence of traffic or flooding), so that cracks and vulnerabilities in the bridge infrastructure can be identified automatically.

Fig. 1

a The Bolte Bridge, Melbourne, Victoria. b Geographical location of the Bolte Bridge

Fig. 2

a Field sampling day. b Specification of DJI M200

4 Data Collection and Pre-processing of Images

The crack detection procedure began with the collection of 2D images to form the required dataset. Model training and testing were performed on a single machine with an Intel Core i9-10900KF CPU (10 × 3.70 GHz, 20 MB L3 cache, 125 W) and a GeForce RTX 2080 Ti GPU. The quantitative and qualitative results were observed and compared with state-of-the-art methods.

The images of the bridge were obtained using a UAV carrying an onboard digital camera (Fig. 2b). In addition, images from the public dataset CRACK9001 were gathered for training and testing purposes. A total of 2097 images were captured, with dimensions of 4864 × 3648 pixels. The images processed by the deep learning model were augmented through cropping, colour modification, geometric transformation, noise injection and flipping. The images in the dataset contained three main types of cracks: simple cracks, hairline cracks and artificial marking cracks, as shown in Fig. 3. Simple cracks usually result from infrastructure settling onto its foundation, whereas hairline cracks are very small, shallow cracks of about 0.003 inches in width that mainly emerge due to plastic shrinkage [31].

Fig. 3

The crack types used in the dataset include a simple crack, b hairline crack and c artificial marking crack

After finalizing the dataset, the collected crack images were preprocessed to remove noise and undesirable background, after which the image brightness was adjusted. Cropping was performed to remove unwanted background such as grass, water, sky, buildings and trees. The data set was divided into two label levels: crack and non-crack (structures without cracks). The overall percentages of pixels for all images (with or without cracks) are shown in Table 1, which indicates that crack regions make up only a small percentage of the complete dataset.

Table 1 Percentages of the pixels for crack and non-crack images

In the complete data set, 2.93% of the pixels were significant crack pixels, 1.41% were weak crack pixels and 95.93% were non-crack pixels. The training and test sets contained 96.32% and 94.69% non-crack pixels, respectively, and 3.24% and 4.15% crack pixels, respectively (Table 1). Generally, a crack with a width of 1 to 5 pixels is considered a weak crack, whereas significant cracks are those more than 5 pixels wide. It was observed that thin cracks and surface cracks had different properties from wider cracks. Therefore, traditional post-processing methods (using length constraints, curvature and geometric features) are necessary to obtain complete and continuous thin cracks [32], which is a limitation of deep convolutional networks.
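The width rule stated above can be illustrated with a minimal sketch. The function names and the "non-crack" label for widths below one pixel are assumptions made for illustration; only the 1 to 5 pixel and greater-than-5 pixel thresholds come from the text.

```python
# Threshold taken from the text: widths of 1 to 5 pixels are "weak" cracks.
WEAK_MAX_WIDTH_PX = 5


def classify_crack_width(width_px):
    """Label a measured crack width (in pixels) as non-crack, weak or significant."""
    if width_px < 1:
        # Assumption: a width below one pixel means no crack is present.
        return "non-crack"
    if width_px <= WEAK_MAX_WIDTH_PX:
        return "weak"
    return "significant"


def width_histogram(widths):
    """Count how many measured widths fall into each category."""
    counts = {"non-crack": 0, "weak": 0, "significant": 0}
    for width in widths:
        counts[classify_crack_width(width)] += 1
    return counts
```

For example, `width_histogram([0, 0, 3, 7])` counts two non-crack measurements, one weak crack and one significant crack.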

For the crack images in the current study, the height and width distributions are presented for the two levels, crack and non-crack. Figure 4 illustrates the spatial distribution of the cracks; the width and height of the crack pixels were gathered using PyTorch and Weights & Biases (WandB) [33]. Along with crack images, the dataset also included pothole and water-staining images for training and testing purposes.

Fig. 4

The height, width, and spatial extent of the crack pixels in our dataset

For the current study, the crack pixel frequency was predicted, and the bounding boxes or labels were identified through spatial location analysis and the data distribution. The axes in Fig. 5 represent the size distribution, and they show that the spatial and frequency distribution of the crack pixels is neither skewed nor concentrated in one place. Rather, the crack pixels follow a Gaussian-like, well-distributed pattern (Fig. 5), which indicates that the pixel distribution in the selected crack dataset is devoid of bias.

Fig. 5

Comparison of generative and discriminative methods

A large portion of the dataset (roughly 54%) consists of drive-view images from the Road Damage Detection Challenge 2020.

The total dataset for this study includes 10,000 cracks. The health of the dataset is illustrated through plots. The crack locations follow a Gaussian distribution around the central region, where most cracks appear. Determining the size of a crack is difficult because of the perspective transformations that cracks can undergo. The only way to measure the size of a crack from an image with full certainty is to view it orthogonally. Moreover, the part of a crack that is close to the camera is fully visible, whereas parts further away may appear as a thin edge or look different because of the camera angle and the transformations of the crack, which can make detection difficult. The proposed methodology can potentially enable consistent structural analysis of cracks. The method can generalize across environmental settings but cannot account for shifts in perspective. The comparison of generative and discriminative methods is shown in Table 2.

Table 2 Comparing generative and discriminative methods

Figure 5 shows a comparison of the generative and discriminative results (Faster-RCNN and YOLOv5-s). Highly expressive deep CNNs with numerous parameters have brought considerable advances in image classification and processing [29]. However, a CNN tends to over-fit the features in its training set when those features do not generalize. In particular, training with an insufficient number of samples can lead to over-fitting [29]. Additionally, collecting abundant samples is exorbitantly costly, which has increased the use of data augmentation methods (e.g., flipping, resizing and random cropping) to enhance image variation and mitigate over-fitting [34]. The data augmentations used for label generation and crack detection in the overall training procedure of the proposed approach are presented in Table 3.

Table 3 Data augmentation details

5 Proposed Method

The overall workflow of the current research study is presented in Fig. 6.

Fig. 6

A holistic view of the framework proposed in the study

5.1 Per-pixel Segmentation

A pre-trained semantic segmentation model does not work on general images because it relies on associating a class label with each pixel of an image. Therefore, we used Crack9001 (a publicly available crack-detection dataset) to train SegNet to perform pixel-wise segmentation of the dataset captured by the UAV. Because the SegNet method displays limited accuracy and requires manual supervision, per-pixel annotation was used in the current study.

5.2 Baseline Design (BN)

The baseline network is built from max pooling, ReLU activation, concatenation and convolution operations (Fig. 7). It consists of three sections: contraction, bottleneck and expansion. To obtain high-precision results in semantic segmentation, it is vital to capture finer details while retaining semantic information. However, a limited dataset is a constraint when training a deep neural network. This can be overcome by using a pre-trained network and applying it to the desired dataset. The extensive data augmentation carried out in U-Net is another way to overcome this training issue. The key contribution of U-Net is the creation of shortcut connections. The performance of the U-Net can be enhanced by replacing the plain unit with the residual unit.

Fig. 7

Proposed deep residual U-Net architecture

5.3 Training

5.3.1 Loss Function

Boundary loss for road boundaries (a highly unbalanced segmentation problem) is used. The loss function aims to produce smoother outputs at the boundaries and to improve the model output for two close parallel roads. Instead of unbalanced integrals over the regions, the boundary loss uses integrals over the boundary between the regions. The boundary loss was used in combination with the BCE-Dice loss. The learning rate, epoch loss and epoch IoU score plots are shown in Fig. 8.
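As an illustration of the BCE-Dice term only, the following minimal sketch computes binary cross-entropy plus Dice loss on flattened probability maps. The equal weighting of the two terms, the smoothing constant and the function names are assumptions; the boundary loss term is omitted because its exact form is not given here.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; `smooth` (an assumed value) avoids division by zero."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def bce_dice_loss(probs, targets):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)
```

The Dice term depends on the overlap between prediction and target rather than on per-pixel averages, which is why it is commonly paired with BCE when crack pixels are only a small fraction of the image.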

Fig. 8

Learning rate, epoch loss and epoch IoU score plots (the red and orange lines denote validation and training data, respectively)

The model was initially trained for 10 epochs with a combination of boundary loss and BCE-Dice loss and then fine-tuned for another 30 epochs (Fig. 8). A cyclic learning rate was used with a cycle size of 5 epochs and a learning-rate decay factor of 0.8 (a 20% reduction) after each cycle. To ensure that only the best weights were used during inference, an early stopping criterion was applied. A loss in performance was observed when the network was evaluated on unseen datasets, whereas better performance was achieved on the synthetically modified dataset.
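One possible reading of this schedule is sketched below: a triangular cycle of 5 epochs whose amplitude is multiplied by 0.8 after each cycle. The triangular shape and the values of `base_lr` and `max_lr` are assumptions chosen for illustration; only the cycle size and the decay factor come from the text.

```python
def cyclic_lr(epoch, base_lr=1e-5, max_lr=1e-3, cycle_size=5, decay=0.8):
    """Learning rate for a given epoch under a decaying triangular cycle.

    base_lr and max_lr are hypothetical values, not taken from the study.
    """
    cycle = epoch // cycle_size                 # index of the current cycle
    position = (epoch % cycle_size) / cycle_size  # progress within the cycle, in [0, 1)
    triangle = 1.0 - abs(2.0 * position - 1.0)  # rises to 1 at mid-cycle, then falls
    amplitude = (max_lr - base_lr) * (decay ** cycle)  # shrinks by 20% per cycle
    return base_lr + amplitude * triangle


# 10 initial epochs followed by 30 fine-tuning epochs, as described in the text.
schedule = [cyclic_lr(epoch) for epoch in range(40)]
```

With these settings each cycle starts at `base_lr`, and the height reached in a cycle is 0.8 times the height reached in the previous one.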

The existing methods for crack detection face many limitations, chiefly the limited availability of datasets. Changing the dataset in such cases leads to difficulties in crack detection. Examples include the CrackTree, FFA and FoSA methods, which are reliable for thin cracks but tend to fail when detecting wider cracks. These applications stand to benefit from the hierarchical features and powerful abilities of CNNs. A CNN is well suited to the goal of learning a non-linear model for image analysis. A Conditional Random Field (CRF) has been used previously [16] as a post-processing step for refining low-resolution images. A Fully Convolutional Network (FCN), on the other hand, upsamples the feature maps, but its output is not very accurate. Hence, an approach that combines elements of CNN, CRF and FCN was considered more desirable.

In this architecture, there are no fully connected layers. Side-output layers are inserted after the convolutional layers, deep supervision is applied at each side-output layer, and all side outputs are then concatenated to form a final fused output (Figs. 9, 10 and 11). In this way, the final output layer acquires multi-scale and multi-level features, as the plane size of the input of the side-output layers becomes smaller and the receptive field size becomes larger. The fused prediction is refined by guided filtering with the first side-output layer (Fig. 11).

Fig. 9

The multi-layer CNN architecture used for damage detection

Fig. 10

Encoder-decoder architecture used for damage detection

Fig. 11

Deep crack architecture

Predictions made at the earlier (lower-level) stages can preserve the boundaries of cracks but are also sensitive to noise such as dark spots and dirt. Predictions from deeper convolutional stages, on the other hand, show better robustness to noise but fail to preserve segmentation boundaries. It is therefore advisable to linearly combine the predictions made at the different stages. We added some refinement modules, such as phase shift and convolutional layers [10]. The binary mask is generated first, and the guidance map is then set to a side-output (conv1_2). The guided filter produces the final refined prediction, which preserves the crack boundaries well while removing the noise in the low-level prediction. Additionally, for training, we also used cycle-consistent generative adversarial networks (Cycle-GAN), which can reduce the human intervention needed for manual label generation. Compared with the CRF method, guided filtering is faster and more efficient.

5.4 Model Training Using Cycle-GAN

In recent years, generative adversarial networks (GANs) have been employed successfully in deep learning methods as they offer a novel strategy for training different models [34]. The original GAN uses a generator built from fully connected layers that allows images to be generated from random noise. More recently, cycle-consistent adversarial networks (Cycle-GAN) were proposed by Zhu et al. [35], which allow effective training without the need for paired data. Therefore, building on the applicability of Cycle-GANs, we formulate crack detection in concrete structures as an image-to-image translation problem. Notably, Cycle-GANs can effectively train the networks without manually labelled ground truths, as they enable images with similar appearance to be translated [36].

For the Cycle-GAN based training of the network, two separate data sets are required (Fig. 12): the crack image set \(M\) with images \(\{m_i\}\), and the structure library \(K\) with images \(\{k_i\}\). The network topology is based on two image-to-image translation GANs (a forward and a reverse GAN), as presented in Fig. 12. The forward and reverse GANs perform image translations from \(M\) to \(K\) (\(F: M \to K\)) and from \(K\) to \(M\) (\(R: K \to M\)), respectively. The system contains two discriminators, \(D_m\) and \(D_k\). Here, \(D_m\) distinguishes between \(\{m_i\}\) and \(\{R(k_i)\}\) with the reverse adversarial loss \({L}_{advr}\), whereas \(D_k\), which helps to overcome the data imbalance and the differences between domains, distinguishes between \(\{k_i\}\) and the translated images \(\{F(m_i)\}\). The objective function is presented in Eq. 1:

Fig. 12

Use of Cycle-GAN for training

$$L=\left({L}_{advf}+{L}_{advr}\right)+\lambda \left({L1}_{fc}+{L1}_{rc}\right)$$
(1)

Here, \(\lambda \) controls the relative weight of the two kinds of loss (adversarial and cycle-consistency), and \({L1}_{fc}\) and \({L1}_{rc}\) represent the two cycle-consistency losses, computed with the L1 distance, in the forward and reverse GAN respectively [30].

5.4.1 Adversarial Loss

Real-like images can be generated from noise when generative adversarial networks are used for training. GANs are trained as a max–min two-player game, and it is therefore important to alternately optimize the following objectives (Eqs. 2 and 3):

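$$\max_{D}{V}_{D}\left(D,G\right)={E}_{y\sim p\left(y\right)}\left[\log D\left(y\right)\right]+{E}_{x\sim p\left(x\right)}\left[\log \left(1-D\left(G\left(x\right)\right)\right)\right]$$
(2)
$$\max_{G}{V}_{G}\left(D,G\right)={E}_{x\sim p\left(x\right)}\left[\log D\left(G\left(x\right)\right)\right]$$
(3)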

Here, \(D\), \(G\), \(x\) and \(y\) denote the discriminator, the generator, the noise vector input into the generator and a real image in the training set, respectively. \(G\) generates images \(G(x)\) that resemble images from \(Y\), and \(D\) distinguishes between the real samples \(y\) and the generated samples \(G(x)\). Moreover, \(D\) and \(G\) try to maximize Eqs. 2 and 3 respectively, which results in adversarial learning.
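The two objectives can be estimated from discriminator outputs on a batch, as in the minimal sketch below. It assumes the standard form in which the discriminator maximizes the mean of log D(y) plus the mean of log(1 − D(G(x))) and the generator maximizes the mean of log D(G(x)); the function names and the clamping constant are illustrative.

```python
import math


def discriminator_objective(d_real, d_fake, eps=1e-7):
    """Batch estimate of V_D: mean log D(y) + mean log(1 - D(G(x))).

    d_real holds D(y) for real images, d_fake holds D(G(x)) for generated ones.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_objective(d_fake, eps=1e-7):
    """Batch estimate of V_G: mean log D(G(x)) (assumed non-saturating form)."""
    return sum(math.log(max(p, eps)) for p in d_fake) / len(d_fake)
```

A discriminator that scores real images high and generated images low obtains a larger `discriminator_objective`, while a generator whose samples are scored high obtains a larger `generator_objective`; alternating updates on these two quantities gives the adversarial game.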

5.4.2 Cycle-Consistency Loss

It is well known that the adversarial loss can help in obtaining structured images, but on its own it is inadequate for translating a crack image patch to the desired structure patch, or vice versa, because it does not guarantee the consistency of the structure pattern between the input and output images. Therefore, an extra cycle-consistency term is introduced to help train the CNN and maintain the consistency of structure patterns between input and output [31]. For the data set \(Q\), each sample \(q\) should return to the original patch after passing through the processing cycle \((q \to F(q) \to R(F(q)) \approx q)\). Similarly, for each structure image \(s\) in the structure set, the network should return \(s\) to the original image \((s \to R(s) \to F(R(s)) \approx s)\). These constraints lead to the cycle-consistency loss defined as follows (Eq. 4):

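$${L}_{cyc}\left(F,R\right)={E}_{q\sim p\left(q\right)}\left[{\Vert R\left(F\left(q\right)\right)-q\Vert }_{1}\right]+{E}_{s\sim p\left(s\right)}\left[{\Vert F\left(R\left(s\right)\right)-s\Vert }_{1}\right]$$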
(4)
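A minimal sketch of the cycle-consistency idea follows, with image patches represented as flat lists of pixel values and the forward and reverse translators passed in as plain functions. The mean-absolute-difference form of the L1 distance, the function names and the default value of `lam` are assumptions for illustration.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flattened patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(forward, reverse, crack_patches, structure_patches):
    """Sum of the forward cycle (q -> F(q) -> R(F(q))) and reverse cycle
    (s -> R(s) -> F(R(s))) reconstruction errors, averaged over each set."""
    fwd = sum(l1_distance(reverse(forward(q)), q) for q in crack_patches)
    rev = sum(l1_distance(forward(reverse(s)), s) for s in structure_patches)
    return fwd / len(crack_patches) + rev / len(structure_patches)


def total_objective(adv_f, adv_r, cyc_f, cyc_r, lam=10.0):
    """Adversarial losses plus lambda-weighted cycle losses (lam is an assumed value)."""
    return (adv_f + adv_r) + lam * (cyc_f + cyc_r)
```

If the two translators are exact inverses of each other, both reconstruction errors vanish; any information lost in one direction shows up as a positive loss.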

5.5 Model Parameters

The CNN was built on established implementations including FCN [26], DSN, HED and SegNet, and the CRACK9001 library was used for training [13]. Stochastic Gradient Descent (SGD) was used for optimization. The aim was to differentiate between two classes (crack and non-crack) and to use the loss function, normalization and side-output layers so that the network achieves enhanced accuracy and convergence without relying on pre-trained models. The model parameters selected for the study were: (i) input image size 544 × 384 × 3; (ii) ground truth size 544 × 384 × 1; (iii) learning rate \(1 \times 10^{-4}\); (iv) loss weight associated with each side-output layer 1.0; (v) loss weight associated with the final fused layer 1.0; (vi) momentum 0.9; and (vii) weight decay \(2 \times 10^{-4}\).
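To make the role of the optimizer settings concrete, the sketch below applies one SGD update with momentum and weight decay to a list of scalar weights, using the learning rate, momentum and weight decay listed above as defaults. The update convention shown (weight decay folded into the gradient, velocity accumulated before scaling by the learning rate) is one common choice and is an assumption, not a description of the implementation used in the study.

```python
def sgd_step(weights, grads, velocity, lr=1e-4, momentum=0.9, weight_decay=2e-4):
    """Return updated (weights, velocity) after one SGD step.

    Convention assumed here: g <- g + wd * w; v <- momentum * v + g; w <- w - lr * v.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w  # L2 weight decay added to the gradient
        v = momentum * v + g      # momentum buffer accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity
```

Calling `sgd_step` repeatedly with the returned velocity shows how momentum increases the effective step when successive gradients point in the same direction.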

5.6 Data Augmentation

Data augmentation forms an integral component of deep networks. The data set was augmented 10 times for this study. The data augmentation was carried out by (1) rotating images to 12 different angles at 30° intervals in [0°, 360°], (2) cropping the largest rectangle without blank regions in the rotated image, and (3) horizontally flipping the images at each angle. Both raw and augmented images were used for training the network and, because of the rotation transformations, the input images were resized to 256 × 256.
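The three augmentation steps can be sketched geometrically without any image library. The first function below enumerates the rotation and flip combinations; the second computes the size of the largest axis-aligned rectangle that fits inside a rotated image, which is the crop free of blank corner regions. Both function names are hypothetical, and the closed-form rectangle computation is a standard geometric result rather than something specified in the text.

```python
import math


def augmentation_plan(step_deg=30, flips=(False, True)):
    """List every (rotation angle in degrees, horizontal flip) combination."""
    return [(angle, flip) for angle in range(0, 360, step_deg) for flip in flips]


def largest_inscribed_rect(w, h, angle_deg):
    """Width and height of the largest axis-aligned rectangle inside a
    w x h image rotated by angle_deg (no blank regions in the crop)."""
    if w <= 0 or h <= 0:
        return 0.0, 0.0
    angle = math.radians(angle_deg)
    sin_a, cos_a = abs(math.sin(angle)), abs(math.cos(angle))
    width_is_longer = w >= h
    side_long, side_short = (w, h) if width_is_longer else (h, w)
    if side_short <= 2.0 * sin_a * cos_a * side_long or abs(sin_a - cos_a) < 1e-10:
        # Half-constrained case: two crop corners touch the longer sides.
        x = 0.5 * side_short
        wr, hr = (x / sin_a, x / cos_a) if width_is_longer else (x / cos_a, x / sin_a)
    else:
        # Fully constrained case: the crop touches all four sides.
        cos_2a = cos_a * cos_a - sin_a * sin_a
        wr = (w * cos_a - h * sin_a) / cos_2a
        hr = (h * cos_a - w * sin_a) / cos_2a
    return wr, hr
```

With a 30° step, `augmentation_plan()` yields 12 angles with and without flipping; each rotated image would then be cropped to the size returned by `largest_inscribed_rect` before resizing.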

6 Experimental Analysis

The database was analysed using the selected methods, and performance was evaluated using the segmentation metrics and the F-score. The results of the proposed architecture were compared with existing methods for crack detection.

7 Performance Metrics

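The proposed architecture was applied to the collected database, and three common semantic segmentation metrics were used for the evaluation [22]: the global accuracy (GC), the class average accuracy (CAC) and the mean intersection over union (IoU) over all classes. In the equations below, following the usual convention for these metrics, \({n}_{pm}\) is the number of pixels of class \(p\) predicted as class \(m\), \({t}_{m}={\sum }_{p}{n}_{mp}\) is the total number of pixels of class \(m\), and \({n}_{cls}\) is the number of classes. The global accuracy estimates the percentage of correctly predicted pixels and is calculated as in Eq. (5):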

$$\mathrm{GC}=\frac{{\sum }_{m}{n}_{mm}}{{\sum }_{m}{t}_{m}}$$
(5)

The Class average accuracy (CAC) measures the predictive accuracy over all the classes and is defined as follows (Eq. 6):

$$\mathrm{CAC }= (\frac{1}{{n}_{cls}}){\sum }_{m}{n}_{mm}/{t}_{m}$$
(6)

The mean intersection over union (IoU) over all classes is calculated using Eq. 7. The IoU metric quantifies the percentage overlap, in pixels, between a target mask and the predicted output [12, 32].

$$\mathrm{IoU}=\left(\frac{1}{{n}_{cls}}\right){\sum }_{m}{n}_{mm}/({t}_{m}+{\sum }_{p}{n}_{pm}-{n}_{mm} )$$
(7)
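The three metrics in Eqs. 5 to 7 can be computed from a pixel-level confusion matrix, as in the minimal sketch below. The convention that `conf[p][m]` counts pixels of true class `p` predicted as class `m`, and the function name, are assumptions for illustration; every class is assumed to occur at least once so that no denominator is zero.

```python
def segmentation_metrics(conf):
    """Return (global accuracy, class average accuracy, mean IoU).

    conf[p][m] = number of pixels of true class p predicted as class m.
    """
    n_cls = len(conf)
    t = [sum(row) for row in conf]  # t_m: total pixels of class m
    predicted = [sum(conf[p][m] for p in range(n_cls)) for m in range(n_cls)]
    diag = [conf[m][m] for m in range(n_cls)]  # n_mm: correctly predicted pixels
    gc = sum(diag) / sum(t)
    cac = sum(diag[m] / t[m] for m in range(n_cls)) / n_cls
    miou = sum(diag[m] / (t[m] + predicted[m] - diag[m]) for m in range(n_cls)) / n_cls
    return gc, cac, miou


# Toy two-class example (crack, non-crack) with made-up counts.
gc, cac, miou = segmentation_metrics([[8, 2], [1, 9]])
```

Because crack pixels are rare, global accuracy is dominated by the non-crack class, whereas class average accuracy and mean IoU weight both classes equally.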

7.1 F1-Score

In addition to these measures, three other metrics, namely Precision (P; PPV in Eq. 11), Recall (R; TPR in Eq. 9) and F-score (F; Eq. 13), were calculated to evaluate the semantic segmentation. Precision is the fraction of positive predictions that are truly positive, whereas Recall is the fraction of all positive samples in the collected dataset that are predicted as positive [12, 32]. The F-score (F1-score) combines precision and recall into a single measure of a model’s accuracy on a given data set. Figure 13a, b shows the confusion matrix and the obtained results.

Fig. 13

a The confusion matrix. b Crack and non-crack images as per the matrix

$$ACC = \frac{TP+TN}{TP+FP+TN+FN}$$
(8)
$$TPR = \frac{TP}{TP+FN}$$
(9)
$$TNR = \frac{TN}{FP+TN}$$
(10)
$$PPV = \frac{TP}{TP+FP}$$
(11)
$$NPV = \frac{TN}{TN+FN}$$
(12)
$$F1 = \frac{2TP}{2TP+FN+FP}$$
(13)
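Equations 8 to 13 translate directly into code. The sketch below computes all six quantities from the four confusion-matrix counts; the function name and the example counts are illustrative, and non-zero denominators are assumed.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. 8-13 from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "TPR": tp / (tp + fn),                   # recall / sensitivity
        "TNR": tn / (fp + tn),                   # specificity
        "PPV": tp / (tp + fp),                   # precision
        "NPV": tn / (tn + fn),                   # negative predictive value
        "F1": 2 * tp / (2 * tp + fn + fp),       # harmonic mean of PPV and TPR
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
```

The F1 expression in Eq. 13 is algebraically the same as the harmonic mean of precision (PPV) and recall (TPR).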

8 Results

8.1 Performance Analysis of Proposed Methods

We compared our method with three other common methods to corroborate our experiments: (1) Crack-BN, (2) Crack-GF and (3) SegNet [23]. SegNet is one of the latest approaches used to perform semantic segmentation. Here, the SegNet network and loss functions were fine-tuned on the augmented datasets used in the current study. Crack-BN is also based on HED [24], with additional batch normalization layers added before the activation operations. Crack-GF uses a guided filtering method that is highly efficient and fast compared with conditional random fields (CRF) [37]. The probability maps were binarized using varying global thresholds. The Precision-Recall curves of the segmentation methods used in the present study are presented in Fig. 14. A representative set of images obtained from the target structures, together with the resulting segmentations, is presented in Fig. 15.

Fig. 14 Precision and recall of Crack-Det and Crack-Det-GF

Fig. 15 Crack categorization matrix distinguishing crack pixels based on the pixel width

The P–R curves generated for the two methods show that the GF method performs better, with an F-score of 0.870 compared with 0.838 for the baseline method (Fig. 14).
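A Precision-Recall curve of this kind is obtained by binarizing the predicted probabilities at a series of global thresholds and computing Precision and Recall at each one. The sketch below illustrates the idea on flattened pixel lists. The function name `pr_curve`, the toy probabilities and the convention of reporting a Precision of 1.0 when nothing is predicted positive are assumptions made for illustration; the thresholds actually used in the study are not stated here.

```python
def pr_curve(probs, labels, thresholds):
    """Return (threshold, precision, recall, f_score) for each global threshold.

    probs  : predicted crack probabilities, one per pixel (flattened)
    labels : ground-truth values, 1 for crack and 0 for non-crack
    """
    points = []
    for t in thresholds:
        tp = fp = fn = 0
        for p, y in zip(probs, labels):
            predicted_crack = p >= t  # binarize with the global threshold
            if predicted_crack and y:
                tp += 1
            elif predicted_crack and not y:
                fp += 1
            elif not predicted_crack and y:
                fn += 1
        # Convention (assumption): precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        points.append((t, precision, recall, f_score))
    return points


# Toy data, not taken from the study.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
curve = pr_curve(probs, labels, thresholds=[0.25, 0.50, 0.75])

# The operating point with the highest F-score summarizes the curve.
best = max(curve, key=lambda point: point[3])
```

Reporting the best F-score over all thresholds is one common way to reduce a P-R curve to a single number.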

Our proposed architecture shows a clear improvement in performance over the existing methods included in the study (Crack-BN, Crack-GF, and SegNet), as shown in Table 4.

Table 4 Comparison between the methods

Batch normalization can significantly boost performance because it reduces over-fitting of the CNN. Moreover, the dense predictions can be refined with either guided image filtering or conditional random fields. Our results show that guided image filtering is faster and more efficient than the other techniques. Notably, compared with all other methods, our proposed GF pipeline displayed the highest Class Average Accuracy, Mean IOU, F-score, Global Accuracy, Recall and Precision values of 0.931, 0.878, 0.870, 0.989, 0.861, and 0.881 respectively (Table 4). In contrast, the lowest performance was obtained with SegNet, as indicated by the values presented in Table 4.

Most importantly, the total crack pixels used for training and testing were divided into significant and weak crack pixels. This categorization distinguishes crack pixels based on the pixel width. A crack with a pixel width between 1 and 5 was defined as a weak crack, whereas a crack with a pixel width greater than 5 was defined as a significant crack, as shown in Fig. 15 [21].
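The width-based categorization can be expressed as a simple rule. In the sketch below, the threshold of 5 pixels comes from the text, while the way the width is measured (the longest horizontal run of crack pixels in a row of a binary mask) is an assumption made only for illustration; the study does not specify its measurement procedure here, and the names `crack_category` and `longest_run` are hypothetical.

```python
WEAK_MAX_WIDTH_PX = 5  # widths of 1 to 5 pixels are "weak" according to the text


def crack_category(width_px: int) -> str:
    """Map a crack width in pixels to 'none', 'weak' or 'significant'."""
    if width_px <= 0:
        return "none"
    return "weak" if width_px <= WEAK_MAX_WIDTH_PX else "significant"


def longest_run(row) -> int:
    """Length of the longest run of consecutive crack pixels (value 1) in a row.

    Assumption: this run length stands in for the local crack width.
    """
    best = current = 0
    for value in row:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


# Toy binary mask (1 = crack pixel), not taken from the study.
mask = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
categories = [crack_category(longest_run(row)) for row in mask]
```

For cracks that do not run vertically, a width measured perpendicular to the crack direction would be more faithful than a row-wise run length.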

In terms of metric values, our method generalizes better than Crack-BN, Crack-GF and SegNet, as shown in Table 4. Effective data augmentation methods are essential for deep models when training data is very limited. Moreover, the use of refinement modules such as the PS operation and convolutional layers for analysing the overlap between the two maps shows that the proposed method can provide better generalization and retain more information on the low-dimensional features (Fig. 12). The results of this study show that this model performs more robustly than the other methods. Our method also removes the background and irrelevant noise in the dataset [12].

9 Discussion

In this study, a drone was used to capture images of cracks on concrete bridge surfaces (Fig. 2b). A total of 2097 images were captured. The data set was divided into training and test sets containing 1300 and 237 images respectively. Overall, 78% of the images contained a significant crack and 13% contained a weak crack. The remaining 9% were non-crack images, which were used in the test set only.

Some of the representative images used in the study and their corresponding segmentations are presented in Fig. 15. The segmentations are represented as binary images. For every image, a pixel-wise segmentation map was used, which covers the total crack region in the image, and all images were resized to 544 × 384 pixels. The percentages of pixels for the crack and non-crack images are shown in Table 2. To obtain a universal representation of cracks, the crack images were selected from a diverse range of scenes and scales.

In our proposed framework, the loss function, side outputs, and batch normalization were used to distinguish the crack and non-crack classes. Batch normalization reduces overfitting of the proposed CNN and thereby boosts performance. The model parameters used for training our CNN included the loss weight of each side output and of the final fused layer, momentum, decay, and learning rate. Additionally, to reduce training time, a small dataset was used to train our CNN model. Our network was trained using two different approaches: (1) baseline (BN) and (2) guided filtering (GF). In the baseline approach, data augmentation is not performed. However, the baseline pipeline is based on the U-Net and our modified loss functions for smooth training. In a later design, we also added batch normalization layers before each activation operation to address domain variation and covariate shift. The GF approach is a version of the baseline with a guided filtering module applied. For every approach, we used an augmented dataset for training.
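The role of the per-side-output and fused-layer loss weights can be illustrated with a deeply supervised loss, in which each side output and the fused output contribute a weighted term. The sketch below assumes a plain binary cross-entropy for each term; the modified loss functions used in the study are not reproduced here, and the names `bce` and `deeply_supervised_loss`, as well as the weights and toy values, are hypothetical.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def deeply_supervised_loss(side_outputs, fused, labels, side_weights, fuse_weight):
    """Weighted sum of the side-output losses plus the fused-layer loss."""
    if len(side_outputs) != len(side_weights):
        raise ValueError("one weight is required per side output")
    side_term = sum(w * bce(out, labels) for w, out in zip(side_weights, side_outputs))
    return side_term + fuse_weight * bce(fused, labels)


# Toy example: two pixels, two side outputs and one fused output.
labels = [1, 0]
uninformative = deeply_supervised_loss(
    side_outputs=[[0.5, 0.5], [0.5, 0.5]],
    fused=[0.5, 0.5],
    labels=labels,
    side_weights=[1.0, 1.0],
    fuse_weight=1.0,
)
confident = deeply_supervised_loss(
    side_outputs=[[0.8, 0.2], [0.9, 0.1]],
    fused=[0.95, 0.05],
    labels=labels,
    side_weights=[1.0, 1.0],
    fuse_weight=1.0,
)
```

Supervising every stage in this way gives each convolutional stage a direct training signal, while the weights control how much each stage contributes relative to the fused prediction.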

Overall, in the current study, we assessed the performance of the proposed baseline and GF methods and compared it with that of three other methods, namely Crack-GF, Crack-BN and SegNet. The performance of the studied methods was measured using Mean IOU, F-score, Precision (P), Class Average Accuracy (CAC), Global Accuracy (GC), and Recall (R). For every architecture, cross-validation was also performed, and the predictions of each method were assessed using the evaluation methods explained in Table 3.
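The segmentation-level measures listed above can all be derived from a pixel-wise confusion matrix. The sketch below uses the definitions commonly adopted in the semantic segmentation literature, which are assumed here to match those in Table 3; the function name `segmentation_scores` and the example counts are illustrative only.

```python
def segmentation_scores(conf):
    """Compute Global Accuracy, Class Average Accuracy and Mean IOU.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))

    class_acc = []
    class_iou = []
    for i in range(n):
        truth = sum(conf[i])                             # pixels that belong to class i
        predicted = sum(conf[r][i] for r in range(n))    # pixels predicted as class i
        union = truth + predicted - conf[i][i]
        class_acc.append(conf[i][i] / truth if truth else 0.0)
        class_iou.append(conf[i][i] / union if union else 0.0)

    return {
        "global_accuracy": correct / total if total else 0.0,
        "class_average_accuracy": sum(class_acc) / n,
        "mean_iou": sum(class_iou) / n,
    }


# Hypothetical pixel counts: class 0 = non-crack, class 1 = crack.
scores = segmentation_scores([[90, 10], [5, 45]])
```

Because crack pixels are usually a small minority, Global Accuracy tends to be dominated by the background class, which is why Class Average Accuracy and Mean IOU are reported alongside it.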

Our results show that when the training set is augmented 10 times, the performance improves considerably. The refinement provided by the proposed post-processing methods is also effective. In comparison with Crack-GF, Crack-BN, and SegNet, our proposed architecture shows clear improvements. Traditional methods are reported to rely on post-processing (e.g. length constraints, curvature and geometric features), which is indispensable for obtaining continuous and complete thin cracks. Convolutional neural networks, however, remain weak in this respect.

10 Conclusion

The manual investigation of damage to infrastructure is a challenging and time-consuming endeavour that lacks objectivity and reliability. Automatic crack detection through techniques such as image processing is therefore necessary, but the influence of noise caused by lighting, blurring and other factors needs to be addressed. Among the different deep learning approaches, CNNs learn image features automatically instead of relying on manual feature extraction, which makes them less influenced by noise. For this reason, we propose a framework based on a deep hierarchical CNN architecture, along with Cycle GAN, that predicts the crack segmentation for each pixel in an end-to-end manner.

The proposed method utilizes the extended FCN (Fully Convolutional Networks), the DSN (Deeply Supervised Nets) and a U-net architecture. The DSN delivers direct and integrated feature supervision at each convolutional stage. Moreover, the model network learns and aggregates features as it moves from the low-level convolutional layers to the high-level convolutional layers during training. The architecture therefore differs from traditional ones, which mainly rely on the last convolutional layer. Additionally, to refine the prediction results, we used phase-shift-based guided filtering. Our proposed deep hierarchical convolutional neural network (CNN) architecture achieved high performance on the considered dataset: the GF pipeline displayed the highest Class Average Accuracy, Mean IOU, Global Accuracy, Recall and Precision values of 0.931, 0.878, 0.989, 0.861, and 0.881 respectively. The proposed CNN framework has several limitations, such as limited pixel-perfect accuracy. It also requires substantial computational resources because of the generative approach. In future, this work can be converted into a knowledge distillation architecture (student and teacher) [38], in which a complex network (teacher) is used to learn the underlying mapping while limits are enforced on the complexity of the student model.