1 Introduction

The active management of aging civil infrastructure has become a twenty-first century challenge for transportation agencies around the world, which are committed to maintaining healthy infrastructure and preventing unexpected structural failures. However, there is clear evidence of reduced efficiency, slow recovery operations and aged infrastructure assets. In the 1990s, the construction maintenance markets of the UK and Switzerland already accounted for 50% of the total value of their construction markets [1]. The Report Card published by the American Society of Civil Engineers in 2021 stated that the United States has more than 617,000 bridges, 42% of which are at least 50 years old, and that at the current investment rate it will take until 2071 to cover the $125 billion rehabilitation cost [2]. In South Korea, regular annual inspections of 270,000 structures are required, although the budget and the number of inspectors are gradually decreasing [3]. On the European continent, much of the bridge infrastructure was built after World War II and is now beyond its useful life. All these figures highlight the need for reliable and integrated systems of inspection, assessment and maintenance to ensure safety and an efficient allocation of resources. Quick and frequent surveys are required to plan essential maintenance and repairs proactively, before conditions become too dangerous and expensive to address. Currently, the assessment of structural condition relies on qualified inspectors who perform on-site inspections using photographs, annotations, drawings and the collection of historical information. However, as these inspections are carried out at fixed time intervals, there is a risk that performance drops below an established threshold between one inspection and the next. Furthermore, such inspections can be time-consuming, costly, laborious and dangerous, especially for inaccessible structures. Finally, depending on the expertise of the inspector, subjectivity and human error can lead to different classifications for similar defects. To address all these issues, improved inspection procedures with less human intervention, lower costs and higher spatial resolution need to be developed to enable automated assessment of the condition of civil infrastructure. In recent years, with the development of low-cost and high-quality imaging devices, computer vision techniques have been gathering increasing attention in the civil engineering research community. Indices for local condition assessment such as cracking, spalling, corrosion and delamination can be extracted from visual images of structural surfaces. The advantages of this approach lie in the possibility of enabling long-distance, non-contact, low-cost, objective and automatic condition assessment [4, 5]. Moreover, vision sensors used in conjunction with vehicles or unmanned aerial vehicles (UAVs) have been proposed as one of the most promising strategies for fast scanning at higher spatial resolution without the need for traffic closures.

Computer vision-based inspection ranges from conventional approaches using image processing algorithms to recent attempts based on deep learning techniques. Traditional detection algorithms rely on manual feature extraction, which transforms the available data into valuable information, ranging from statistical methods based on greyscale distribution [6], colour and texture descriptors [7, 8] and binarization methods [9] to machine learning-based models [10]. However, the application of image processing in an automated structural inspection environment is limited, as these techniques do not consider the contextual information provided by the regions around the defects. They need to be tuned manually, depending on the type of target structure to be monitored [5]. Furthermore, variations in lighting and shadowing during image capture, or skewed long-range images, can yield false and erroneous results. Real-world situations are highly varied, and building a general algorithm that succeeds across them is quite complex.

The development of deep learning techniques has greatly extended the capability and robustness of traditional vision-based damage detection by extracting features automatically, without requiring time-consuming and complex processes. As the features are defined by the machine, human bias and error are avoided and replaced by the error of the system, moving from a knowledge-driven approach to a data-driven approach. Different applications for damage detection have been studied for a wide variety of structures and types of defects, ranging from cracks and spalling to corrosion. Convolutional neural network (CNN) architectures have been developed to build classifiers for detecting cracks in steel box girders [11], road pavement [12] and concrete surfaces [13]. To locate the crack, all these methods first scan the original images with sub-patches and then activate only those containing defects. To overcome the coarse localization of sliding-window detection, Quqa et al. [14] proposed a two-step approach that first identifies the “cracked” regions and then applies image processing techniques only to locate the crack pixels. To avoid the need for a large dataset to obtain a high level of accuracy with CNNs trained from scratch, the transfer learning technique was adopted on pre-trained networks. The well-known AlexNet architecture has been fine-tuned to classify cracking [15, 16] and spalling [17] on concrete surfaces. Savino and Tondolo [18] compared eight pre-trained networks to classify images containing undamaged, cracked and delaminated structural elements, reaching a maximum accuracy of 94% with the GoogLeNet architecture. Kruachottikul et al. [19] developed a defect-inspection system for reinforced concrete bridge substructures, able to classify cracking, erosion, honeycomb, scaling and spalling defects with an accuracy of 81%. Since the image classification approach can only distinguish between images based on the expected class, object detection methods have recently been applied to recognize and locate multiple damages within bounding boxes. Cha et al. [20] proposed a Faster Region-based Convolutional Neural Network (Faster R-CNN) to detect concrete cracks, steel corrosion, bolt corrosion and steel delamination in the same image with an average precision of 87.8%. Faster R-CNNs were also used to identify and locate damage in masonry historic structures [21], urban shield tunnel linings [22] and large crane structures [23]. However, object-detection-based methods, which provide only class labels and a bounding box around the region of interest, cannot precisely define the shape of the damage but only identify and locate it. Moreover, since they struggle with overlapping regions, they are unsuitable for providing morphological information and the extent of defects.

An effective method to delineate the precise location and shape of objects is known as semantic segmentation. More specifically, a semantic segmentation network classifies each pixel of an image with a certain label, providing an image that is segmented by class. To the knowledge of the authors, at the time of writing relatively few works have used semantic segmentation neural networks for civil infrastructure defect assessment. Zhang et al. [24] proposed a CNN architecture for pixel-level pavement crack detection on 3D asphalt surfaces with a precision of 90.13%. Zhu and Song [25] presented a weakly supervised segmentation and detection network based on an autoencoder to identify cracks on asphalt concrete bridge decks. Most of the studies concerned pixel-level surface crack detection using transfer learning [26] or combining the advantages of pre-trained networks [27,28,29,30]. Pozzar et al. [31] investigated the performance of different pre-trained models to detect multiclass concrete damage using thermographic images, identifying the VGG16 model as the most promising with average IoU values of 59.5% for delamination and 39.4% for cracks.

In most previous research, neural networks were trained with images collected under near-ideal laboratory conditions, with camera positions and angles chosen according to the appearance and location of the defects. Furthermore, as the datasets contained only hundreds of images, much smaller images were cropped from the originals to increase their number. However, this approach cannot cover the diversity of the on-field environment, because it is difficult to reproduce ideal conditions and to continuously control the lighting direction and the positions and angles of cameras installed on UAVs. This makes most existing image-based methods highly dependent on the data used, generalizing poorly to other datasets. Moreover, most research efforts, as mentioned previously, focused on the semantic segmentation of one specific defect at a time. To the knowledge of the authors, the performance of pre-trained semantic segmentation models in detecting multiclass concrete defects has never been investigated. Based on these gaps, this research proposes a CNN able to perform the semantic segmentation of images containing “Crack”, “Delamination” and “Background” in several civil infrastructures. The first objective was to train a robust neural network that is not affected by image quality and that is effective for a wide range of on-field inspections. Therefore, the neural network has been developed considering a dataset of 1250 images collected from the Internet, on-field bridge inspections and Google Street View. The images are affected by a broad range of noise linked to these sources and represent real environmental conditions with varied backgrounds. The second objective was to find, among the existing pre-trained neural networks, the most suitable one for the civil engineering defect detection task. This will allow further research to detect additional types of structural damage, such as corrosion, efflorescence, moisture stains and voids. Furthermore, morphological information was extracted to prove the superiority of the semantic segmentation approach over existing object detection methods in providing quantitative information about civil infrastructure defects.

2 Semantic segmentation

The general architecture for the semantic image segmentation task is a CNN which associates each pixel of an image with a corresponding class label. Generally, the architecture of a CNN consists of shallow layers that learn low-level features and deep layers specialized in high-level details. For the image classification task, which aims to learn what the image contains, the expensive computation of deep neural networks is relieved by down-sampling the feature maps with pooling or strided convolutions. For the image segmentation task, however, a full-resolution semantic prediction must be preserved by adopting encoder/decoder structures (Fig. 1). The encoder part down-samples the input into low-resolution feature maps and learns to discriminate between classes, while the decoder part up-samples from a low-resolution map to a full-resolution segmentation map.

Fig. 1 Semantic segmentation neural network

The down-sampling part is itself a very deep CNN built by stacking multiple layers, such as convolution, pooling and activation layers. Since the excessive downsizing of the encoder part due to consecutive pooling operations results in a loss of information that cannot be recovered in the decoder part, Chen et al. [32] proposed the DeepLabv3+ decoder to refine the segmentation results. The proposed model employs Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale contextual information without losing spatial resolution. Finally, the softmax layer predicts the class of each pixel after a series of transposed convolutions which upsample the resolution of the feature maps.

2.1 Convolutional layer

Convolutional layers are the main building blocks used in CNNs, responsible for capturing different levels of features. The first element involved in a Convolutional layer to perform the convolution operation is called the kernel or filter. A convolution is a linear operation that involves an element-wise multiplication between the input and the weights contained in the filter (Fig. 2). The sliding step size of the kernel on the input is defined as the stride and, together with the padding, determines the size of the convolved feature.

Fig. 2 Convolutional operation example

Assuming the case of one-dimensional convolution, the output of the convolution process is

$$y(i) = \sum_{k=1}^{K} x(i + k) \cdot w(k) + b$$
(1)

where x(i) is the input, w(k) is the filter of length K and b is the bias. Systematic application of the filter across an image allows a feature to be extracted anywhere in the image, creating a feature map. It is important to note that the local dependencies captured in the original image depend on the weights, which are automatically tuned during the training process. Since the convolution is a linear operation, the Convolutional layer ends with an activation function to introduce a nonlinear transformation component. The most used activation function is the Rectified Linear Unit (ReLU), which returns the input value directly if it is positive and zero otherwise. Because ReLU is linear for positive values, it facilitates much faster computation during the training of a neural network with backpropagation.
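
As a minimal illustration of Eq. (1), the following MATLAB sketch applies a short filter to a one-dimensional signal and then a ReLU activation; the input values, weights and bias are arbitrary and chosen only for demonstration.

```matlab
% Minimal sketch of the 1-D convolution in Eq. (1) followed by a ReLU.
% Input signal, filter weights and bias are illustrative values only.
x = [1 3 2 0 4 1 5];               % input signal
w = [0.2 -0.5 0.3];                % filter of length K = 3
b = 0.1;                           % bias
K = numel(w);
y = zeros(1, numel(x) - K + 1);
for i = 1:numel(y)
    % element-wise product and sum over the window
    % (1-based MATLAB indexing: the window starts at x(i) rather than x(i+1))
    y(i) = sum(x(i:i+K-1) .* w) + b;
end
y_relu = max(y, 0);                % ReLU: negative outputs are set to zero
```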

2.2 Pooling layer

Similar to the Convolutional layer, the Pooling layer decreases the size of the convolved feature, reducing both the probability of overfitting and the computational cost. The key features are commonly extracted by two types of Pooling: Max Pooling and Average Pooling (Fig. 3).

Fig. 3 Pooling layer example

Max Pooling returns the maximum value of the portion of the feature map covered by the kernel, whereas Average Pooling computes the mean value. The Pooling layer is frequently used after the Convolutional layer in order to intensify the important features kept by the Convolutional layers and discard the information irrelevant to the output.
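
A minimal sketch of the two pooling operations on a toy 4 × 4 feature map is reported below; the values are chosen only for illustration.

```matlab
% Minimal sketch of 2x2 Max and Average Pooling with stride 2 on a toy feature map.
F = [1 3 2 4;
     5 6 1 2;
     7 2 9 0;
     3 8 4 1];
maxPool = zeros(2);
avgPool = zeros(2);
for r = 1:2
    for c = 1:2
        block = F(2*r-1:2*r, 2*c-1:2*c);   % 2x2 window covered by the kernel
        maxPool(r, c) = max(block(:));     % Max Pooling keeps the strongest response
        avgPool(r, c) = mean(block(:));    % Average Pooling keeps the mean response
    end
end
```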

2.3 Atrous spatial pyramid pooling

DeepLabv3+ applies several parallel Atrous convolutions, also called hole or dilated convolutions, to capture the features computed by CNNs at different scales. The Atrous convolution is a type of convolution that increases the effective filter size while keeping the same number of parameters (Fig. 4). The dilation rate, l, indicates how much the filter is widened: l − 1 is the number of holes (zeros) inserted between consecutive filter parameters.

Fig. 4 Atrous convolution (3 × 3) with different dilation rates of 1, 2, and 3, respectively

When the rate is 1, it corresponds to a standard convolution; when the rate is equal to 2, the receptive field grows to 5 × 5 while keeping 3 × 3 convolution parameters. Similarly, the Atrous convolution with dilation rate 3 captures the information of a 7 × 7 convolution with the same number of parameters. In the case of one-dimensional convolution, the Atrous convolution for each location i on the output feature map is:

$$y(i) = \sum_{k=1}^{K} x(i + l \cdot k) \cdot w(k)$$
(2)

where x(i) is the input at a pixel and w(k) is the filter of length K. As pointed out above, the standard convolution is the special case in which the dilation rate is l = 1.

In the ASPP module, four parallel Atrous convolutions with different rates are applied to preserve detailed spatial information and capture features at multiple scales. After the parallel operations, the resulting feature maps are concatenated along the channel dimension.
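
As a minimal illustration of Eq. (2), the following sketch performs a one-dimensional Atrous convolution with a dilation rate of 2; all values are arbitrary and chosen only for demonstration.

```matlab
% Minimal sketch of the 1-D Atrous convolution in Eq. (2) with dilation rate l.
x = [1 3 2 0 4 1 5 2 6];   % input signal (illustrative values)
w = [0.2 -0.5 0.3];        % filter of length K = 3
l = 2;                     % dilation rate (l = 1 recovers the standard convolution)
K = numel(w);
nOut = numel(x) - l*(K-1);
y = zeros(1, nOut);
for i = 1:nOut
    for k = 1:K
        % sample the input every l-th position
        % (1-based MATLAB indexing: the window starts at x(i) rather than x(i+l))
        y(i) = y(i) + x(i + l*(k-1)) * w(k);
    end
end
```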

2.4 Transposed convolutional layer

To produce pixel-to-pixel prediction results, an upsampling operation is introduced to increase the spatial resolution of a coarse feature map to the dimensions of the original image. This operation is called transposed convolution and, unlike the Convolutional layer, produces an output larger than the input. As presented in Fig. 5, each element in the input is multiplied by the kernel (i.e., the matrix containing the weights), and the resulting intermediate matrices are combined with strides in both the width and height directions. Finally, the values assembled in overlapping regions are added together to obtain the enlarged output.

Fig. 5 Transposed convolution with a 2 × 2 kernel and stride of 1 for a 2 × 2 input

In contrast to the regular convolution, where strides are specified for the input, in the transposed convolution they are specified for the intermediate matrices, increasing the size of the output.
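
The following sketch reproduces the operation of Fig. 5 for a 2 × 2 input, a 2 × 2 kernel and a stride of 1; the numerical values are arbitrary and used only for demonstration.

```matlab
% Minimal sketch of a transposed convolution (2x2 kernel, stride 1, 2x2 input).
X = [1 2;
     3 4];                 % input
W = [1 0;
     0 1];                 % kernel containing the weights
stride = 1;
Y = zeros(size(X,1) + size(W,1) - 1);    % 3x3 output for stride 1
for r = 1:size(X,1)
    for c = 1:size(X,2)
        rows = (r-1)*stride + (1:size(W,1));
        cols = (c-1)*stride + (1:size(W,2));
        % each input element scales the kernel; overlapping contributions are summed
        Y(rows, cols) = Y(rows, cols) + X(r, c) * W;
    end
end
```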

2.5 Softmax layer

The last layer of the CNN must assign, for each pixel of the original image, a score to each class. The softmax function is used to convert the vector of output scores into a multiclass categorical probability distribution through a normalized exponential, and it is expressed as

$$\sigma(y)_{i} = \frac{e^{y_{i}}}{\sum_{j=1}^{K} e^{y_{j}}}$$
(3)

where yi are the elements of the input vector to the softmax function and K is the number of classes in the multiclass classifier. To quantify how far the network predictions are from the actual classes, the Cross-Entropy loss function is calculated, defined as

$$L_{\text{CE}} = - \sum_{i=1}^{n} t_{i} \log(p_{i})$$
(4)

where ti is the ground truth and pi is the softmax score predicted at the specific pixel. The smaller the loss function, the closer the predicted values are to the correct classes.
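
As a minimal worked example of Eqs. (3) and (4), the following sketch computes the softmax probabilities and the cross-entropy loss for a single pixel; the scores and the one-hot label are arbitrary illustrative values.

```matlab
% Minimal sketch of Eqs. (3)-(4) for one pixel and K = 3 classes.
y = [2.0 0.5 -1.0];             % raw class scores (illustrative values)
p = exp(y) ./ sum(exp(y));      % Eq. (3): softmax probabilities
t = [1 0 0];                    % one-hot ground truth for this pixel
L_CE = -sum(t .* log(p));       % Eq. (4): cross-entropy loss contribution
```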

3 Training process

In this study, the stochastic gradient descent algorithm with momentum has been selected to minimize the Cross-Entropy loss during the training process. This is an iterative method that optimizes the loss by adjusting the weights of the network, thereby increasing accuracy. To define the most suitable architecture for the damage detection task, transfer learning of networks pre-trained on the ImageNet dataset has been used. Transferring the learned generic features helps to achieve better performance with less training time, avoiding randomly initialized weights trained from scratch. The fine-tuning of the deep layers and of the new classification layer then refines the representations of the high-level features of the new dataset in the base model. Furthermore, for the best performing network, the hyperparameter configuration has been optimized to define the optimal architecture. In this work, DeepLabv3+ networks have been created with weights initialized from pre-trained MobileNet-v2 [33], Xception [34], ResNet-18 and ResNet-50 [35] networks. The MATLAB Deep Learning Toolbox allows easy implementation of the “deeplabv3plusLayers” function to create a DeepLabv3+ network with the specified base network, number of classes and image size. In addition, the “pixelClassificationLayer” function creates a pixel classification output layer to provide the categorical label for each image pixel. The training has been performed using MATLAB on an NVIDIA GeForce GTX 1650 Ti with 4 GB of GPU memory.
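
A minimal sketch of this set-up, using the toolbox functions mentioned above, is shown below; the image size, class names and the training datastore variable (dsTrain) are assumptions introduced only for illustration.

```matlab
% Minimal sketch of the DeepLabv3+ set-up with a pre-trained base network
% (MATLAB Deep Learning / Computer Vision Toolboxes).
imageSize  = [300 300 3];
classNames = {'Background', 'Crack', 'Delamination'};
numClasses = numel(classNames);

% DeepLabv3+ with weights initialized from a pre-trained ResNet-50
lgraph = deeplabv3plusLayers(imageSize, numClasses, 'resnet50');

% Training options used in the comparative study of Section 4
options = trainingOptions('sgdm', ...
    'Momentum', 0.9, ...
    'InitialLearnRate', 1e-3, ...
    'L2Regularization', 1e-4, ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 16);

% net = trainNetwork(dsTrain, lgraph, options);   % dsTrain: combined image/label datastore
```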

3.1 Pretrained networks

Pretrained networks are layered architectures shared by their respective development teams, which allow the final layer to be replaced and some of the previous layers to be retrained to reach a stable state on a new task. Deep learning models differ in accuracy, speed and size, which should be considered as a starting point when choosing a new classification system. The present experimental study involved pre-trained networks with a balanced trade-off between speed and accuracy, suitable for deployment on embedded systems: MobileNet-v2, Xception, ResNet-18 and ResNet-50.

MobileNet-v2 is a neural network architecture announced by Google researchers to run efficiently on devices with low computational power. The main idea behind the MobileNet architecture is the split of the convolution layer into a depthwise convolution layer and a 1 × 1 convolution layer, forming a “depthwise separable” convolution block. The depthwise convolution applies a single convolutional filter to each image channel, while the pointwise convolution builds new features by computing linear combinations along the depth dimension. In v2, 1 × 1 expansion and 1 × 1 projection layers were added at the beginning and end of the depthwise convolution to form the Bottleneck Residual block, which allows the use of low-dimensional tensors and reduces the number of computations. The full MobileNet-v2 architecture then consists of 17 of these building blocks followed by a 1 × 1 convolution, a global average pooling layer and a classification layer (Table 1).

Table 1 MobileNet-v2 architecture

The Xception model was developed by Google researchers as an extension of the Inception architecture, involving “depthwise separable” convolutions and Max Pooling, all linked with shortcut connections. Adding connections that skip one or more layers avoids the degradation problem related to learning identity mappings in deeper networks. A specificity of Xception is that the depthwise convolution does not precede the pointwise convolution: the order is inverted, with the 1 × 1 pointwise convolution applied first. The feature extraction base is formed by a linear stack of 36 Convolutional layers structured into 14 modules. The diagram in Fig. 6 details the number of filters, the filter sizes and the strides.

Fig. 6 Xception architecture

Shortcut connections were first introduced within the deep Residual Network, which made it possible to train hundreds or thousands of layers without running into the vanishing gradient problem. There are several variants of the ResNet architecture based on the same concept but with different numbers of layers, such as ResNet-18 and ResNet-50. ResNet networks consist mainly of five types of convolution blocks, called conv1 to conv5, followed by a fully connected layer and a softmax layer. Each convolution block uses 2 convolution layers of size 3 × 3 for ResNet-18 or 3 convolution layers of size 1 × 1, 3 × 3 and 1 × 1 for ResNet-50. Table 2 reports a summary of the output size at every layer and the dimensions of the convolutional filters at every point in the architectures.

Table 2 ResNet-18 and ResNet-50 architectures

Table 3 provides further details about the network architectures adopted in this study. The depth is defined as the number of sequential convolutional or fully connected layers from the input layer to the output layer.

Table 3 Pretrained networks properties

3.2 Building database

The performance of a neural network is related to the variety of the training images. A training dataset with images captured under constrained conditions may lead CNNs to perform poorly on classification tasks outside those assumptions. In order to obtain training images covering a wide variety of possible situations, the training dataset was established by collecting raw images from the Internet, on-field bridge inspections and Google Street View. The use of three different sources made it possible to gather images with different quality, resolution and background, increasing the usefulness of the research also for on-field applications with low-cost sensors. A total of 1250 images have been manually labeled using the MATLAB Image Labeler app, where the pixels are labeled as “Delamination”, “Crack” and “Background”, respectively. Figure 7 shows examples of the collected raw images used to build the datastore and their ground truth annotations. White pixels correspond to “Background”, yellow is used to annotate “Delamination” and cyan is used for “Crack”.

Fig. 7 Examples of images used to build the datastore and their ground truth

To decrease the computational cost and training time, all the images have been cropped to a resolution of 300 × 300 pixels after the labeling operation. Furthermore, to make the neural networks invariant to distortions in the image data and to decrease the probability of overfitting, the amount of training data was increased by applying randomized augmentation with a combination of rotation, reflection and shear. After data augmentation, the doubled database was randomly divided into 80% for the training set and 20% to validate the model. Specifically, 2000 images were randomly selected to generate the training set and another 500 were used to create the validation set.
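
A minimal sketch of this randomized augmentation, using the MATLAB imageDataAugmenter, is reported below; the specific ranges and the datastore variables are assumptions chosen only for illustration.

```matlab
% Minimal sketch of the randomized augmentation (rotation, reflection, shear).
augmenter = imageDataAugmenter( ...
    'RandRotation',    [-30 30], ...   % random rotation, in degrees
    'RandXReflection', true, ...       % random horizontal mirroring
    'RandXShear',      [-10 10]);      % random horizontal shear, in degrees

% The augmenter can then be attached to the image/pixel-label datastore pair, e.g.:
% dsTrain = pixelLabelImageDatastore(imdsTrain, pxdsTrain, 'DataAugmentation', augmenter);
```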

A common issue in concrete damage datasets is the imbalance of the class distribution between pixels containing cracks and the others, because cracks generally cover a smaller area in the images. When a dataset is unbalanced, the error of the overrepresented classes contributes much more than the error of the underrepresented classes, leading to poor performance on the latter. To avoid a semantic segmentation biased toward the dominant classes, class weighting has been adopted during training to increase or decrease the importance of a pixel. The weight of each class, wc, has been defined by computing the median frequency weighting according to

$$w^{c} = \frac{\text{median}(f)}{f^{c}}$$
(5)

where the frequency f represents the number of pixels of the class divided by the total number of pixels in the images that contain an instance of class c. The number of pixels for each class within the training set, denoted by “Pixel count”, its frequency and the class weights can be seen in Table 4.

From Table 4, it can be noticed that the “Crack” class has the lowest pixel count and frequency, corresponding to the heaviest weighting.

Table 4 Pixel numbers and median frequency class weights for each class
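
A minimal sketch of how the weights in Table 4 can be computed with the MATLAB Computer Vision Toolbox is shown below; the training pixel label datastore variable (pxdsTrain) is an assumption introduced only for illustration.

```matlab
% Minimal sketch of the median frequency weighting of Eq. (5).
tbl  = countEachLabel(pxdsTrain);                 % per-class pixel counts
freq = tbl.PixelCount ./ tbl.ImagePixelCount;     % class frequency f^c
classWeights = median(freq) ./ freq;              % w^c = median(f) / f^c

% The weights can then be assigned to the pixel classification output layer:
% pxLayer = pixelClassificationLayer('Classes', tbl.Name, 'ClassWeights', classWeights);
```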

4 Comparative analysis and evaluation

Once the model architectures and the dataset have been defined, the model hyperparameters need to be configured to start the training process. Being external to the networks, these values cannot be directly estimated from the data but can be set using heuristics. Thus, the optimal network architecture has been explored considering a fixed number of 10 epochs, a mini-batch size of 16 images, a momentum of 0.9 and an L2 regularization of 0.0001. To identify a suitable initial learning rate, the values 10⁻³, 10⁻⁴ and 10⁻⁵ have been examined during the training process of each network. A comparative study has first been established according to the percentage of correctly classified pixels, defined as

$$\text{GA} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
(6)

where GA is the global accuracy, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. Figure 8 depicts the validation accuracy recorded during the training process of each network for the different learning rates.

As reported in Fig. 8, the best performing networks have been ResNet-18 and ResNet-50 with a learning rate of 0.001, achieving a global accuracy of 88.76% and 88.57%, respectively. However, this metric can give misleading results in case of class imbalance, as it is biased towards the classes that dominate the image. For this reason, to identify the network with superior segmentation ability, the accuracy and the intersection-over-union (IoU) for the individual classes have been considered. The IoU metric is the ratio between the overlap and the union of the predicted segmentation and the ground truth:

$$\text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}$$
(7)
Fig. 8 Validation accuracy during training processes with different learning rates

The lower the IoU, the worse the prediction result. Tables 5 and 6 show the accuracy and IoU-related measures for ResNet-18 and ResNet-50, respectively.

From Tables 5 and 6, it can be noticed that the accuracies for the “Delamination” and “Crack” classes improve from ResNet-18 to ResNet-50, with a corresponding slight decrease in both the accuracy and the IoU metric for the “Background” class. Overall, since the mean accuracy improves from 0.87739 to 0.89427, by approximately 1.7%, and the mean IoU from 0.5909 to 0.59213, the ResNet-50 network has been chosen as the reference architecture in this work.

Table 5 Accuracy and IoU for ResNet-18 network
Table 6 Accuracy and IoU for ResNet-50 network
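
A minimal sketch of how these per-class metrics can be obtained with the MATLAB Computer Vision Toolbox is shown below; the trained network (net) and the validation image and pixel label datastores (imdsVal, pxdsVal) are assumptions introduced only for illustration.

```matlab
% Minimal sketch of the metric computation behind Tables 5 and 6.
pxdsResults = semanticseg(imdsVal, net, 'WriteLocation', tempdir);   % predicted labels
metrics = evaluateSemanticSegmentation(pxdsResults, pxdsVal);        % compare with ground truth

metrics.DataSetMetrics   % global accuracy, mean accuracy, mean IoU
metrics.ClassMetrics     % per-class accuracy and IoU ("Background", "Crack", "Delamination")
```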

5 Results and discussion

As shown in the previous section, ResNet-50 achieved better results than ResNet-18. Therefore, an empirical evaluation of the optimal hyperparameters has been performed to further improve its performance. An extensive search has been conducted to define the momentum, i.e. the contribution of the previous parameter update to the current learning iteration. Furthermore, the L2 regularization term applied to the weights in the loss function has been assessed to reduce overfitting. To observe the complete convergence behavior and avoid overfitting, the loss function has been minimized considering a training process of 20 epochs with mini-batches of size 8. The final configuration of tuned hyperparameters is: learning rate 0.001, momentum 0.9, regularization 0.0001, 20 epochs and mini-batch size 8. Figure 9 depicts the final training and validation results for the network considered in this work.

Fig. 9 Accuracies for each epoch

The final training and validation accuracies, achieved after a training time of about 3 h, are 93.42% and 91.04%, respectively. Table 7 summarizes the semantic segmentation quality metrics for each class.

Table 7 Accuracy and IoU for the optimal tuned hyperparameters

The results show that, among the three classes, “Crack” has the lowest accuracy whereas “Delamination” has the highest. The difficulty in recognizing the pixels of the “Crack” class makes sense from a visual point of view, since “Delamination” and “Background” have more easily distinguishable spatial features. To examine the performance of the trained and validated network, Fig. 10 presents examples of images used in the validation process. The first column contains the original images, the second column the ground truth and the last column the predictions.

Fig. 10 Examples of detection result by the proposed network

Despite the good performance of the proposed network, it still shows some inaccuracies, mainly in detecting cracks. The typical incorrect prediction is a crack thickness larger than in the ground truth. Although these are minor errors, the results demonstrate the reliability of this model for the automatic assessment of existing concrete structures. A larger training database could nevertheless improve model capacity and generalization in future applications. Performance and results could also be improved by considering high-resolution images. Figure 11 shows some examples on high-resolution test images, which were never used in either the training or the validation phase.

Fig. 11 Test images with high-resolution: a underside of stairs; b piers; c girders; d piers; e pier cap; f abutment; g pier cap; h concrete surface

The test images showed that considering high-resolution images could add significant capability to classify civil infrastructure damage, even for the “Crack” class. Looking at the details, the proposed method is not sensitive to varying background patterns, concrete textures, exposure and environments, making it useful for on-field civil infrastructure inspection.

5.1 Damages’ measurement

Once the damage has been detected, it is possible to extract morphological information to determine durability and conditions of exposure, and to define the economic and safety impact. Currently, the actual need and urgency are defined in an approximate and qualitative way, according to quick surveys of the infrastructural heritage. For each defect on the structure, extension and intensity are indicated through constant coefficients without reference to quantitative analyses. The complexity, level of detail and cumbersomeness of the investigations are inversely related to the number of infrastructures on which they are applied and to the uncertainty of the results. The proposed deep learning-based inspection approach not only makes the process automatic but also provides useful data to reconstruct the damage evolution without operator-dependent errors.

Once the class of each pixel has been predicted, the properties of the image regions can be quantified using the MATLAB function “regionprops(Image, ‘properties’)”. Table 8 lists some measurements on the test images of Fig. 11, related to the actual number of pixels in the regions classified as “Delamination”, “Crack” and “Background” (‘Area’).

Table 8 Area measurements on test images (Fig. 11)

Furthermore, the amount of damage can be defined in terms of percentage of the total area, or other units, given a proper spatial calibration factor.
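
A minimal sketch of such a measurement on a single segmented image is shown below; the categorical label image C, as returned for instance by semanticseg, is an assumption introduced only for illustration.

```matlab
% Minimal sketch of the morphological measurement of the "Crack" regions.
crackMask = (C == 'Crack');                     % binary mask of the pixels labeled "Crack"
stats = regionprops(crackMask, 'Area');         % pixel area of each connected crack region
totalCrackPixels = sum([stats.Area]);           % total number of "Crack" pixels
crackRatio = 100 * totalCrackPixels / numel(C); % damage as a percentage of the image area
```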

6 Conclusions

This paper proposed an automated civil infrastructure inspection approach based on deep learning to detect and quantify “Delamination”, “Crack” and “Background” regions on real structures at the pixel level. To ensure a wide range of adaptability, the training and validation datasets were built by collecting images from the Internet, on-field bridge inspections and Google Street View under uncontrolled conditions. Multiple environments, concrete textures and photo properties have been considered in order to obtain a robust model suitable for real on-field inspections. For the collected images, each pixel has been manually labeled to create the ground truth data for training the semantic segmentation algorithms. Data augmentation was implemented to enhance diversity and expand the dataset; after augmentation, the numbers of images used for training and validation were 2000 and 500, respectively.

A comparative study has been performed between pre-trained networks to define the most suitable one for the semantic segmentation of civil infrastructure defects. To find the best training model, the best learning rate has been selected empirically. The best performing ResNet-50 network has been fine-tuned to set the hyperparameter configuration, achieving the highest validation accuracy of 91.04%. On the validation dataset, the highest accuracies correspond to the “Delamination” and “Background” classes, whereas the “Crack” class is found to be the most challenging to detect accurately. In addition, the performance of the trained network was tested on high-resolution test images not used for training or validation. This analysis demonstrated that the proposed method can provide very accurate detection results for all classes. Furthermore, the proposed method has been used not only for the detection task but also to quantify defects by extracting morphological information. This research confirmed the high degree of applicability and the advantages of computer vision-based inspection of civil infrastructure, which may significantly improve productivity in the future.

To improve the performance of the semantic segmentation networks and allow engineers to apply this technique for their specific tasks, the datastore can be downloaded as open source from the website (https://drive.google.com/drive/folders/1sdzPAai6d6fVgM-qEFCnCl0MQkIG7NTN?usp=sharing).

Future research will concern the improvement of the semantic segmentation metrics by considering a larger dataset and multispectral images that provide further information about each pixel. Furthermore, LiDAR sensor data and digital models will be integrated to develop a fully automated inspection procedure. Computer vision-based methods are thus expected to replace traditional visual inspection in the near future, thanks to their objective assessment and savings in resources.