
1 Introduction

Against the background of intelligent manufacturing in China, intelligent construction technology is increasingly becoming an important development direction in the construction and engineering field. Intelligent building technologies combine artificial intelligence, big data, the Internet of Things, and advanced sensing to improve the efficiency, quality, and sustainability of building construction, operation, and maintenance.

Crack detection on concrete surfaces is crucial for maintaining infrastructure and ensuring structural safety: detecting and repairing cracks in concrete structures early keeps them safe and reliable. Concrete structures such as bridges, buildings, and roads form the backbone of modern infrastructure. Cracks that are not detected or repaired in time may cause serious structural damage and even threaten life and property [1]. Early detection and treatment of cracks can also reduce maintenance and repair costs; chronic neglect may lead to far more expensive repair and recovery work [2]. Moreover, concrete cracks may allow moisture and harmful substances to penetrate, negatively affecting the surrounding environment, so effective detection helps to reduce environmental pollution [3]. Existing concrete surface crack detection methods generally fall into four directions: visual inspection [4], deep learning methods [5, 6], image processing techniques [7], and sensor technology [8].

Deep learning and image processing offer new hope for crack detection: they can improve detection accuracy and efficiency while reducing maintenance costs. The convolutional neural network (CNN) is one of the key factors behind deep learning's breakthroughs in the image domain. LeNet-5 [9], proposed by LeCun et al. in 1998, was an early CNN that laid the foundation for digit recognition. Subsequently, AlexNet [10] spurred deep learning research in image recognition by achieving a landmark victory in the 2012 ImageNet competition. Deep learning has since been widely used in image classification and object detection; important work includes Faster R-CNN [11], YOLO (You Only Look Once) [12], and Mask R-CNN [13], which achieved significant performance improvements in object detection through end-to-end training. Among these developments, the VGG-16 network, proposed by Karen Simonyan and Andrew Zisserman [14] in 2014, is an important milestone in deep learning. It is a 16-layer deep convolutional neural network composed of a series of convolutional, pooling, and fully connected layers. VGG-16 has shown broad potential in image processing and is widely used in image classification, achieving excellent performance on large-scale datasets such as ImageNet.

This paper proposes an improved deep learning model for the automatic detection and identification of concrete surface cracks. The model is based on a convolutional neural network with an attention mechanism and can effectively extract features of the concrete surface and classify them. Its advantage is that it requires neither manual annotation of crack locations nor complex image preprocessing, saving considerable time and resources. We verify the validity and robustness of the model through experiments on a publicly available concrete surface crack dataset. The contribution of this paper is a new way to address the crack problem in infrastructure maintenance and construction engineering, thereby improving safety and reducing maintenance costs.

2 Crack Recognition of the Concrete Surface Based on an Improved Convolutional Neural Network

In construction research, the accuracy of automatic identification and timely diagnosis of concrete surface cracks has attracted wide attention. To improve crack recognition accuracy, reduce the number of network parameters, and improve training efficiency, the VGG-16 neural network structure was improved. The MendeleyData-CrackDetection [15] concrete crack dataset was used; it contains crack and no-crack images covering dry shrinkage, plastic shrinkage, temperature, and external load cracks. The dataset comprises 40000 RGB images of 227 × 227 pixels in negative (without cracks) and positive (with cracks) classes, with 20000 images per category. The images vary in surface finish and illumination, and no random rotation, flip, or tilt data augmentation was applied. The dataset was divided 7:1 into a training set (35000 images) and a validation set (5000 images).
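
As a minimal sketch of this data preparation, the following Python snippet performs the 7:1 split described above. The Positive/Negative folder names follow the layout of the public Mendeley release, while the file extension, paths, and random seed are illustrative assumptions, not the authors' code.

```python
import random
from pathlib import Path

# Assumed layout of the MendeleyData-CrackDetection release:
# data/Positive/*.jpg (20000 crack images), data/Negative/*.jpg (20000 no-crack images)
root = Path("data")
images = [(p, 1) for p in (root / "Positive").glob("*.jpg")] + \
         [(p, 0) for p in (root / "Negative").glob("*.jpg")]

random.seed(0)                       # fixed seed so the split is reproducible
random.shuffle(images)

split = int(len(images) * 7 / 8)     # 7:1 split -> 35000 training / 5000 validation
train_set, val_set = images[:split], images[split:]
print(len(train_set), len(val_set))  # 35000 5000
```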

In concrete surface crack detection, the original VGG-16 network suffers from long training time and only moderate accuracy. Therefore, the network is improved by introducing a batch normalization layer [16], which maps the activation values of each layer to a distribution with mean 0 and variance 1, mitigating the vanishing gradient problem. Batch normalization improves the network convergence rate and reduces the number of iterations while maintaining the same accuracy [17]. The basic mathematical expressions for the method are as follows.

For the input m samples \(x_{1}\)~\(x_{m}\), the mean value is shown below in (1):

$$\mu =\frac{1}{m}{\sum }_{i=1}^{m}x_{i}$$
(1)

The variance is shown below in (2):

$${\sigma }^{2}=\frac{1}{m}{\sum }_{i=1}^{m}{(x_{i}-\mu )}^{2}$$
(2)

The normalized result is shown below in (3):

$${\widehat{x}}_{i}=\frac{x_{i}-\mu }{\sqrt{{\sigma }^{2}+\varepsilon }}$$
(3)

After the batch normalization operation, the mean of the data is adjusted to 0 and the variance to 1. The small constant ε is introduced to avoid division by zero when the variance equals 0. However, this operation may alter the feature distribution of the image, so the original feature distribution must be recoverable through a learnable scale transformation and offset operation. The specific mathematical expressions are given in formulas (4)–(6).

$$y_{i}={\gamma }_{i}{\widehat{x}}_{i}+{\beta }_{i}$$
(4)
$${\gamma }_{i}=\sqrt{Var(x_{i})}$$
(5)
$${\beta }_{i}=\text{E}(x_{i})$$
(6)

The parameters \({\gamma }_{i}\) and \({\beta }_{i}\) are obtained by learning during training, where E represents the mean and Var represents the variance function; choosing them as in (5) and (6) recovers the original feature distribution.
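
To make formulas (1)–(6) concrete, the following NumPy sketch implements the batch normalization forward pass, normalizing a mini-batch and then applying the learnable scale γ and offset β. This is an illustrative implementation under the definitions above, not the framework code used in the experiments.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                    # formula (1): per-feature mean
    var = ((x - mu) ** 2).mean(axis=0)     # formula (2): per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # formula (3): mean 0, variance 1
    return gamma * x_hat + beta            # formula (4): learnable scale and offset

x = np.random.randn(32, 64) * 3.0 + 5.0    # a batch with mean 5 and std 3
y = batch_norm_forward(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(), y.std())                   # approximately 0 and 1
```

Setting gamma = sqrt(Var(x)) and beta = E(x), as in formulas (5)–(6), would undo the normalization and recover the original distribution.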

2.1 LeakyReLU Activation Function and the P-ReLU Activation Function

Most convolutional neural network models use the ReLU function as the activation function after the convolutional layer. However, because of how ReLU is defined, whenever the pre-activation is negative the output is always 0, which causes gradients to vanish in subsequent training, a phenomenon known as neuron “death”. Considering this, some researchers adopt the LeakyReLU activation function to optimize convolutional models in practice; for example, Chen Mianshu et al. [18] selected LeakyReLU as the nonlinear activation function in a CNN-based image classification task. The mathematical expression of the P-ReLU activation function is shown in formula (7).

$$PReLU(x_{i})=\left\{\begin{array}{c}x_{i},\quad x_{i}>0\\ a_{i}x_{i},\quad x_{i}\le 0\end{array}\right.$$
(7)

This function is similar to the ReLU function: when the input is positive, the original value is output directly, and when the input is negative or zero, the value is multiplied by a coefficient. The Leaky ReLU function is the special case in which this coefficient is a small positive constant fixed before training begins, with a default value of 0.01 [18]. This keeps neurons with negative inputs active, avoiding neuron death while also increasing neuronal diversity.

However, using the Leaky ReLU activation function in network training also has drawbacks. Since the coefficient for negative inputs is fixed and not necessarily optimal for training, finding its best value requires many experiments. To address this issue, this study replaces the original ReLU activation function in the convolutional layers with the P-ReLU activation function. P-ReLU [19] is a learnable activation function that automatically adjusts the coefficient for negative inputs based on the training data.

Within the negative interval, the weight of the neurons is controlled by the parameter \(a_{i}\). Unlike the fixed weight of the LeakyReLU activation function, this weight is learned and dynamically adjusted during training. The subscript i indexes the channels: the activation function can either share one weight across all channels or assign each channel its own weight, with a default initial value of 0.25. Because the P-ReLU function still has a nonzero derivative at x < 0, it is non-saturating and avoids the vanishing gradient problem, so it effectively prevents neurons from dying in the negative interval, improving network performance and accelerating model convergence to a certain extent. Below is a comparison of the ReLU and P-ReLU activation functions [20] (see Fig. 1).

Fig. 1. Convolutional neural network using ReLU(x) versus P-ReLU(x)
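
The difference between the three activation functions can be seen in a few lines of PyTorch; this is a hedged illustration of the behavior discussed above, not the paper's training code.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)      # fixed slope for negative inputs
prelu = nn.PReLU(num_parameters=1, init=0.25)  # learnable slope, default init 0.25

print(relu(x))   # negative inputs clipped to 0, so their gradients vanish
print(leaky(x))  # negative inputs scaled by the fixed constant 0.01
print(prelu(x))  # negative inputs scaled by a parameter updated by backpropagation

# nn.PReLU(num_parameters=C) instead gives each of C channels its own slope a_i.
```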

This paper builds on the VGG-16 convolutional neural network and preserves its overall structure. A batch normalization (BN) layer was added after each convolution operation, and the original activation function was replaced with P-ReLU to address the neuron death problem. The last fully connected layer of the original model outputs 1000 labels, but this paper has only four categories, so the final Softmax classifier was changed to four labels and the number of fully connected layers was reduced to two: the first fully connected layer has dimensionality 4096 and the second has dimensionality 4, corresponding to the four concrete crack classes. These adjustments simplify the network structure and improve recognition efficiency. The structure of the improved convolutional neural network (CNN) model is shown below (see Fig. 2).

Fig. 2. The structure of the improved convolutional neural network (CNN) model

In the modified VGG-16 convolutional neural network model, the key parameters of each layer are shown in the table below (see Table 1).

Table 1. Parameters of the improved VGG-16 neural network model
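
The following PyTorch sketch illustrates the modified architecture: VGG-16-style convolution blocks with a BN layer and P-ReLU after each convolution, two fully connected layers, and four output labels. The channel counts here follow the standard VGG-16 configuration; the authoritative per-layer parameters are those in Table 1, so this should be read as an illustration rather than the authors' exact model.

```python
import torch
import torch.nn as nn

# Standard VGG-16 convolutional configuration; 'M' marks a 2x2 max pooling layer.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.BatchNorm2d(v),   # BN after each convolution
                       nn.PReLU(v)]         # learnable slope per channel
            in_ch = v
    return nn.Sequential(*layers)

class ImprovedVGG16(nn.Module):
    def __init__(self, num_classes=4):       # four crack categories
        super().__init__()
        self.features = make_features(CFG)
        self.pool = nn.AdaptiveAvgPool2d(7)   # 227x227 input -> 7x7 feature map
        self.classifier = nn.Sequential(      # two FC layers instead of three
            nn.Linear(512 * 7 * 7, 4096), nn.PReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_classes))     # Softmax is applied inside the loss

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

model = ImprovedVGG16()
logits = model(torch.randn(1, 3, 227, 227))   # logits for the 4 classes
```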

After computing the loss function, the Adam optimizer [21] was adopted to update the network parameters during training. Compared with traditional stochastic gradient descent (SGD), the Adam optimizer combines first- and second-moment estimates of the gradient, so the model converges faster and training efficiency improves. The mathematical expression of the Adam optimization algorithm is shown in formulas (8)–(12).

$${m}_{t}=\mu *{m}_{t-1}+(1-\mu )*{g}_{t}$$
(8)
$${n}_{t}=v*{n}_{t-1}+(1-v)*{g}_{t}^{2}$$
(9)
$${\widehat{m}}_{t}=\frac{{m}_{t}}{1-{\mu }^{t}}, \quad {\widehat{n}}_{t}=\frac{{n}_{t}}{1-{v}^{t}}$$
(10)
$$\Delta {\theta }_{t}=-\eta *\frac{{\widehat{m}}_{t}}{\sqrt{{\widehat{n}}_{t}}+\varepsilon }$$
(11)
$${\theta }_{t+1}={\theta }_{t}+\Delta {\theta }_{t}$$
(12)

\({\theta }_{t+1}\) represents the weights of the neural network model at iteration t + 1, and \({g}_{t}\) is the gradient at step t. Meanwhile, \({m}_{t}\) and \({n}_{t}\) represent the first- and second-moment estimates of the gradient, respectively, and \({\widehat{m}}_{t}\) and \({\widehat{n}}_{t}\) are their bias-corrected counterparts. η is the learning rate, and μ, v, and ε are hyper-parameters, usually set to μ = 0.9, v = 0.999, and ε = \({10}^{-8}\).
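
A minimal NumPy implementation of the update rule in formulas (8)–(12), using the default hyperparameter values stated above; this is a sketch for illustration, not the optimizer implementation used in training.

```python
import numpy as np

def adam_step(theta, g, m, n, t, lr=0.001, mu=0.9, v=0.999, eps=1e-8):
    """One Adam update for weights theta given the current gradient g."""
    m = mu * m + (1 - mu) * g     # formula (8): first-moment estimate
    n = v * n + (1 - v) * g**2    # formula (9): second-moment estimate
    m_hat = m / (1 - mu**t)       # formula (10): bias corrections
    n_hat = n / (1 - v**t)
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)  # formulas (11)-(12)
    return theta, m, n

# Toy example: minimize f(theta) = theta^2 starting from theta = 5
theta, m, n = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    theta, m, n = adam_step(theta, 2 * theta, m, n, t, lr=0.05)
print(theta)  # close to the minimum at 0
```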

2.2 Setting of the Hyperparameters of the Network Model

In neural network training, the choice of learning rate is crucial. Too large a learning rate may cause oscillation and extend training time, while too small a learning rate may lead to slow convergence and convergence to a local optimum. Rational selection and adjustment of the hyperparameters is therefore a critical task. After multiple rounds of training and adjustment, the hyperparameters of the original VGG-16 and the improved model were set as follows (see Table 2 and Table 3):

Table 2. The hyper-parameters of the original VGG-16 model
Table 3. The hyper-parameters of the improved VGG-16 model
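
Since the tables themselves are not reproduced here, the following snippet only shows where such hyperparameters enter a PyTorch training setup; the learning rate and batch size below are placeholders, not the values from Table 3, while the 120 epochs and the Adam moment coefficients match those reported in the text.

```python
import torch

LEARNING_RATE = 1e-4   # placeholder; the actual value is the one in Table 3
BATCH_SIZE = 64        # placeholder; the actual value is the one in Table 3
EPOCHS = 120           # the experiments below use 120 rounds of training

model = ImprovedVGG16()  # the illustrative model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = torch.nn.CrossEntropyLoss()
```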

3 Experimental Results and Analysis

Model evaluation focuses on the following aspects: training accuracy, the rate of decline of the loss function, the model convergence rate, and the presence of oscillation. We analyze the effect of the improvement strategy on the experimental results from the perspective of training accuracy. The accuracy reflects the model's ability to identify concrete surface cracks, the loss value reflects the model's error level in identifying cracks, and the degree of oscillation reflects the model's stability and susceptibility to gradient explosion. The following figures show the training accuracy of the original VGG-16 model (see Fig. 3) and the improved convolutional neural network (see Fig. 4).

Fig. 3. Prediction accuracy of the original VGG-16 model versus the number of training epochs

Fig. 4. Prediction accuracy of the improved VGG-16 model versus the number of training epochs

As the figures above show, after 120 rounds of training, the convergence rate and training accuracy of the improved model are significantly better than those of the unimproved model. The batch normalization module and the P-ReLU activation function effectively improve the convergence rate of the network, avoid large oscillations, reduce the risk of overfitting, and improve the model's generalization ability. In contrast, the unimproved VGG-16 model, burdened by its larger number of parameters, did not perform as well in training accuracy. In terms of convergence rate, the original model starts to converge at around round 15, while the improved model approaches convergence faster, at around round 10. Beyond training accuracy and convergence speed, it is also necessary to examine the behavior of the loss function.

Comparing the loss curves of the original VGG-16 model and the improved convolutional neural network, the loss values of both models decrease rapidly at the beginning of training and eventually drop to nearly zero, but the improved model's loss decreases faster. Although the improved model showed local oscillations at the start of training, and thus slightly worse stability, it is superior when considering the magnitude of the loss and the overall reduction of the training loss. The loss value changes during training are shown for the original VGG-16 network model (see Fig. 5) and the improved convolutional neural network model (see Fig. 6).

Fig. 5. Loss values of the original model versus the number of training epochs

Fig. 6. Loss values of the improved model versus the number of training epochs

When the network model is trained with stochastic gradient descent (SGD), the convergence rate is significantly slower than with the Adam optimization algorithm. In our experiment, the network was still converging when training ended: the model never fully converged and would require more training rounds, leading to a significantly longer training time. The final training accuracy was 91.1%, lower than the 98.7% of the improved model trained with Adam. The loss value also fluctuated repeatedly within a certain range until the end of training, never approaching zero. It is therefore necessary to choose an optimization algorithm appropriate to the network model and the actual dataset to obtain good results.

In summary, an improved deep convolutional neural network model based on VGG-16 was built to identify four types of concrete surface cracks, and its training accuracy reached 98.7%. The improvements include introducing a batch normalization module after each convolutional layer to improve the convergence rate, and replacing the ReLU activation function of the original network with the P-ReLU activation function to mitigate the vanishing gradient and neuron death problems that ReLU causes during training.

4 Conclusion

For China, a country with vast infrastructure, timely and effective monitoring of surface cracks in buildings is crucial, as it bears on the safety of people's lives and property. Traditional methods, however, are often time-consuming and laborious and fail to meet practical needs. The emergence of deep learning techniques offers a promising solution: convolutional neural network models are widely used to identify and classify concrete surface crack images, providing reference data for process improvement and maintenance. LeNet-5 was one of the earliest convolutional neural network models, followed by AlexNet, VGG, GoogLeNet, ResNet, and others. These models keep getting deeper, but depth also brings problems such as saturating training accuracy, growing memory footprint, and increasing parameter counts. Therefore, when selecting a network model, we need to weigh network structure, training time, memory occupancy, and parameter count to find the model best suited to this study. In this paper, we improved the original VGG model structure by adding batch normalization layers and the P-ReLU activation function and by replacing the SGD optimizer with the Adam algorithm, thereby increasing the convergence speed.