Introduction

Civil infrastructure systems have removed many barriers from daily life, helping society function by reducing the cost and time of transportation. The most important examples of civil infrastructure are bridges, roads and buildings. Each of these serves a specific purpose: bridges carry traffic over rivers, roads enable long-distance transportation, and buildings provide places to live, work and so on. Throughout the life cycle of an infrastructure, the harsh environment exerts a wide variety of negative impacts, caused by earthquakes, snow and vehicle loads, and temperature variations, which produce visible and/or invisible damage. The most challenging consequence of these factors is cracking. Once a crack appears at a certain point of an infrastructure, it is very important to inspect that area at regular intervals or whenever a significant event (such as an earthquake) takes place.

Manual inspection for crack detection is time consuming, subjective (Rafiei et al. 2017) and hazardous when performed high above the ground, as is the case for bridge inspections (Alipour et al. 2019). To minimize the inspection hazard, drones are used for image capture. Because the captured images also require considerable time to inspect manually, computer vision methods can be used to reduce this effort. There are three types of computer vision methods: image processing, machine learning and deep learning. Image processing methods involve edge detection (Nayyeri et al. 2019), thresholding and morphology. Although these methods are practical and simple, they are not suitable for crack detection in images with a complex background, where the resulting accuracy is low. Combining image processing with machine learning yields better crack detection accuracy, but such methods are still not fully automatic and require manual parameter tuning for different kinds of images. Support Vector Machines (SVM), K-nearest neighbors (KNN) and Artificial Neural Networks (ANN) are examples of traditional machine learning methods; each algorithm has its own strengths and weaknesses (Božić 2024).
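As an illustration of the classical image-processing route mentioned above, a minimal MATLAB sketch combining Sobel edge detection, Otsu thresholding and a simple morphological cleanup is given below. The file name and the 20-pixel area threshold are placeholders chosen for illustration; this is not the method proposed in this paper.

```matlab
% Rough sketch of a classical crack-detection pipeline (illustrative only;
% 'crack.jpg' and the 20-pixel area threshold are placeholder values).
I     = imread('crack.jpg');                     % RGB crack image
gray  = rgb2gray(I);                             % intensity image
bw    = ~imbinarize(gray, graythresh(gray));     % Otsu threshold; cracks appear dark
edges = edge(gray, 'sobel');                     % Sobel edge map
mask  = bwareaopen(bw | edges, 20);              % merge the two cues, drop tiny blobs
imshow(mask)
```

On images with a complex background, a pipeline like this tends to produce many false detections, which is precisely the limitation that motivates the learning-based approaches discussed next.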

There are four main types of learning algorithms (Liu et al. 2024):

  • Supervised. The most commonly used class of algorithms. These algorithms operate on previously labelled data; the labelling operation is time consuming.

  • Unsupervised. This class of algorithms is used when the data are not labelled, being suitable for tasks such as clustering.

  • Semi-supervised. This class of algorithms uses a large amount of unlabelled data and a small amount of labelled data.

  • Reinforcement. These algorithms are used in applications where, depending on the action type, a penalty or a reward can be obtained.

A high crack detection accuracy is obtained using deep learning methods. In deep learning, deep neural networks automatically learn and extract the necessary features. Deep learning uses multiple layers to process information and, during this process, learns from the data it receives. Other important aspects of deep learning are:

  • Multi-layered processing. The input data pass through multiple layers of interconnected nodes, extracting complex features.

  • Representation learning. Because deep learning excels at automatically learning representations, there is no need for manual feature extraction, which reduces the required time compared with traditional machine learning.

  • Applications. Deep learning can be used for image recognition in a wide variety of domains such as agriculture (Hu 2022), transportation (land, air or sea), medicine and civil infrastructure.

The main challenges of deep learning are computational cost, interpretability and data dependency (Pedrycz and Chen 2020). Deep neural networks are built from interconnected layers of artificial neurons; in the case of images, features are extracted in each layer, with every neuron processing and passing on the information it receives. When convolutional neural networks (CNNs) are used in computer vision, excellent results are obtained in image classification, object recognition and image segmentation.

The specific contributions of this paper include the following:

  • Developing a technique which reduces the size on disk of the image dataset.

  • Developing a technique which increases the accuracy of the Visual Geometry Group (VGG) networks on civil infrastructure images affected by noise.

  • Developing a technique which reduces the computational time for civil infrastructure images affected by noise.

The paper is organized as follows: Section 2 presents the current state of the art. Section 3 describes the methods, covering the architectures of VGG16 and VGG19 and the mathematical morphological operators (MMOs). Next, Section 4 describes the proposed algorithm for noise removal using MMOs, along with the obtained results. Finally, Section 5 presents the conclusions and the future improvements to be considered.

Related work

Using image processing techniques, certain types of corrosion and cracks in civil infrastructure can be monitored over its lifetime (Dumitru et al. 2023). The procedures used for this task are Markov random field segmentation, K-means clustering and Otsu thresholding. For a grayscale image, Otsu thresholding returns a binary image, where the value 1 marks foreground pixels and the value 0 marks background pixels.

Another crack detection method operates at pixel level in two steps: crack recognition and crack semantic segmentation (Zhang et al. 2024). For crack detection, the authors used a VGG16 model and, for pixel-level semantic segmentation, a Unet++ model. This method was tested on 8064 images of civil infrastructure cracks, each with a resolution of 224x224, and achieved a mean detection speed of 8.1 ms per image. Using five datasets for training and testing in the classification experiments, the authors obtained accuracies ranging from 82.70% to 99.90%.

Using a Mask Region-based Convolutional Neural Network (Mask R-CNN) model to detect pavement cracks in roads, the authors of Shomal Zadeh et al. (2024) obtained an accuracy ranging from 95% to 99%. The paper analyzed the cracks that can appear in public roads in two directions: longitudinal and transversal. The model used can complete a single task at a time, such as segmentation or detection.

Based on a dataset containing 12,000 images of asphalt cracks (minor fatigue, longitudinal, diagonal, transversal and major fatigue), three types of approaches were tested in Simonyan and Zisserman (2014): Light Gradient Boosting Machine, Convolutional Neural Network and Deep Neural Network (DNN). The best accuracy of 99.10% was obtained for the detection of longitudinal cracks, and the lowest accuracy of 96.80% was obtained for images without cracks.

Cracks in civil structures take many forms: irregular, longitudinal, transversal or block. To determine the severity of a crack, its width must be measured. Among the algorithms used for this task are subpixel edge detection and Sobel edge detection (Rajab et al. 2024).

Another approach for detecting concrete cracks and spalls consists in using the You Only Look Once v8 (YOLOv8) network together with ByteTrack and Supervision (Yang and Ji 2021). With this approach, the results are better than using YOLOv8 alone for crack and spall detection. ByteTrack applies a state-of-the-art tracking algorithm to the output of YOLOv8 and feeds the tracking data back into the YOLOv8 model, which improves its recognition. The Supervision model is trained using labelled and pseudo-labelled data. The ByteTrack model achieved a crack detection accuracy of 90% and Supervision achieved 85%, but used together they achieved an accuracy of 95%.

For images acquired under poor illumination, crack detection can be based on pixel intensity information. Methods using this approach include percolation theory, edge-based and graph-theory methods (Lv et al. 2023).

Visual Geometry Group-16 (VGG16) used together with Principal Component Analysis (PCA) (Hoang and Nguyen 2023) on a dataset containing over 3000 images of six rock types achieved a classification accuracy above 90%. By employing Visual Geometry Group-19 (VGG19) (Hanin 2019) on a dataset comprising two categories of images, i) positive (images of building sections affected by cracks) and ii) negative (images of building sections without cracks), a classification accuracy of 92.2% was obtained.

Methods

Excellent results in removing salt and pepper noise from images are obtained using MMO filters (Luo et al. 2022). These filters act directly on the pixel values of an image. Fig. 1 shows the effect of salt and pepper noise on a civil infrastructure image that presents cracks.

Fig. 1

Salt and pepper noise. (Left) Original image. (Right) Zoomed section obtained with our method in the MATLAB 2023b simulation environment

To better highlight the effect of salt and pepper noise in an image containing cracks, a section of the original image is zoomed in Fig. 1 (Right). The pixels affected by noise are colored black.

In the specialized literature, three types of noise occur in images: impulse, structured and photoelectronic. Salt and pepper noise is an example of impulse noise. It is especially visible in binary images, where white pixels appear in dark regions and dark pixels appear in white regions. Other noise in images can be caused by equipment malfunction or by Gaussian noise resulting from compression.
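For reference, salt and pepper noise of the kind shown in Fig. 1 can be simulated in MATLAB with the imnoise function. In the sketch below, the file name and the 2% noise density are illustrative values, not the noise level present in our dataset.

```matlab
% Add synthetic salt and pepper noise to a clean image
% ('bridge_crack.jpg' and the 0.02 density are placeholder values).
clean = imread('bridge_crack.jpg');
noisy = imnoise(clean, 'salt & pepper', 0.02);   % corrupt about 2% of the pixels
imshowpair(clean, noisy, 'montage')
```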

There are four types of MMO that can be used as filters: erosion, dilation, closing and opening. Closing and opening are combinations of erosion and dilation: when a closing filter is applied to an image, dilation is performed first, followed by erosion; when an opening filter is applied, erosion is performed first, followed by dilation. Figure 2 highlights the pixel-level effect of applying the dilation filter: the black pixels, considered as noise, disappear. To highlight this operation, a section showing the effect of the dilation filter is zoomed in Fig. 2 (Right). The filter was applied to the image in Fig. 1.

Fig. 2

Salt and pepper noise removal. (Left) Original image. (Right) Zoomed section obtained with our method in the MATLAB 2023b simulation environment

To remove noise using MMO filters, one needs to find the best shape and size of the structuring element (SE). By shape, the SE can be a line, a circle, a square and so on; by size, it can be a 2x2 square, a 3x3 square and so on. Equation 1 defines the opening operator applied to an image:

$$\begin{aligned} A \circ B = (A \ominus B) \oplus B \end{aligned}$$
(1)

where,

A is the original image, \(\circ \) is the symbol for opening, \(\ominus \) is the symbol for erosion, B is the SE, and \(\oplus \) is the symbol for dilation.

Equation 2 represents the closing operator applied to an image:

$$\begin{aligned} A \bullet B = (A \oplus B) \ominus B \end{aligned}$$
(2)

where \(\bullet \) is the symbol for the closing operation.
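The two operators in Eqs. 1 and 2 map directly onto Image Processing Toolbox functions, as sketched below. The 3x3 square SE and the file name are only examples and are not the settings used in our experiments.

```matlab
% Morphological opening (Eq. 1) and closing (Eq. 2) with a sample SE.
se     = strel('square', 3);          % example structuring element B
A      = imread('noisy_crack.jpg');   % image A (placeholder file name)
opened = imopen(A, se);               % (A erode B) dilate B  -> removes small bright specks
closed = imclose(A, se);              % (A dilate B) erode B  -> removes small dark specks

% Composing the primitive operators gives the same result as imopen:
openedManual = imdilate(imerode(A, se), se);
isequal(opened, openedManual)         % returns logical 1 (true)
```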

CNNs are useful for complex images due to their convolutional kernels and numerous network parameters. The number of epochs and the batch size are important parameters during training, validation and testing.

To implement our proposed approach, we used two convolutional neural networks, VGG16 and VGG19. Both pre-trained image recognition models were developed at the University of Oxford (Rouf et al. 2024) (see Table 1 for their architectures). The input resolution for both models is 224x224 pixels with red, green and blue (RGB) channels. The models are characterized by small 3x3 convolutional filters stacked in sequence; the two architectures differ in the number of convolutional layers per block (see Table 1).

Table 1 Architecture of VGG16 and VGG19

The convolution kernels are 3x3 with a 1-pixel stride and 1-pixel padding, which maintains the spatial resolution after convolution. Max pooling uses a 2x2 pool size with a stride of 2 pixels.
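As a minimal sketch of this convolution/pooling pattern, the first block of a VGG-style network can be written with MATLAB's Deep Learning Toolbox layers as follows. This is only the first block of the standard VGG16 definition, not the pre-trained network used in our experiments.

```matlab
% First convolutional block of a VGG-style network:
% 3x3 kernels, stride 1, 'same' padding keep the 224x224 spatial size,
% then 2x2 max pooling with stride 2 halves it to 112x112.
block1 = [
    imageInputLayer([224 224 3])
    convolution2dLayer(3, 64, 'Stride', 1, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 64, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    ];
analyzeNetwork(block1)   % optional: inspect the layer output sizes
```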

Table 2 lists the number of parameters of VGG16 and VGG19.

Table 2 Parameters of VGG16 and VGG19

Pre-trained networks are usually trained on large and complex datasets for specific tasks such as image recognition (Ai et al. 2023). During training, the model parameters are adjusted to optimize the loss function, which measures the error between the actual labels and the values predicted by the algorithm. When an in-house dataset is used, the output is compared with the ground-truth test labels, yielding the following measurements: accuracy, recall and precision. The performance of the model can be further assessed on images that are not part of this dataset.

Fig. 3

Flowchart of the proposed algorithm

Equation 3 defines the accuracy of the system, which is the ratio between the number of correctly recognized images and the total number of images (correctly plus incorrectly recognized):

$$\begin{aligned} Accuracy=\frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(3)

where:

TP (true positive) is the number of cracked images correctly recognized as cracked.

TN (true negative) is the number of non-cracked images correctly recognized as non-cracked.

FP (false positive) is the number of non-cracked images incorrectly recognized as cracked.

FN (false negative) is the number of cracked images incorrectly recognized as non-cracked.

Table 3 Dataset size on disk

Equation 4 defines the precision, which is the ratio between true positive detections and the sum of true positive and false positive detections:

$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(4)

Equation 5 defines the recall, which is the ratio between true positive detections and the sum of true positive and false negative detections:

$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(5)
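A short worked example of Eqs. 3-5 is given below. The confusion counts are hypothetical values chosen for illustration only; in practice they would be obtained from the confusion matrix of the trained network on the test set (e.g., with confusionmat applied to the true and predicted labels).

```matlab
% Worked example: metrics in Eqs. 3-5 from hypothetical confusion counts
% (the numbers are illustrative, not results reported in this paper).
TP = 950; TN = 930; FP = 70; FN = 50;
accuracy  = (TP + TN) / (TP + TN + FP + FN)   % Eq. 3 -> 0.94
precision = TP / (TP + FP)                    % Eq. 4 -> ~0.931
recall    = TP / (TP + FN)                    % Eq. 5 -> 0.95
```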

The workflow of a DNN is as follows:

  1. Data acquisition. More data of higher quality result in higher accuracy.

  2. Preprocessing. At this stage the dataset has been built and we seek to highlight the important features. To this end, it is necessary to clean the data and to handle and scale the real-valued features.

  3. Building the model. The model consists of an input layer, which can be an image of a certain size, a number of hidden layers and an output layer.

  4. Classification. To identify each class among the class labels, a softmax function (Crognale et al. 2023), a ubiquitous helper function, is used to calculate the class probabilities.

  5. Learning. Rectified Linear Unit (ReLU) activation functions (Hanin 2019) are used inside the hidden layers; they return 0 if the input is negative and the input value itself if it is positive (both activation functions are illustrated in the sketch after this list).
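The two activation functions mentioned in steps 4 and 5 can be expressed in MATLAB as one-line anonymous functions; the sketch below is purely illustrative of their definitions.

```matlab
% Minimal definitions of the ReLU and softmax activation functions.
relu       = @(x) max(x, 0);                                   % passes positives, zeroes out negatives
softmaxFcn = @(z) exp(z - max(z)) ./ sum(exp(z - max(z)));     % numerically stable softmax

relu([-2 -0.5 0 1.5])       % returns [0 0 0 1.5]
softmaxFcn([2; 1; 0.1])     % returns class probabilities that sum to 1
```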

Algorithm description

The proposed method is based on the workflow in Fig. 3.

Deep transfer learning was implemented in the MATLAB 2023b simulation environment on an Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz (2.40 GHz), a 64-bit operating system with an x64-based processor, running Windows 10 Pro.

The image dataset consists of 56,000 cracked and non-cracked images of civil infrastructure (Maguire et al. 2018): bridges, roads and pavements. The images include obstructions such as shadows, edges and surface roughness. A total of 2582 images are affected by salt and pepper noise. Each image has a resolution of 256x256 with RGB channels.

Table 4 Training parameters used in our method

The input image size for VGG16 and VGG19 is 224x224 with RGB channels. To use the dataset in our proposed methodology, it is necessary to resize the dataset images from 256x256 to 224x224, keeping the RGB channels.
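One way to perform this resizing on the fly in MATLAB is with an augmentedImageDatastore, as sketched below. The folder path and the 80/20 split are illustrative assumptions, not the exact configuration reported here.

```matlab
% Load the crack dataset and resize 256x256 RGB images to 224x224 on the fly.
% 'datasetFolder' is a placeholder path with one subfolder per class.
imds = imageDatastore('datasetFolder', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized');   % 80/20 split (illustrative)
augimdsTrain = augmentedImageDatastore([224 224 3], imdsTrain);
augimdsTest  = augmentedImageDatastore([224 224 3], imdsTest);
```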

The SE used in our work has the shape of a rectangle of size 2. As a result of the noise removal algorithm, the dataset size on disk was reduced as presented in Table 3. SEs of different sizes and shapes were tested on the dataset images of civil infrastructure cracks, and the table lists the one with the highest performance, that is, the lowest size on disk.
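A sketch of the per-image noise-removal step with a small rectangular SE is shown below. The folder name is a placeholder and the [2 2] dimensions are one possible interpretation of the "rectangle of size 2" SE; they may differ from the exact element we used.

```matlab
% Apply the opening filter with a small rectangular SE to every noisy image
% in a folder ('noisyImages' is a placeholder; [2 2] interprets the
% "rectangle of size 2" SE and is an assumption).
se    = strel('rectangle', [2 2]);
files = dir(fullfile('noisyImages', '*.jpg'));
for k = 1:numel(files)
    src = fullfile(files(k).folder, files(k).name);
    I   = imread(src);
    imwrite(imopen(I, se), src);   % opening = erosion then dilation; re-saving reduces size on disk
end
```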

Table 4 details the training parameters used in our paper for VGG16 and VGG19. We tested two optimizers, Adam and stochastic gradient descent (SGD); the Adam optimizer obtains better performance than SGD. We also set the number of epochs to 3, because beyond this value the accuracy does not increase substantially while the computational time does.
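A transfer-learning sketch consistent with this setup (Adam optimizer, 3 epochs, VGG16 with a two-class head) is given below. The mini-batch size and learning rate are illustrative assumptions rather than the values in Table 4, and augimdsTrain is the resized training datastore from the earlier sketch.

```matlab
% Transfer learning with VGG16 for two classes (cracked / non-cracked).
% MaxEpochs and the Adam optimizer follow the text; MiniBatchSize and
% InitialLearnRate are assumed values for illustration.
net    = vgg16;                          % pre-trained VGG16 (support package required)
layers = [
    net.Layers(1:end-3)                  % keep the convolutional base and fc6/fc7
    fullyConnectedLayer(2)               % new 2-class head
    softmaxLayer
    classificationLayer
    ];
options = trainingOptions('adam', ...
    'MaxEpochs', 3, ...
    'MiniBatchSize', 32, ...
    'InitialLearnRate', 1e-4, ...
    'Verbose', false);
netTransfer = trainNetwork(augimdsTrain, layers, options);
```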

Table 5 details the accuracy metrics obtained using VGG16, VGG19 and the MMOs. The accuracy metrics are highest when the opening filter is applied to the dataset, and the elapsed time is also lower than in the other scenarios presented.

Table 5 Accuracy metrics obtained using training options from Table 4

Table 6 presents the classification performance on the original dataset and after noise removal using MMOs. The improvement is determined by the noise reduction: salt and pepper noise, which alters the pixel values of the images, is reduced by the MMOs, which act directly on those pixel values. The positive predictive value (PPV) is the ratio between the images correctly detected as cracked and the total number of images predicted as cracked; a ratio of 100 percent means that all positive predictions were correct. The negative predictive value (NPV) is the ratio between the images correctly detected as non-cracked and the total number of images predicted as non-cracked.

Table 6 Summary of classification performance metrics

The efficiency of the other MMOs in terms of performance metrics is shown in Table 7. As one can observe, not all MMOs used as filters on noise-affected images of civil infrastructure cracks improve the efficiency.

Table 7 Efficiency of accuracy on VGG16

Conclusions

In this paper, we explored the efficiency of mathematical morphological operators on a noisy dataset using the pre-trained neural networks VGG16 and VGG19. The results showed that, in terms of accuracy metrics, using MMOs yields better results than using VGG16 and VGG19 alone. The highest accuracy was obtained using VGG16 with the opening operator, and the processing time in this case is also lower than that of VGG16 and VGG19 without the MMO filters.

In conclusion, our paper showed that using MMOs with pre-trained neural networks on a noisy dataset yields better results, making them a useful tool for crack detection in civil infrastructures.

Because the cracks affecting civil infrastructures are important, for future work we propose to identify the size of each crack and to classify its importance according to that size.