Domain adversarial training for classification of cracking in images of concrete surfaces

The development of automatic methods to recognize cracks in concrete surfaces has been a focus of research in recent years, first through computer vision methods and more recently through convolutional neural networks, which are delivering promising results. Challenges persist in crack recognition, mainly due to the confusion added by the myriad of elements commonly found on concrete surfaces. These methods could become robust to such elements if correspondingly heterogeneous datasets were available. Even so, this would be a cumbersome methodology, since training would be needed for each particular case and models would be case dependent. Thus, efforts from the scientific community are focusing on generalizing neural network models to achieve high performance on images from domains slightly different from those on which they were trained. The generalization of networks can be achieved by domain adaptation techniques at the training stage. Domain adaptation enables finding a feature space in which features from both domains are invariant, and thus, classes become separable. The work presented here proposes the DA-Crack method, a domain adversarial training method, to generalize a neural network for recognizing cracks in images of concrete surfaces. The domain adversarial method uses a convolutional extractor followed by a classifier and a discriminator, and relies on two datasets: a labeled source dataset and a small unlabeled target dataset. The classifier is responsible for the classification of randomly chosen images, while the discriminator is dedicated to uncovering to which dataset each image belongs. Backpropagation from the discriminator reverses the gradient used to update the extractor.
This counteracts the convergence promoted by the updates backpropagated from the classifier, thus generalizing the extractor and enabling it to recognize cracks in images from both source and target datasets. Results show that the DA-Crack training method improved accuracy in crack classification of images from the target dataset by 30 percentage points (from 54% to 84%), while accuracy on the source dataset remains unaffected.

also used to provide measures of the cracking width at selected points. This is a cumbersome task, highly prone to error and consuming of human resources.
The automation of such processes is sought by the managing authorities due to its potential to increase the detection rate of such defects while decreasing the resources allocated to the task. Thus, the automatic detection of cracks is a relevant research question. Several approaches have been carried out using image processing, particularly in the last two decades, to automatically scan concrete surface photographs. Some authors have explored image processing techniques, such as binarization and edge detection algorithms (Abdel-Qader et al., 2006; Fujita & Hamamoto, 2009; Valença et al., 2012), image clustering (Oliveira Santos et al., 2019), or region-growing techniques for crack segmentation (Yamaguchi & Hashimoto, 2009). Although fairly good results have been achieved, these are highly dependent on local conditions such as lighting, homogeneity of brightness, presence of other elements, or even the randomness of the starting regions in processing. Following the pioneering work of Krizhevsky et al. (2012), convolutional neural networks (CNNs) gained a lot of attention in a wide range of fields, including crack classification in images of concrete surfaces (Hamishebahar et al., 2022). Cha et al. (2017) presented a new architecture specifically developed for the task of crack classification in small patches cropped from a wide-angle image. Since the model is only able to classify at the patch level, the authors use a dataset of labeled patches (crack versus no crack) for training. For the prediction, the full image is scanned through a sliding window process to classify small windows iteratively over the full-size image. Kim and Cho (2018), da Silva and de Lucena (2018), and Qu et al. (2020) used the same scheme, but took advantage of off-the-shelf models with transfer learning, using a binary-labeled public dataset (Ozgenel, 2018) to train only the classifier part of the network.
To evaluate the performance of off-the-shelf CNNs, some authors (Ali et al., 2021; Özgenel & Gönenç Sorguç, 2018) compared commonly used pre-trained CNNs on the task of crack classification. These approaches showed fairly good results, overcoming the main difficulties found in the image processing methods, with accuracy values above the 95% threshold; however, these performances are achieved in favorable conditions, that is, on a test subset reserved a priori from the same dataset as the training subset. This is indeed an unseen test set, but its images are in principle from the same domain as the training subset. Despite the problem of poor generalization of CNNs, recent works are more focused on detection methods. These include You Only Look Once (YOLO) (Park et al., 2020), region proposal convolutional neural networks (R-CNN) (Saleem et al., 2021), or single-shot detection (SSD) (Jiang et al., 2021). These methods are aimed at detecting cracking while locating it in full-size images, avoiding the computationally expensive sliding window method, which is an important step. However, they do not focus on the main problem of classification and the generalization of the model. To enforce generalization of the models, Cha et al. (2018) and Deng et al. (2019, 2021) decided to include more classes at the training stage; Jang et al. (2018) used hybrid images, including infra-red and continuous-wave line laser with visible light images; while Chen and Jahanshahi (2018) used a Naive-Bayes data fusion scheme. These approaches allowed for good results in crack classification, but they require richer inputs from the image database, the first requiring a dataset with more labels and the others demanding more dimensions on the data, with direct impacts on the acquisition procedure.
Irrespective of the technique used to train a CNN for crack classification, the models usually achieve high accuracy. However, they frequently fail to generalize to images slightly different from those on which they were trained, showing a relevant drop in accuracy. This can be overcome by retraining the network on the new dataset, although this requires a new labeled dataset and a new training process, which are cumbersome tasks. Domain adaptation is a technique that can be used to generalize a CNN, making use of a smaller and unlabeled dataset (Ganin et al., 2016; Pinheiro, 2018) combined with a richer labeled dataset. This method can help align domains based on the invariant features of the source and target datasets, labeled and unlabeled, respectively.
To achieve this goal, the domain adversarial training for crack (DA-Crack) method is proposed, based on an adversarial architecture to train the model using a labeled source dataset and an unlabeled target dataset. The DA-Crack method is expected to increase the accuracy in classifying image patches from the target dataset, while keeping the accuracy in the source dataset.
This study is organized as follows: this introductory section encloses a brief presentation of the problem at hand and sets the objective. Section 2 summarizes the most relevant related work. This is followed by Sec. 3 in which the method is described. The results are presented in Sec. 4, starting with the metrics used to evaluate the performance of the method followed by the presentation and discussion of the results. Finally, the main conclusions are presented in Sec. 5.

Related work
Deep learning methods have gained a lot of attention in recent years from the research community working on image-based automation of crack inspection in concrete structures (Dong & Catbas, 2020). These years have been rich in new approaches, with some authors presenting personalized architectures of neural networks (Cha et al., 2017), while others take full advantage of off-the-shelf architectures (Qu et al., 2020). Personalized architectures seem more efficient for crack classification, dispensing with complex and deep networks. However, crack classification presents results above 95% only on datasets very similar to those on which the networks are trained (Ali et al., 2021; da Silva & de Lucena, 2018). Thus, the convolutional neural networks implemented for crack classification do not generalize enough, which undermines their main purpose, classifying cracks in images of concrete surfaces generically.
The generalization of deep learning networks may be achieved through domain adaptation techniques, which can be applied to several fields of expertise, for example, in crack classification on images of pavement (Liu et al., 2021). Domain adaptation refers to a set of techniques aimed at generalizing descriptive models (Ben-David et al., 2010) including neural networks trained on a source and a target dataset enabling it to classify images from both domains, even with a slight variance shift between them (Wang & Deng, 2018). This model requires training on both a source domain, taking advantage of a huge labeled source dataset, and a target dataset from which only unlabeled patches are available. The objective of the domain adaptation technique is to find a latent space in which both the source and target domains share the invariant features between classes. The domain adversarial neural network (DANN) (Ganin et al., 2016) is able to converge to that latent space, by training the model to classify the images from the source dataset, at the same time as it learns to discriminate images between domains. A gradient reversal layer is applied to the loss backpropagated from the discriminator to the feature extractor, as in Fig. 1. This constitutes an adversarial network that converges towards the optimization of the CNN, by minimizing the loss from the classifier while maximizing that of the discriminator. The DANN training configuration enforces the model to converge towards an optimum state in which it aligns features output by the extractor between domains, and thus confusing the discriminator while aiding the classifier.
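The min-max behaviour described above can be written compactly. The following is a sketch in the notation of the DANN formulation of Ganin et al. (2016); the symbols $\theta_f$ (extractor weights), $\theta_y$ (classifier weights), $\theta_d$ (discriminator weights) and the weighting factor $\lambda$ are taken from that formulation, and the per-sample averaging over source and target batches is omitted for brevity.

```latex
\begin{aligned}
E(\theta_f, \theta_y, \theta_d) &= L_y(\theta_f, \theta_y) - \lambda\, L_d(\theta_f, \theta_d), \\
(\hat{\theta}_f, \hat{\theta}_y) &= \arg\min_{\theta_f,\, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \\
\hat{\theta}_d &= \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d).
\end{aligned}
```

Maximizing $E$ over $\theta_d$ is the same as minimizing the domain loss $L_d$, so the discriminator still learns to tell domains apart, while the extractor is driven to make that task harder.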

DA-Crack method
The domain adversarial training for crack (DA-Crack) method proposed in this work is based on the architecture of Ganin et al. (2016), presented in Fig. 1. This is an adversarial network that includes three main parts, namely a feature extractor (ResNet50) that feeds a classifier and a discriminator, each composed of two fully connected layers. At the training stage, both the source and target inputs are fed into the feature extractor. The classifier is fed only with features extracted from the source dataset since the target dataset does not include labels. The discriminator is fed with features extracted from both the source and target inputs, and learns to discriminate between domains. The main specificity of this architecture is that, during backpropagation, the gradients derived from the discriminator are reversed before they reach the feature extractor. This gradient reversal layer (GRL) pushes the feature extractor towards a latent space in which features from both datasets are shared and tend to overlap class-wise.
Fig. 1 Architecture of an adversarial training model (Ganin et al., 2016)
Note that the gradient derived from the classifier loss ($\partial L_y / \partial \theta_y$) is backpropagated through the classifier and combined with the reverse of the gradient derived from the domain loss ($-\lambda\, \partial L_d / \partial \theta_d$). The factor $\lambda$ has the task of weighting the effect of the discriminator gradient on the weights of the feature extractor. This is a hyperparameter that is set to 1 in this particular case. Because the classification is binary in this case, the classifier requires a binary output: crack or no crack. In the scope of this work, the feature extractor used is the ResNet50 (He et al., 2016), an "off-the-shelf" CNN, pretrained on the ImageNet (Deng et al., 2009) dataset and loaded from the torchvision library (v.0.7.0). Both the classifier and discriminator share the same architecture, with two fully connected layers, mapping the output of the feature extractor from a 2,048-element array to 128 elements, and then to a two-element output, as shown in Fig. 2. The classifier is instantiated with random weights, which causes high-gradient updates that affect the pretrained feature extractor. This effect is attenuated by pretraining the classifier. Therefore, the DA-Crack training method is twofold: (i) in the first phase, the classifier is trained on top of the CNN using only the source dataset; at this stage, the CNN is not updated while the classifier is trained for 100 epochs; (ii) in the second phase, the adversarial training is done on both the source and target datasets for 100 epochs; in this phase, the network is in its adversarial architecture and training starts with the pretrained CNN, the classifier instance trained in the previous stage and the discriminator instantiated with random weights. Then, all trainable variables are updated during training, following the premises of the adversarial method. The objective function used for both the classifier and discriminator is the cross-entropy loss.
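As a minimal numeric sketch of how the two gradients combine at the feature extractor, the following plain-Python example applies one gradient-descent step with a reversed domain gradient. The function names, the toy weight vectors and the learning rate are hypothetical, momentum is ignored, and the weighting factor is called `lam` (set to 1 by default, as in the text); this is an illustration of the principle, not the authors' implementation.

```python
def reverse_gradient(grad, lam=1.0):
    """Backward pass of a gradient reversal layer: multiply by -lam."""
    return [-lam * g for g in grad]


def extractor_step(theta_f, grad_cls, grad_dom, lr=0.01, lam=1.0):
    """One plain gradient-descent step on the extractor weights.

    grad_cls: gradient of the classifier loss w.r.t. the extractor weights.
    grad_dom: gradient of the domain loss w.r.t. the extractor weights,
              as it arrives from the discriminator (before reversal).
    """
    reversed_dom = reverse_gradient(grad_dom, lam)
    return [t - lr * (gc + gd)
            for t, gc, gd in zip(theta_f, grad_cls, reversed_dom)]


# Toy values (hypothetical): two extractor weights.
theta = extractor_step([1.0, -2.0], grad_cls=[0.5, 0.0],
                       grad_dom=[0.2, -0.4], lr=0.1)
```

The classifier term moves the weights downhill on the classification loss, while the reversed term moves them uphill on the domain loss, which is what makes the extracted features harder for the discriminator to separate.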
The optimizer used is the Stochastic Gradient Descent (SGD), with a momentum of 0.9 and a learning rate of 0.01. The learning rate is decayed at each iteration as a function of the iteration and the size of the source dataset. Despite the CNN being pretrained on ImageNet using normalized images with mean values of [0.485, 0.456, 0.406] and standard deviations of [0.229, 0.224, 0.225], the images used in the scope of this work were normalized to mean values of [0.5, 0.5, 0.5] and standard deviations of [0.5, 0.5, 0.5], respectively, in the three RGB image channels. This choice is due to the distinctiveness of the dataset, which encompasses images of concrete surfaces only, less diverse in color than those of the ImageNet dataset.
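The effect of the chosen normalization can be illustrated per pixel. The sketch below assumes channel values already scaled to [0, 1] (as is usual before normalization) and uses a hypothetical helper name; with a mean and standard deviation of 0.5, each channel is mapped linearly from [0, 1] to [-1, 1].

```python
def normalize_pixel(rgb, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Normalize one RGB pixel whose channels are scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


# Statistics quoted in the text for the ImageNet pretraining.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

black = normalize_pixel((0.0, 0.0, 0.0))  # (-1.0, -1.0, -1.0)
white = normalize_pixel((1.0, 1.0, 1.0))  # (1.0, 1.0, 1.0)
gray_imagenet = normalize_pixel((0.5, 0.5, 0.5), IMAGENET_MEAN, IMAGENET_STD)
```

With the ImageNet statistics, a neutral gray pixel maps to a slightly different value in each channel, whereas the symmetric 0.5 setting treats the three channels identically, which suits images with little color variation.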

Dataset
Two datasets of image patches from concrete surfaces are used, namely a labeled source dataset, which includes a large number of samples, and an unlabeled target dataset, with a small number of samples. The target dataset is not appropriate for training because the number of samples is too small and these are not labeled. In contrast, the source dataset is labeled and includes a large number of samples that are fully available, but a model trained with it alone cannot classify images from the target dataset. Images from both datasets are of
The source dataset is that of Ozgenel (2018), entitled Concrete Cracking Images for Classification (CCiC), which was made publicly available in 2018. Patches were cropped from 458 high-resolution (3,024 × 4,032 px) images captured in buildings at the Middle East Technical University (Ankara, Turkey). The patches were cropped with a size of 227 × 227 px following the guidelines described in Zhang et al. (2016), resulting in 40,000 samples. These were manually classified as positive (20,000 patches) or negative (20,000 patches) relative to the presence of cracks. The procedure classifies a patch with crack as positive if a crack is shown within a frame of at least 5 px from the edges of the patch.
The CCiC dataset includes images of concrete surfaces in gray and mainly with a smooth texture. A small number of samples show slightly different roughness and changes in color, mainly due to repair works. The images show constant brightness, sometimes exhibiting shadows and scattered paint drops. The patches with cracks are mainly close ups with very few cracks, some of which are quite thin. Cracks are centered in the patch and may have any direction.
The CCiC dataset was split in three subsets with 70%, 20%, and 10%, for training, validation, and testing, respectively. The training and validation subsets are used in the training process with a testing subset used to evaluate the final model in a similar but unknown set of image patches. Figure 3 shows some examples of patches that can be found in the CCiC dataset, with samples a, b and c showing positive patches for crack, while d, e and f represent negative patches.
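The 70/20/10 split can be sketched as a shuffle of sample indices followed by slicing. This is a minimal sketch under stated assumptions: the helper name and the fixed seed are hypothetical, and the split is not stratified by class, since the text does not say whether the authors stratified.

```python
import random


def split_indices(n_samples, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle indices and cut them into train/validation/test subsets.

    The test subset receives whatever remains after train and validation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 40,000 CCiC patches -> 28,000 / 8,000 / 4,000 samples.
train_idx, val_idx, test_idx = split_indices(40_000)
```

Because the three slices are taken from a single shuffled list, no patch can appear in more than one subset.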
The target dataset results from an acquisition campaign undertaken by the authors (Valença & Júlio, 2018) in the Itaipu Dam, Brazil, identified in this work as the Itaipu dataset. Patches included in this dataset were cropped, following the same guidelines as in the CCiC dataset, from the three high-resolution (3,072 × 4,608 px) images in Figs. 4, 5 and 6. The images from the target dataset are of a dam in service, showing low brightness, shadowed regions, dark irregular concrete color, and very rough and irregular texture. Dark hollow cavities scattered all over the surface are visible, as well as runoffs due to poor finishing.
Due to the characteristics of the images in terms of spatial resolution, the cropping was performed in patches of size 50 × 50 px. This patch size was selected by visual judgment to approximately match the scale of the CCiC patches. The patches are then all resized to 227 × 227 px to respect the input size expected by the network. The dataset includes 294 patches that were manually classified as positive or negative, also following the guidelines from Zhang et al. (2016). The dataset is balanced, with the same number of positives and negatives; because the cracked area is much smaller than the area without cracks, the negatives were reduced by random selection to match the number of positive patches. This target dataset is split in two subsets of 60% and 40%, resulting in 176 and 118 samples, for training and testing, respectively. Examples of patches from the target dataset are shown in Fig. 7, with samples a, b and c showing positive patches for crack, while d, e and f represent negative patches.
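The balancing and splitting of the target dataset can be sketched as random under-sampling of the negative class followed by a 60/40 cut. The helper name, the seed and the size of the initial pool of negatives (1,000 here) are hypothetical; only the totals of 294, 176 and 118 samples come from the text, from which 147 positives follow.

```python
import random


def undersample_negatives(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives."""
    kept = random.Random(seed).sample(negatives, k=len(positives))
    return list(positives) + kept


# Hypothetical patch identifiers: 147 positives, a larger pool of negatives.
positives = [("pos", i) for i in range(147)]
negatives = [("neg", i) for i in range(1000)]

balanced = undersample_negatives(positives, negatives)
n_train = round(0.6 * len(balanced))   # 60% for training
n_test = len(balanced) - n_train       # remaining 40% for testing
```

Rounding 60% of 294 gives 176 training samples and leaves 118 for testing, matching the figures reported above.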

Evaluation
The evaluation of the DA-Crack method is performed using the classification accuracy. Reference accuracy baselines are established by training different models as follows: (i) train on the source dataset only, to evaluate how well this reference version performs on the test subsets from both source and target datasets. This also establishes a lower bound for the performance of the model on the target dataset (tested on the target test subset); (ii) train on the target dataset (labeled) only, to establish an upper bound, reflecting the limits of performance expected for the following domain adapted model; (iii) train on both source and target datasets, labeled and unlabeled, respectively, using the DA-Crack method. The accuracy of this model reflects the effective performance of the DA-Crack training method. The difference between the accuracy of the model trained on the source dataset and that achieved using the DA-Crack method when tested on the target test subset reflects the improvement achieved by the training method proposed in this work.
At the inference stage, once trained, the models are reshaped since the discriminator part is no longer needed. Figure 8 shows the architecture used for classifying individual patches.
The DA-Crack method is evaluated using the classification accuracy to measure the model performance in computing predictions for individual patches in each of the test subsets, both from source and target datasets.
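The evaluation metric and the comparison between baselines can be sketched in a few lines. The function names and the toy predictions are hypothetical; the 54% and 84% values are the mean target accuracies reported in this work for the source-only and DA-Crack models.

```python
def accuracy(predictions, labels):
    """Fraction of patches whose predicted class matches the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def percentage_point_gain(baseline_pct, adapted_pct):
    """Absolute gain between two accuracies given in percent."""
    return adapted_pct - baseline_pct


# Toy example: 1 = crack, 0 = no crack; three of four patches are correct.
toy_acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Source-only model versus DA-Crack model on the target test subset.
gain = percentage_point_gain(54, 84)
```

Reporting the gain in percentage points keeps it directly comparable with the lower and upper bounds, which are also expressed as accuracies.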

Results
The DA-Crack training method described in Sec. 3 was used to train a model adapted to the domain of the images from the target dataset, namely from the Itaipu Dam. In the scope of this work, two additional models were trained for evaluation purposes, namely, a model trained only with the source dataset and another trained only with the target dataset. These reference models are used to establish performance upper and lower bounds. The source trained model provides a lower bound when tested on the testing subset of the target domain, while the target trained model provides an upper bound on the same subset. To ensure that the model trained on the source-only dataset was not overfitting, the training and validation accuracy curves were computed, revealing a similar behavior between both curves. Finally, the model trained using the DA-Crack method proposed in this research work is evaluated on the source and target testing datasets.
The scores in Table 1 were obtained with models trained under each of three configurations, namely using only images from the source domain, only images from the target domain, and both datasets according to the training method assessed in this work, the DA-Crack method. In a real case scenario, testing is unavailable since the target dataset would be unlabeled, preventing the user from selecting the epoch at which the best accuracy is achieved. Thus, the number of training epochs was defined by the user, 100 epochs in this case, after which the model was stored for testing purposes. Because the number of epochs is a guess from the user, six runs were performed for each training configuration and the corresponding mean value was computed.
The model trained on the source-only dataset provides very good accuracy (over 99%) in the classification of images from the source testing subset, which is as expected. The same model shows a relevant decrease if classifying images from the target dataset, with a mean accuracy of 54%. This represents a lower bound, and sets the challenge addressed in this work on how to improve the accuracy of the model on the target dataset having access only to an unlabeled dataset from this domain.
If the model is trained on the target dataset (with full access to labeled images), the model performs well on images from the target domain as expected, with a mean accuracy of 96%. This model cannot maintain a similar accuracy when tested on images from the source domain, with a mean accuracy of 81% only. Theoretically, the accuracy of the model trained on the target dataset and tested on a subset from the same domain sets an upper bound, showing the limits expected for the accuracy of the DA-Crack method proposed herein.
The model trained using the DA-Crack method, taking advantage of both the source dataset and an unlabeled target dataset, shows a mean accuracy of 99% when tested on the source subset. The same model shows an accuracy of 84% when classifying images from the target testing subset. This result shows that the model keeps its performance on the source testing subset while improving the mean accuracy in classifying images of the target testing subset from 54% to 84%. This corresponds to an increase of 30 percentage points, an improvement of 54% in relative terms.
To evaluate the behavior of the models' accuracy during training, Figs. 9 and 10 show the evolution of accuracy achieved for both the source and target testing subsets on each of the models trained, namely on source-only and target-only datasets, and using the DA-Crack training method. Figure 9, focused on testing the models on source images, shows that these are accurately classified by the source-only trained model, with accuracy values above 99% (in green). If the model is trained using the proposed DA-Crack method, the testing accuracy remains above the 99% threshold. When the model is trained on the target-only images (in blue), the source images are poorly classified, which is as expected since these are completely unknown to the model, and this is in line with the problem assessed in this work.
The test performed on images from the target domain (Fig. 10) shows that the model trained on the source-only dataset (in green) gives poor results in terms of accuracy, as expected. This represents the generalization problem tackled herein. The model trained on the target-only dataset (in blue) presents high accuracy, which establishes an upper bound to the classification accuracy expected. The model trained using the DA-Crack method shows a relevant improvement in classification accuracy, converging to values between 73% and 90%, with a mean accuracy of 84%, as in Table 1.
The testing accuracy shows some volatility along training but with a clear tendency. These results present some convergence in terms of accuracy after around ten epochs, indicating that the 100 epochs used in this study may be excessively conservative. In the case of testing the target domain images, the dispersion is more pronounced; however, it presents a tendency of improvement, as shown by the filled regions in the figures. Figures 11 and 12 present examples of patches with the classification proposed using the DA-Crack trained model, from the CCiC and Itaipu datasets, respectively. In Fig. 11, the classification of patches is highly accurate, as expected since the testing patches come from the same dataset as the training subset and are clean and very similar to those used for training. In Fig. 12, the testing patches are from the Itaipu dataset and are more difficult for classification by a model that was trained only on the source dataset. However, when using the DA-Crack training method, the model becomes able to classify these patches fairly well. Patches shown in the image are mostly classified correctly, except for the third patch in the upper line and the second in the lower line. These misclassified examples show blurred cracks, which may be why they were considered clear concrete surfaces without a crack.
Fig. 9 Evolution of validation accuracy of source test on the six models trained on the source-only dataset (green), target-only dataset (blue) and using the DA-Crack method (red)
The distribution of encodings from image patches of both datasets was computed using a t-SNE (Van Der Maaten and Hinton, 2008) projection with color-coded dots. The distribution of the patches produced by the traditionally trained model, using the source-only dataset, is shown in Fig. 13a. Figure 13b shows the patch representation produced by the model trained following the DA-Crack method. The violet and blue dots represent the no-crack and crack patches from the source domain, respectively. The green and yellow dots represent the no-crack and crack patches from the target domain, respectively.
Patch representation in feature space following the source-only training (Fig. 13a) shows a clear separation of source patches between classes (violet and blue dots), which is as expected since the model was trained specifically on this dataset. The same model represents the patches from the target dataset with some degree of separation between classes (green and yellow dots); however, this does not match the separation learnt for those from the source dataset, invalidating the model's ability to correctly classify these patches. A noticeable separation also exists between datasets, with green and yellow dots (target dataset) forming a group separate from the violet and blue dots (source dataset).
Following the training using the DA-Crack method, Fig. 13b shows the distribution of patches in a spiral-like shape, after t-SNE to 3D feature space. Patches from the source dataset remain separated; however, with some overlap in this case (violet and blue dots), which suggests some reduction of classification accuracy, observed also in Table 1. The encodings of patches from the target dataset follow the same behavior, with some degree of separation between its classes (green and yellow dots). The main finding clearly noticed in this figure is that classes from both datasets regroup; namely, yellow dots grouping with blue dots (patches with crack) and green dots become closer to violet dots (patches without crack). This behavior is in line with the results reported in Table 1, with a clear increase in the classification accuracy of patches from the target dataset on the model following the DA-Crack method of training.
The main focus of developing classification models for concrete cracking is to detect cracks in large surfaces. In that sense, the sliding window technique was used to scan full-size images from the Itaipu Dam with classification of patches of similar size as those used for training, as shown in Fig. 15. To validate this detection method, prior to its application to the images from the target domain, it was applied to full-size images (in Fig. 14) randomly selected from those that originated the CCiC dataset (Ozgenel, 2018). Figures 14 and 15 show an overlap of detection performed by both models: the one trained traditionally using only the labeled source dataset and that trained on the DA-Crack method using a labeled source dataset and a small unlabeled target dataset, respectively. Patches positively classified as crack by the source-only trained model are represented in blue, while those from the model trained under the DA-Crack are in red. It is assumed that patches positively classified as crack by the DA-Crack model are also positives under the source-only trained model. The detection maps of Fig. 14 show that both models perform well on images from the source domain. The cracks are detected with most of the patches correctly classified and with the source-only trained (blue patches) model performing better than the model trained with DA-Crack method, as corroborated by the results in Table 1. Despite the slight under-performance of the model trained on the DA-Crack method, it still correctly classifies most of the patches, with the crack correctly identifiable (red patches). The patches identified in yellow are the patches classified as clean surface and are also important in the classification since both models correctly classify them.
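The sliding window scan can be sketched as follows. The stride (equal to the patch size, so windows do not overlap), the function names and the stand-in classifier are assumptions made for illustration; in the actual pipeline each 50 × 50 px crop would be resized to 227 × 227 px and passed to the trained network.

```python
def sliding_window_boxes(height, width, patch=50, stride=50):
    """List (top, left, bottom, right) boxes that fit fully in the image."""
    return [(top, left, top + patch, left + patch)
            for top in range(0, height - patch + 1, stride)
            for left in range(0, width - patch + 1, stride)]


def detect(image, classify_patch, patch=50, stride=50):
    """Return the boxes whose crop is classified as positive for crack.

    image is a 2D list of pixel intensities; classify_patch is any
    callable that takes a crop and returns True for 'crack'.
    """
    positives = []
    for top, left, bottom, right in sliding_window_boxes(
            len(image), len(image[0]), patch, stride):
        crop = [row[left:right] for row in image[top:bottom]]
        if classify_patch(crop):
            positives.append((top, left, bottom, right))
    return positives


def dark_patch(crop):
    """Stand-in classifier (hypothetical): flags crops with low mean intensity."""
    values = [v for row in crop for v in row]
    return sum(values) / len(values) < 5


# Toy 4 x 4 image with a dark 2 x 2 corner, scanned with 2 x 2 windows.
toy = [[0, 0, 9, 9], [0, 0, 9, 9], [9, 9, 9, 9], [9, 9, 9, 9]]
hits = detect(toy, dark_patch, patch=2, stride=2)

# Number of non-overlapping 50 px windows in a 3,072 x 4,608 px image.
n_windows = len(sliding_window_boxes(3072, 4608))
```

With these assumptions a single Itaipu image yields 5,612 windows, which illustrates why the sliding window approach is described as computationally expensive.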
In the case of images from the Itaipu Dam in Fig. 15, the performance of the model trained on the source-only dataset is not acceptable (patches in blue) since the crack cannot be identified. The model correctly classifies patches with crack, but the number of false positives is high, preventing the model from being used for classification of cracks from the Itaipu Dam (target dataset). This result establishes the problem to be tackled with the method proposed in this work. The model trained with the DA-Crack method classified patches in the same images with better results (patches in red). The false positives are completely removed. Despite the model not being able to effectively classify all the patches positive for crack, the shape of the crack in the full image is perfectly drawn. This is corroborated by the classification accuracy presented in Table 1 and Fig. 13, showing an improvement of 54% for the model trained with the DA-Crack method relative to that trained on the source-only dataset.
Figure caption: The green patches are those classified as negative for crack by both source-only and DA-Crack trained models; the red patches are those classified as positive for crack by the DA-Crack trained model; the blue patches are those classified as positive for crack by the source-only trained models. Red patches overlap blue patches that were previously painted
This shows the practical improvement introduced in this work, demonstrating that even without a labeled dataset from some structure (Itaipu Dam in this case) a neural model can be domain adapted using a labeled dataset from another structure through the training DA-Crack method.

Conclusions
The DA-Crack training method is presented in this work to overcome the problem of generalizing a model to perform on datasets of different concrete surfaces. This is achieved by implementing an adversarial architecture to train the model. The model was trained on both a public source dataset (Ozgenel, 2018) and a target dataset from the Itaipu Dam in Brazil. The performance of the model was evaluated on images previously reserved from both datasets. The DA-Crack training enabled the model to improve its accuracy in classifying cracks in the target images, at the patch level. The accuracy improved from 54% to 84% between a model trained only on the source dataset and a model trained following the DA-Crack method, using both the labeled source and unlabeled target datasets. This constitutes an increase in accuracy of 30 percentage points, an improvement of 54% in relative terms. Moreover, apart from the improvement in classification accuracy brought by the domain adversarial training, the ability of the model to classify images from the source domain remained over the 99% threshold. The improvement in the classification of the target domain images was obtained by training the model on the source dataset, which is widely available in public databases, in combination with a more specific and very small unlabeled target dataset.
Further studies are necessary, namely testing the model on other datasets acquired under different conditions, including variability in brightness, scale and roughness of the surface.