1 Introduction

In the automotive industry, many processes are automated. This also applies to quality assurance by non-destructive testing (NDT) methods. For plate-like structures, air-coupled ultrasonic (ACU) testing has advantages over coupled ultrasonic testing, since it allows a higher degree of automation [1]. In ACU testing, the ultrasonic waves propagate through air as the couplant. Due to the high impedance contrast between air and the specimen (e.g., for carbon fiber-reinforced polymers, CFRP), most of the energy is reflected at the surfaces of the specimen. For that reason, high excitation voltages of up to 800 V are necessary to generate signals with a high amplitude, and lower frequencies (30-700 kHz) are applied than in coupled ultrasonic testing [2]. According to the \(\lambda/2\) criterion, lower frequencies lead to a lower resolution. Inhomogeneities in the material introduce additional boundaries at which the ultrasound signal is reflected, which decreases the signal amplitude.

Commonly, an evenly spaced measuring grid is used during the ACU testing procedure. The dataset therefore consists of spatially distributed A-scans. The A-scans are normally assembled spatially and converted to an image called a C-scan. For this, each time series is truncated to the part where the ultrasound pulse is visible in the signal. Afterward, the signal is transformed with a fast Fourier transform, and the highest amplitude within the frequency spectrum of the transducer is picked. Sometimes the A-scans are filtered with a bandpass filter to ensure that only the transducer frequencies are present in the signal. Furthermore, the final C-scan is commonly converted into the dB scale to emphasize small changes. Since this workflow compresses the data, some of the information (e.g., the travel time) is lost. In Fig. 1, a C-scan and the corresponding A-scans are visualized.
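
The C-scan workflow described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this study: the function name, the array layout (rows, columns, samples), the gate indices, and the normalization to the image maximum before the dB conversion are all assumptions.

```python
import numpy as np

def cscan_from_ascans(ascans, fs, gate, band, to_db=True):
    """Build a C-scan from gridded A-scans (hypothetical helper).

    ascans: array of shape (ny, nx, nt), one time series per grid point
    fs:     sampling rate in Hz
    gate:   (start, stop) sample indices around the ultrasound pulse
    band:   (f_lo, f_hi) transducer frequency band in Hz
    """
    start, stop = gate
    gated = ascans[..., start:stop]                    # truncate to the pulse
    spectrum = np.abs(np.fft.rfft(gated, axis=-1))     # amplitude spectrum
    freqs = np.fft.rfftfreq(stop - start, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    cscan = spectrum[..., in_band].max(axis=-1)        # peak amplitude in band
    if to_db:
        # Assumed convention: dB relative to the largest value in the image.
        tiny = np.finfo(float).tiny                    # avoids log10(0)
        cscan = 20.0 * np.log10(np.maximum(cscan, tiny) / cscan.max())
    return cscan
```

Only one value per grid point survives this reduction, which is why information such as the travel time is no longer available in the C-scan.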

Fig. 1

C-scan with corresponding A-scans. The color of the C-scan encodes the maximal amplitude in the time series

The goal of this study is to automate ACU testing and to detect defects more precisely than with manual testing. For this, three different deep learning approaches for classifying or segmenting defects in ACU testing data are investigated. In the first approach, the single A-scans are classified by a time series model. In the second, C-scans are segmented with U-Net-like architectures. The third approach is a hybrid of time series models and image segmentation: the output of the time series models is reconstructed into an image, and this image is segmented with a U-Net-like model. The results of the models on the test dataset are compared using several metrics. Further, the performance of the models is benchmarked against the conventional evaluation, which is thresholding of the C-scans. To the best of the author’s knowledge, this is the first study comparing these three approaches with ACU testing data. Additionally, specimens made of two different materials, acrylic and CFRP, are investigated. As testing setups, three different arrangements of transducer and receiver are used: two using Lamb waves and one through-transmission setup using bulk waves. While Lamb waves enable single-sided ACU tests, they come with the drawback of reduced accuracy due to defect elongation caused by the longer wave propagation within the plate. The primary research question of this study is whether the reduced accuracy observed in Lamb wave testing can be compensated by applying deep learning models.

2 Related Work

The objective of this study is to apply deep learning to ACU testing data in order to automate and improve ACU testing. Machine learning (ML), and especially deep learning, has found many applications in recent years, for example in consumer products such as smartphones. Many breakthroughs have been accomplished with deep learning in fields such as image, audio and video processing [3]. In non-destructive testing (NDT), artificial neural networks have also been applied successfully in many cases [4]. Typically, NDT measurement data is classified manually after some feature extraction. This only works if the selected features are well-defined and, therefore, this approach may fail if the changes caused by the inhomogeneities are subtle. With deep learning, the feature extraction step is not necessary, since it is done by the neural network. In addition, the evaluation is done automatically. This enables new ML-based NDT systems that match the demands of Industry 4.0, such as mitigating the influence of human factors and reducing the time needed for evaluating the data [5]. In the future, the NDT inspector could be supported or fully replaced by these kinds of systems.

For ultrasound testing, the studies focus either on time series data (A-scans [6,7,8]), or on image datasets (B-scans [9] or C-scans [7]). Furthermore, according to [5] there are different tasks in ultrasound testing that are solved with deep learning. These are for example denoising, defect detection, data reduction, improving the resolution, defect characterization, and material property determination.

In the following, two studies that apply deep learning to ultrasound data measured on a grid are introduced. In [6] the measurement was conducted on a grid using immersion testing in a water-filled tank. The objective was defect detection in a specimen made of a nickel-based alloy with some segregation in it. It was not possible to detect this defect type with conventional evaluation of the ultrasound data. However, with deep learning, the defect was clearly identified. A time series model, a recurrent convolutional neural network with attention and spectral representations (RCAS), was trained on the individual A-scans [6]. For ACU testing, the authors of [7] implemented a denoising procedure to improve the SNR for single-sided ACU scans. The dataset consists of C-scans from CFRP impact specimens. The labels of the data were generated by manual annotation. Both the temporal and the spatial domain are used. In the first step, classification was performed on the time series data (A-scans) with an LSTM concatenated with three 1D convolution layers. The result of this classification was thresholded and reconstructed into an image. This image was segmented with a separate convolutional neural network tailored for images, which uses patches of sizes from 5 \( \times \) 5 pixels to 65 \( \times \) 65 pixels as input and, therefore, can identify the shape of the defect from high to low resolutions. The accuracy of the time series model was lower than the accuracy of the model for image segmentation. Both studies also suffer from the problem of an imbalanced dataset, i.e., an unequal distribution of classes within the dataset, which is also the case in this work. Both presented studies use ultrasound data measured on a grid, which is also the raw data type of this study. In [6] only the temporal and not the spatial information is used. Furthermore, in comparison to [7], this work investigates transmission measurements and different specimen types and conducts an extended evaluation of the model performance.

To extract significant features from the ACU dataset, the power spectral density can be used. Mwema et al. [10] illustrated this approach for atomic force measurements, where they utilized the power spectral density as a preprocessing technique for images.

3 Material and Methods

This section first discusses how the transducer and receiver are arranged in ACU testing (Sect. 3.1). In this study, transmission and Lamb wave arrangements are considered. The ACU testing setup and the investigated specimens are described in Sects. 3.2 and 3.3. The procedure for generating the ground truth for the dataset is described in Sect. 3.4. Finally, the neural network architectures, including the investigated hyperparameters and evaluation metrics, are explained in Sects. 3.5 and 3.6.

3.1 Transducer Arrangements for Air-Coupled Ultrasonic Testing

Normally, ACU testing is done in the through-transmission arrangement (Fig. 2a), where bulk waves are excited on one side of the specimen and measured on the opposite side [11].

Fig. 2

Transducer and receiver arrangements in ACU testing, with an optical microphone as receiver: a the through-transmission method, b the single-sided re-emission method, and c the re-emission method with transducer and receiver on opposite sides

In this setup, the transducer and receiver are aligned perpendicularly to the specimen and on a common axis. With 400 kHz transducers and a sufficiently small scanning grid, a defect of approx. 1 mm can be detected in this arrangement [2]. For simplicity, the through-transmission arrangement is called the transmission arrangement in this work. Another measuring principle of ACU testing uses guided or Lamb waves.

Lamb waves are ultrasonic waves that propagate between two parallel surfaces [12]. Their wavelength is of the same size as or larger than the plate’s thickness. Lamb waves have several symmetric (S) and antisymmetric (A) modes, which result from the geometrical constraints of the plate [13]. In order to excite Lamb waves with an air-coupled ultrasound transducer, the incident ultrasound wave has to reach the surface of the plate at a specific angle \(\Theta \). The incident wave is then converted to the bulk wave modes: a longitudinal or p-wave and a transversal or s-wave. This is shown in Fig. 3.

Fig. 3

Lamb wave formation with an incident wave propagating through air, adapted from [14]

Through multiple reflections of these wave modes and further mode conversion, Lamb waves emerge [14]. The angle of incidence determines which mode is excited and the SNR of the respective mode. The maximal SNR is achieved when the following equation applies

$$\begin{aligned} \textrm{sin}(\Theta _{\mathrm {A_0}}) =\frac{c_{\textrm{air}}}{c_{\mathrm {A_0}}}, \end{aligned}$$
(1)

where \(c_{\textrm{air}}\) is the sound velocity in air and \(c_{\mathrm {A_0}}\) is the sound velocity of the \(\mathrm {A_0}\) mode, which is commonly investigated with ACU testing [1].
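
As a worked example of Eq. (1), the following sketch computes the incidence angle from the two velocities. The function name is hypothetical, and the velocities in the usage note are illustrative assumptions, not values measured in this study; \(c_{\mathrm {A_0}}\) is interpreted here as the phase velocity of the \(\mathrm {A_0}\) mode at the excitation frequency.

```python
import math

def a0_incidence_angle_deg(c_air, c_a0):
    """Incidence angle in degrees that satisfies sin(theta) = c_air / c_a0."""
    if c_a0 < c_air:
        # sin(theta) would exceed 1, so no real angle satisfies Eq. (1).
        raise ValueError("c_a0 must not be smaller than c_air")
    return math.degrees(math.asin(c_air / c_a0))
```

With an assumed sound velocity in air of 343 m/s and an assumed \(\mathrm {A_0}\) velocity of 1000 m/s, `a0_incidence_angle_deg(343.0, 1000.0)` gives an angle of roughly 20 degrees. Because the \(\mathrm {A_0}\) velocity depends on material, plate thickness and frequency, the angle has to be determined separately for each tested configuration.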

ACU testing with Lamb waves is done using the so-called re-emission method. With this arrangement, single-sided testing is possible, as can be seen in Fig. 2b. For a single-sided arrangement, a beam shield is necessary. The beam shield prevents the receiver from capturing reflections of the ultrasound wave coming from the surface of the specimens. Examples of this setup are given in [1, 15,16,17]. The disadvantage of the re-emission arrangement in comparison to the transmission arrangement is a longer travel path of the ultrasound signal within the plate, leading to a lower SNR and resolution. Furthermore, defects are stretched in one direction, since the whole travel path of the ultrasound between the transducer and receiver influences the signal [15]. Another arrangement for measuring Lamb waves with ACU testing uses the same transducer position as in the single-sided arrangement, but with the receiver placed on the opposite side, see Fig. 2c. Since no direct reflection from the surface can reach the receiver in this way, no beam shield is necessary. The arrangements in Fig. 2a and c are investigated in this study. Arrangement c) was used since it is easier to realize than arrangement b) but leads to similar results. Another ACU arrangement would be pulse-echo. However, this is very uncommon because of the high energy losses at the air-plate boundaries.

3.2 Experimental Setup for Air-Coupled Ultrasonic Testing

The transducer arrangements described in Sect. 3.1 and two types of materials (acrylic and CFRP) are investigated with ACU testing. All experiments were conducted on a measurement grid with a distance between each A-scan of 0.5 mm in x- and y-direction. For the transmission measurement, the ultrasound wave was generated with the piezoelectric transducer Sonoair CF 400 with a resonance frequency of 400 kHz.

For the transducer arrangement using Lamb waves shown in Fig. 2c, two configurations were investigated: one with transducer and receiver aligned in parallel and one with them aligned perpendicular to the scanning direction. These arrangements are called horizontal Lamb wave and vertical Lamb wave measurement, respectively, in this work. The incident angle of the ultrasound wave was adjusted for each tested material in order to reach the maximal signal amplitude of the \(\mathrm {A_0}\) mode. The Lamb wave measurements on the acrylic specimens were conducted with the 230 kHz transducer Sonoair CFC 230-D-D25-T, and for the CFRP specimens, the 75 kHz transducer Sonoair CFC 075 was used in order to reach a signal with a higher amplitude.

The receiver is the optical microphone Xarion Eta 450, which measures the sound pressure by means of interferometry. Since no moving components are present in this microphone, the device has a broadband sensitivity from 100 Hz to 1 MHz. Furthermore, it has a small aperture of approx. 2 mm and, therefore, an improved spatial resolution [18].

3.3 Specimens and Measurement Data

Two types of plate-shaped specimens are investigated in this study: specimens made from acrylic and CFRP material. The measurements for the two investigated specimen types were split into training, validation, and testing datasets in order to train the neural networks and evaluate their performance.

Fig. 4

C-scans of the test and validation dataset of the acrylic specimens for the three arrangements. The rectangular defect in the validation specimens has an edge length from 10 to 30 mm and the circular-shaped defects in the test specimen have diameters from 3 to 20 mm. Each C-scan has been scaled between 0 and 1

Fig. 5

C-scans of the test and validation dataset of the CFRP specimens for the three arrangements. The validation dataset is marked with a gray rectangle. Each C-scan has been scaled between 0 and 1

For the acrylic specimens, eight plates were manufactured. The plates were of the size 250 \( \times \) 300 \( \times \) 3 mm and had flat-bottom holes of different shapes (e.g., circles, squares and rectangles with different rotations) milled into them. A mixture of paraffin and fine-grained sand was filled into the holes in order to attenuate the ultrasound wave. Of the eight specimens, six were used for training, one for validation, and one as a testing dataset. All specimens were measured with the transmission, the horizontal Lamb wave and the vertical Lamb wave arrangements. The resulting C-scans of the validation and test specimens are displayed in Fig. 4. The amplitude values of each specimen have been normalized. In the C-scans conducted with Lamb waves, it can be seen that the defects are stretched in one direction in comparison to the transmission C-scans. Furthermore, the background around the defects contains a pattern, which results from the non-planar surface caused by bending of the plate.

The CFRP specimens are of the size 550 \( \times \) 350 \( \times \) 3 mm. The material was composed of carbon fiber roving fabric sheets and an epoxy resin matrix. The orientation of the fiber sheets was \(45^{\circ }\)/\(45^{\circ }\)/\(0^{\circ }\)/\(45^{\circ }\)/\(45^{\circ }\), which represents a transversely isotropic or quasi-isotropic material with isotropy in the in-plane directions. In three of these plates, impacts were generated with energies between 7 and 12 J by a drop-weight impact test. An image of the damage due to a 10 J impact can be seen in Fig. 6.

The resulting C-scans of the validation and test specimens are displayed in Fig. 5. One of the three specimens was used for testing and 20 % of a second specimen was used for validation. The right side of the validation specimen, which is marked in Fig. 5, represents 20 % of its data and is the validation dataset. The rest of the measurement data was used for training.

3.4 Generating the Ground Truth of the Measurements

Since the C-scans of the 400 kHz transmission measurements of the CFRP and acrylic specimens show the boundaries of the defects most clearly, the labels were generated based on these C-scans. The labels were then also used for the Lamb wave datasets.

The transmission C-scans of the acrylic specimens show a more heterogeneous pattern within the defects than those of the CFRP impact specimens, and the boundary of smaller defects is not clearly visible (see Fig. 4). For this reason, two different strategies for generating the ground truth were chosen. For the acrylic specimens, a camera photo was taken in which the defects were manually annotated. In the corresponding C-scans, the visible defects were also annotated. Afterward, the centroids of each defect were registered to one another, whereby the C-scan centroids represent the fixed coordinate system. The determined transformation matrix was then applied to the annotation of the camera photo. For the CFRP specimens, the C-scans measured in the through-transmission arrangement were thresholded based on the 6 dB criterion. The 6 dB represents the drop of the signal amplitude from the region with no defect to the defect region [19]. The thresholded images represent the label. Further, the digital image processing operations opening and closing were applied in order to fill all holes within the defects, remove false positives outside the defects, and smooth the boundaries of the defects.
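
The label generation for the CFRP specimens can be sketched as follows. This is a minimal illustration under stated assumptions: the C-scan is given as linear amplitude, `reference_amplitude` (the amplitude of the defect-free region) is supplied by the user, and the square structuring element and its size are hypothetical choices that the text does not specify.

```python
import numpy as np
from scipy import ndimage

def label_from_cscan(cscan, reference_amplitude, drop_db=6.0, structure_size=3):
    """Threshold a linear-amplitude C-scan and clean the mask morphologically."""
    # A drop of 6 dB corresponds to a factor of 10**(-6/20), i.e. about 0.5.
    threshold = reference_amplitude * 10.0 ** (-drop_db / 20.0)
    mask = cscan < threshold
    structure = np.ones((structure_size, structure_size), dtype=bool)
    # Opening removes isolated false positives outside the defects.
    mask = ndimage.binary_opening(mask, structure=structure)
    # Closing fills small holes inside the defects and smooths the boundary.
    mask = ndimage.binary_closing(mask, structure=structure)
    return mask
```

The size of the structuring element controls how large a hole or an isolated detection may be before it survives the cleaning step, so it would have to be chosen with the scanning grid of 0.5 mm in mind.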

3.5 Neural Network Architectures and Experiments

The deep learning procedure presented in this work can be used to denoise C-scans and to characterize defects, and hence to reproduce the shape of a defect more precisely than manual evaluation.

In this study, three types of deep learning approaches based on the available data modalities are considered. These models have been trained on the datasets of the different arrangements and specimen types. In total 24 model or dataset configurations were investigated. The experiments are the following:

  • specimen type: acrylic or CFRP

  • data type or deep learning approach: time series models, which are trained on A-scans, and image models, which are trained either on C-scans or on the reconstructed output of the time series models.

  • arrangement of transducer and receiver: the ACU measurements were conducted in three different arrangements: 400 kHz transmission, horizontal Lamb wave, and vertical Lamb wave. The datasets of the horizontal and vertical Lamb waves are also used in combination for training models.

All of the models were implemented in fastai, which is a well-established high-level library implemented on PyTorch [20]. In order to choose a suitable learning rate for the models, the fastai learning rate finder was used for each trained model. The models were trained according to the fastai fit-one-cycle policy, which uses a learning rate scheduler [20]. For the stochastic gradient-based optimization the Adam optimizer is utilized in all experiments.

For all the experiments presented in this work, the hardware comprised a workstation with an NVIDIA GeForce RTX 3090 Ti GPU with 24 GB of GPU RAM, an AMD Ryzen Threadripper PRO 3955WX with 16 cores, and 264 GB of RAM.

Fig. 6

Damage resulting from a 10 J impact on the opposite side

The image models were pre-trained on the ImageNet [21] dataset, and a three-channel input was necessary since the ImageNet dataset contains three-channel RGB images. The same C-scan was therefore copied into each of the three input channels. Furthermore, an optimization procedure was conducted with the two data types (time series and images) in order to find the best-performing model and pre-processing configuration (see Sect. 4.1, Fig. 10). Afterward, a leave-one-specimen-out cross-validation scheme was performed with the time series data of all specimens. This means that all the data except one specimen was used for training and the left-out specimen was used for inference. The model and pre-processing configuration used for this was determined with the optimization scheme. The prediction results are reconstructed into an image and represent a second image data type. The reconstructed images of the test specimens for acrylic and CFRP are shown in Figs. 8 and 9. It can be seen that with the transmission measurement data on the top left in Figs. 8 and 9, respectively, the defects have clear boundaries and a high contrast. For the Lamb wave configurations, the contrast is lower and the defects are more blurry.
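
The leave-one-specimen-out scheme and the reconstruction of the time series predictions into an image can be sketched as follows. Both helper names are hypothetical, and the row-major ordering of the A-scans on the grid is an assumption about the data layout.

```python
import numpy as np

def leave_one_specimen_out(specimen_ids):
    """Yield (training specimens, held-out specimen) for every specimen."""
    for held_out in specimen_ids:
        train = [s for s in specimen_ids if s != held_out]
        yield train, held_out

def reconstruct_image(predictions, grid_shape):
    """Arrange one prediction per A-scan back onto the measurement grid.

    Assumes the A-scans were stored row by row (row-major order).
    """
    return np.asarray(predictions).reshape(grid_shape)
```

In this scheme every specimen receives predictions from a model that has not seen it during training, so the reconstructed images of all specimens can serve as input for the image models.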

For the Lamb waves, it was further investigated whether using the horizontal and vertical Lamb wave measurement data within one neural network improves the result. For the time series, the architecture was adapted for multivariate time series classification with two time series input channels. For the image models, a three-channel image input was necessary, since the models were pre-trained on the ImageNet [21] dataset. For this reason, the first channel was the horizontal Lamb wave C-scan, the second channel the vertical Lamb wave C-scan, and the third channel the sum of the two images. All the experiments and their classification and segmentation results, which are distinguished by specimen type (acrylic or CFRP), data type (time series, images/C-scan, images/prediction time series) and arrangement of transducer and receiver (transmission, horizontal Lamb wave, vertical Lamb wave, horizontal and vertical Lamb wave), are summarized in Sect. 4.2 in Table 1.
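
The two ways of filling the three input channels of the pre-trained image models can be sketched as follows. The function names and the channel-first layout (channels, rows, columns), which follows the usual PyTorch convention, are assumptions.

```python
import numpy as np

def replicate_channels(cscan):
    """Copy a single C-scan into all three input channels."""
    return np.stack([cscan, cscan, cscan], axis=0)

def combine_lamb_cscans(horizontal, vertical):
    """Channel 1: horizontal, channel 2: vertical, channel 3: their sum."""
    if horizontal.shape != vertical.shape:
        raise ValueError("both C-scans must be sampled on the same grid")
    return np.stack([horizontal, vertical, horizontal + vertical], axis=0)
```

The combined input requires that both Lamb wave C-scans are measured on the same grid, so that each pixel of the three channels refers to the same position on the specimen.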

3.6 Evaluation Metrics

In order to compare the output of the neural networks in a quantitative way, meaningful metrics need to be used. The softmax output of neural networks, which is commonly used for classification problems, is scaled between 0 and 1. In order to binarize the data a threshold of 0.5 is applied in this work. When the thresholded data from the model is compared to its labels, four kinds of results occur. This is visualized in a confusion matrix in Fig. 7.

Fig. 7

Confusion matrix for binary classification

Based on the confusion matrix several metrics can be determined. Commonly used is accuracy, which gives the ratio of correctly classified samples to the total number of samples

$$\begin{aligned} \text {accuracy}=\frac{\text {TP}+\text {TN}}{\text {TP} +\text {FP}+\text {TN}+\text {FN}}, \end{aligned}$$
(2)

but this metric does not work well for imbalanced datasets. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations

$$\begin{aligned} \text {precision}=\frac{\text {TP}}{\text {TP}+\text {FP}}. \end{aligned}$$
(3)

Recall takes the TP and FN into account

$$\begin{aligned} \text {recall}=\frac{\text {TP}}{\text {TP}+\text {FN}}. \end{aligned}$$
(4)

Precision measures the extent of error caused by FP whereas recall measures the extent of error caused by FN. In order to take the error of FP and FN into account the F1-score is used. The F1-score is the harmonic mean between precision and recall

$$\begin{aligned} \text {F1-score}=2 \cdot \frac{\text {precision} \cdot \text {recall}}{\text {precision}+\text {recall}}. \end{aligned}$$
(5)

The F1-score is still a meaningful metric for an imbalanced dataset. For that reason, it is applied in this work.
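
Equations (2) to (5) can be evaluated directly from the four entries of the confusion matrix. The following sketch uses a hypothetical function name; returning 0 when a ratio is undefined (zero denominator) is an assumed convention, and the counts in the usage note are invented for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumed convention: undefined ratios are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For an illustrative imbalanced case with 1000 samples of which 20 are defects, `binary_metrics(tp=0, fp=0, tn=980, fn=20)` yields an accuracy of 0.98 although no defect is found, while the F1-score is 0. This is the behavior that motivates the use of the F1-score.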

4 Results and Discussion

In this section, the results of the model optimization are discussed. Afterward, the results of the model applied to the different data types and arrangements are presented and compared to the results of thresholding the C-scans. Finally, the metrics are compared and the segmented images are displayed.

4.1 Results for Optimizing the Models

In order to find a model configuration that performs well on the given dataset, a stepwise optimization scheme was performed for the time series data (A-scans) and the image data (C-scans). The dataset used for this was the 400 kHz transmission data. In Fig. 10, the stepwise optimization scheme is displayed.

Fig. 8

Prediction of the time series model for the acrylic test specimen

Fig. 9

Prediction of the time series model for the CFRP test specimen

Table 1 Results of the models and thresholded C-scans on the test datasets

The determined final configurations were then also used for all further time series and image datasets.

For the time series, the training was performed with 10 % of the training data in order to save computation time. Since the C-scan image dataset contains only six training images, the whole training dataset was used. The performance of the models was evaluated on the validation specimens. In the stepwise optimization scheme, different pre-processing configurations and model hyperparameters were investigated. After each step, the configuration with the highest F1-score on the validation dataset was chosen for all further steps.

For the time series data, different models were first compared using the cross-entropy (CE) loss. After the model comparison, in which the FCN was the best-performing model, it was tested whether oversampling (OS) and the choice of loss function, CE or focal loss, improve the performance. Focal loss and oversampling are strategies to compensate for issues with an imbalanced dataset. CE loss without oversampling was the best-performing configuration. Afterward, a pre-processing procedure similar to the one used for creating the C-scans was performed by bandpass filtering (BP) the data within the frequency range of the transducer and truncating (TC) the signal around the ultrasound pulse. It was found that bandpass filtering the data improves the performance. Finally, rescaling the data into the decibel scale resulted in a higher F1-score.
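
The two pre-processing steps that improved the time series models, bandpass filtering and rescaling to the decibel scale, can be sketched as follows. The filter type is not specified in the text, so an ideal filter in the frequency domain is assumed here; the reference to the signal maximum and the floor of -80 dB are likewise assumptions.

```python
import numpy as np

def bandpass_fft(ascan, fs, f_lo, f_hi):
    """Ideal bandpass: zero all frequency bins outside [f_lo, f_hi]."""
    spectrum = np.fft.rfft(ascan)
    freqs = np.fft.rfftfreq(len(ascan), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(ascan))

def to_db(signal, floor_db=-80.0):
    """Magnitude in dB relative to the signal maximum, clipped at floor_db."""
    magnitude = np.abs(signal)
    ref = magnitude.max()
    if ref == 0.0:
        return np.full_like(magnitude, floor_db)
    floor = ref * 10.0 ** (floor_db / 20.0)
    return 20.0 * np.log10(np.maximum(magnitude, floor) / ref)
```

A typical call for the 400 kHz transmission data would be `to_db(bandpass_fft(ascan, fs, 300e3, 500e3))`, where the band limits and the sampling rate `fs` are placeholders that depend on the transducer and the acquisition hardware.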

The C-scan dataset was converted to the dB scale and then normalized. The starting configuration was a pre-trained model and random cropping with a patch size of 96 \( \times \) 96 pixels. In the model comparison step, ResNet50 had the best performance, and using a pre-trained model also led to a higher F1-score. For the C-scans, CE loss also resulted in a higher F1-score. In the last step, it was tested whether further augmentation, besides random cropping, leads to a higher F1-score. The augmentation methods used were rotation, flipping, and rescaling, i.e., no augmentation methods that change the amplitude of the data. Augmentation improved the performance. The final F1-score for the C-scans was 0.7813, which is higher than the 0.2011 reached for the time series data. It should be mentioned that only a fraction of the dataset was used for training the time series models.
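
Amplitude-preserving augmentation of a C-scan and its label can be sketched as follows. This is a simplified illustration and not the fastai pipeline used in this study: the function name is hypothetical, rotations are restricted to multiples of 90 degrees, and rescaling is omitted.

```python
import numpy as np

def random_patch(image, mask, rng, size=96):
    """Random crop plus random flip and 90-degree rotation of image and mask."""
    h, w = image.shape
    if h < size or w < size:
        raise ValueError("image is smaller than the requested patch size")
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    img = image[top:top + size, left:left + size]
    msk = mask[top:top + size, left:left + size]
    if rng.random() < 0.5:                 # horizontal flip
        img, msk = np.fliplr(img), np.fliplr(msk)
    k = int(rng.integers(0, 4))            # number of 90-degree rotations
    img, msk = np.rot90(img, k), np.rot90(msk, k)
    # Geometric transforms only: pixel amplitudes are never modified.
    return img.copy(), msk.copy()
```

The same geometric transform is applied to the image and to its label, so the pixel-wise correspondence between C-scan and ground truth is preserved.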

In general, for time series similar techniques for pre-processing the data as for C-scans helped to improve the performance. In particular, bandpass filtering the data in the frequency range of the transducer leads to a significantly higher F1-score.

4.2 Comparison of the Metrics and the Segmentation Results

After determining the configurations with the highest F1-score (see Sect. 4.1) for the time series and image data, these configurations were trained on the entire training datasets of all transducer and receiver arrangements and tested on the test datasets. Additionally, the output of the time series model was generated with a leave-one-specimen-out cross-validation scheme and was used as input for the image model. For this, the configuration of the best-performing image model from Sect. 4.1 was also used. The metrics on the test dataset were calculated for all trained models and are given in Table 1. Furthermore, the metrics were computed based on the thresholded C-scans. For the measurements involving the through-transmission setup, the 6 dB criterion was employed. However, as this criterion is not suitable for Lamb wave measurements, the threshold for the Lamb wave data was determined differently: the F1-score was evaluated for a set of 1000 different thresholds, evenly distributed within the range of values present in the image data, and the threshold that resulted in the highest F1-score was selected.
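
The threshold selection for the Lamb wave C-scans can be sketched as follows. The function name is hypothetical, and the assumption that defects appear as low amplitudes (`defect_is_low=True`) is a choice made for this illustration.

```python
import numpy as np

def best_f1_threshold(cscan, label, n_thresholds=1000, defect_is_low=True):
    """Sweep evenly spaced thresholds and return (threshold, F1) of the best."""
    thresholds = np.linspace(cscan.min(), cscan.max(), n_thresholds)
    label = label.astype(bool)
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        pred = cscan < t if defect_is_low else cscan > t
        tp = np.count_nonzero(pred & label)
        fp = np.count_nonzero(pred & ~label)
        fn = np.count_nonzero(~pred & label)
        # Equivalent to the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Because the threshold is selected with knowledge of the labels, the resulting F1-score is the best value that any of the tested global thresholds can reach on that image.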

Fig. 10

Stepwise optimization scheme for the time series data (A-scans) in a and image data (C-scans) in b. For the optimization scheme the 400 kHz transmission dataset was used. The decision after each step was made based on the highest F1-score on the validation dataset, which is highlighted in red. The abbreviations of the respective steps are the following: oversampling (OS), cross-entropy (CE), and bandpass filtering (BP). The compared models are described in [22] for the time series models, in [20] for the image models, and in the documentation of fastai. The default configuration of the models was used

Most of the models show a high accuracy of more than 0.95. However, this is not really meaningful because of the high data imbalance. For the acrylic test specimen, classifying all pixels as having no defect would lead to an accuracy of 0.9809, and for the CFRP test specimen to an accuracy of 0.9691. In general, the 400 kHz through-transmission arrangements show the best performance based on all metrics. The reason for this may be that the labels were generated for this data based on a camera photo. Furthermore, the distortion or stretching, which is visible in the Lamb wave C-scans, is not present in the transmission C-scans, and the contrast between defect and no-defect regions is also higher for the transmission C-scans than for the Lamb wave C-scans.
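
The baseline accuracies quoted above follow directly from Eq. (2). If every pixel is classified as having no defect, then TP = FP = 0, and with \(p\) denoting the fraction of defect pixels in the test specimen (a symbol introduced here for illustration) the accuracy becomes

$$\begin{aligned} \text {accuracy}=\frac{\text {TN}}{\text {TN}+\text {FN}}=1-p. \end{aligned}$$

An accuracy of 0.9809 therefore corresponds to \(p = 0.0191\) for the acrylic test specimen, and 0.9691 corresponds to \(p = 0.0309\) for the CFRP test specimen. In both cases the recall of this trivial classifier is 0, since no defect pixel is found.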

For the acrylic specimens, the thresholded C-scan has the highest F1-score for the transmission data. The deep learning approach with the best performance is here the time series model. Especially for smaller defects, it is able to capture the size of the defect more precisely. This can be seen in Figs. 11a and 12.

Fig. 11

Binarized prediction of the deep learning models on the dataset of a the acrylic and b the CFRP specimens for the different arrangements and models. In order to visualize the defect with a higher resolution a part of the specimens was cropped. The boundary of the ground truth is marked in red

Generally, the image-based models reach a higher F1-score for the Lamb wave prediction than the time series models do. This is due to the lower precision of the time series models, resulting from false positives, which can also be seen in Fig. 11a. The recall, however, is higher for the time series models than for the image models, which is due to fewer false negatives. The time series and image models that combine the horizontal and vertical Lamb wave data perform overall better than the time series models using just the horizontal or vertical Lamb wave data. The horizontal or vertical Lamb wave predictions of the time series and image models are stretched in the respective direction, see Fig. 11a. This is compensated by using both the horizontal and vertical Lamb wave data. For the acrylic specimens, the models trained on the time series prediction outperform the image models only for the transmission data.

For the CFRP specimens and the 400 kHz transmission dataset, the image models show the overall best performance based on the F1-score. In contrast to the acrylic specimen, a lot of false positives are present in the prediction by the time series model, which can be seen in Fig. 11b. However, recall is higher for the transmission time series model than for the image models. This means that there are fewer FN and the shape of the defect was predicted more accurately, see Fig. 11b. The F1-score of the thresholded C-scan is lower than that of the image model. One possible explanation for this discrepancy is the post-processing applied to the CFRP specimen labels. Although these labels were initially generated from the C-scan data, subsequent digital image processing was performed to eliminate false positives outside the defects’ actual boundaries and to fill holes within the defect regions.

The F1-score for the time series Lamb wave models is lower for the CFRP than for the acrylic specimens. A reason for this may be that the amplitudes take both relatively high and relatively low values within the defects (see Fig. 5). This is due to resonance effects and was also observed by [23] when using lower frequencies and a broadband receiver. For the vertical Lamb wave, the image model trained on the prediction from the time series model showed a better result than the segmentation with the C-scans.

In Fig. 12, the segmentations produced by the best-performing deep learning models, selected by F1-score for each dataset (as detailed in Table 1), are shown together with the thresholded C-scans. Notably, the thresholded C-scans of the Lamb wave images tend to mark defect regions beyond the boundary of the ground truth. The deep learning models perform notably better on these datasets, as can be seen from the metrics in Table 1.

For the through-transmission measurements, the deep learning approaches and the thresholding of the C-scans yield comparable results. For the acrylic specimens, the thresholded C-scans reach a higher F1-score than the segmentation carried out by a deep learning model. However, thresholding falls short in detecting smaller defects, which the deep learning model identifies.

The results for the CFRP specimens measured in a through-transmission arrangement are similar for the deep learning approach and the thresholding of the C-scans. Further, the thresholded C-scans show false negatives or holes within the defects, which are less pronounced with the deep learning model.
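The baseline used in this comparison, thresholding an amplitude C-scan, can be sketched as follows. This is a minimal illustration and not the procedure used in this work: it assumes that the C-scan is given as positive linear amplitudes, that the dB scale is referenced to the maximum amplitude, and that a pixel counts as defective when it falls below a fixed level. The value of -6 dB and all function names are hypothetical.

```python
import math


def to_db(amplitudes, floor=1e-12):
    """Convert a C-scan of linear amplitudes (nested lists) to dB
    relative to its maximum amplitude (assumed reference)."""
    ref = max(a for row in amplitudes for a in row)
    if ref <= 0:
        raise ValueError("C-scan must contain at least one positive amplitude")
    # The floor avoids log10(0) for pixels without any signal.
    return [[20.0 * math.log10(max(a, floor) / ref) for a in row]
            for row in amplitudes]


def threshold_cscan(amplitudes, threshold_db=-6.0):
    """Binary defect mask: 1 where the amplitude drops below threshold_db."""
    return [[1 if value < threshold_db else 0 for value in row]
            for row in to_db(amplitudes)]


# Hypothetical through-transmission C-scan: one strongly attenuated pixel
# (0.2) and one weakly attenuated pixel (0.6) next to it.
cscan = [[1.00, 0.90, 1.00],
         [0.95, 0.20, 0.60],
         [1.00, 0.90, 1.00]]

print(threshold_cscan(cscan))  # -> [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

In this toy example the weakly attenuated pixel stays above the threshold and is not marked. A single global threshold is sensitive in this way, which may be one reason why small defects or the interior of larger defects are missed.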

Fig. 12

Binarized predictions from the best-performing deep learning model are depicted in white, while the thresholded C-scan is shown in green. To visualize the defects at a higher resolution, only a cropped part of each specimen is shown. The boundary of the ground truth is marked in red

5 Conclusions and Outlook

In this work, three deep learning approaches for the classification and segmentation of ACU testing data in the spatial and temporal domain were presented. The specimens were made of acrylic and CFRP. First, an optimization scheme was conducted in order to find configurations with a good performance for time series and image data. This optimization scheme was performed with the through-transmission measurements, and the final configurations were applied to all further time series and image datasets. The results of the deep learning approaches were compared to the thresholded C-scans.

It was shown that for through-transmission measurements, time series models, which were trained on the A-scans, perform well, in some cases even better than segmentation models trained on C-scans. In these cases, the thresholded C-scans also show a good performance. In this particular transducer-receiver configuration, it becomes feasible to perform an in-situ evaluation of the A-scan. This capability not only enhances automation but also allows the scan to be halted promptly upon defect detection. In general, the predictions of the deep learning models exhibit a higher contrast between the defect and non-defect regions, and they also perform better at detecting smaller defects. This attribute is more relevant for in-situ evaluations than for precise estimations of the defect shape.

For measurements conducted with Lamb waves, which lead to a stretching of the defects, the time series model shows poor performance in comparison to the image models. On the one hand, this can be due to optimizing the model configuration on the through-transmission dataset. On the other hand, a reason for this can be the labeling, which is inaccurate for the Lamb wave measurements because they stretch the defect. The data were labeled based on the transmission measurements, which give a more precise estimate of the defect size and shape. This was done because the actual physical defect shape should be reproduced. In contrast to this, [7] labeled the Lamb wave measurements based on the C-scans of the Lamb wave measurements. In [7], a higher accuracy than in this work was reached for Lamb wave measurement data classified based on single A-scans. However, in [7], only the shape of the labeled defect could be reproduced, which does not represent the actual shape of the defect due to distortion.

In general, in cases where thresholding of the amplitude images leads to ambiguous results, e.g., due to plate resonances (see [23]), deep learning models could compensate for this. This often occurs for CFRP specimens and was also present for the Lamb wave measurements in this study. Further, in future studies, we want to label the ACU scans with higher-resolution NDT methods, e.g., X-ray computed tomography, in order to improve the resolution of the predictions.

Additionally, the Lamb wave measurements were conducted in two directions perpendicular to each other. Combining both measurements in a time series or image model leads to better performance in most cases. Furthermore, it was tested whether the segmentation results improve when the outcome of the time series model is used as an input image for an image model. For this, the output of the time series model was reconstructed into an image. The segmentation worked well with this approach when the time series models also performed well. In order to improve the usage of information from the spatial and temporal domain, one model which makes use of both types of information should be trained. A possible model for this is a ConvLSTM. Our future plans involve exploring a more straightforward method for combining the two Lamb waves that were measured orthogonally to each other. One possible approach is to fuse the resulting images, for example by multiplying them together. Further, we want to test our approaches on production parts with, e.g., CFRP impact damages. Finally, the uncertainty of the prediction can be explored. For estimating the uncertainty with a sigmoid, a calibration step is usually required since the model could be in a local minimum. In [24], this is accomplished by using an ensemble of models. Another possible method is Monte Carlo dropout.
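The proposed fusion of the two orthogonal Lamb wave results by multiplication can be sketched as follows. This is an untested idea from the outlook, not a method evaluated in this work. The sketch assumes that both inputs are per-pixel defect probabilities in the range 0 to 1 on the same measuring grid; the function names, the toy values and the cutoff of 0.5 are hypothetical.

```python
def fuse_by_multiplication(horizontal, vertical):
    """Pixel-wise product of two equally shaped probability maps (nested lists)."""
    same_shape = len(horizontal) == len(vertical) and all(
        len(h_row) == len(v_row) for h_row, v_row in zip(horizontal, vertical)
    )
    if not same_shape:
        raise ValueError("both probability maps must have the same shape")
    return [[h * v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


def binarize(prob_map, cutoff=0.5):
    """Binary mask: 1 where the fused value reaches the (assumed) cutoff."""
    return [[1 if p >= cutoff else 0 for p in row] for row in prob_map]


# Hypothetical predictions for a single-pixel defect in the center:
# the horizontal measurement smears it along the row,
# the vertical measurement smears it along the column.
horizontal = [[0.1, 0.1, 0.1],
              [0.9, 0.9, 0.9],
              [0.1, 0.1, 0.1]]
vertical = [[0.1, 0.9, 0.1],
            [0.1, 0.9, 0.1],
            [0.1, 0.9, 0.1]]

fused = fuse_by_multiplication(horizontal, vertical)
print(binarize(fused))  # -> [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

The product is large only where both directions indicate a defect, so the stretching along each single direction is suppressed in this toy case. A drawback of this choice is that a defect missed by one of the two measurements is suppressed as well.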