1 Introduction

Digitalization is transforming industrial radiography. Computed radiography and digital radiography provide opportunities to make inspections more efficient and reliable. One of the advantages of these technologies is the possibility to use automatic data analysis tools. While these tools were previously limited to rudimentary or highly repetitive tasks, developments in artificial intelligence (AI), specifically solutions utilizing machine learning, have considerably enhanced their applicability. Machine learning applied in non-destructive evaluation (NDE) has so far focused mainly on making improvements to algorithms and proving their effectiveness quantitatively. Machine learning approaches and an increased use of automated tools, however, introduce a number of practical considerations that must be addressed for them to become viable, such as the choice of hardware and the design of a human–machine interface (HMI).

[1] provide an introduction to the concept of NDE 4.0, which they define as “Non-destructive evaluation through confluence of digital technologies and physical inspection methods for safety and economic value”. AI is a key component of NDE 4.0, as smarter and more efficient digital capabilities are required to achieve the cyber-physical NDE systems envisioned. Potential applications of AI for the NDE industry include fully automatic inspection systems, assisting tools for inspectors, and data visualisation and statistical analysis. [2] describes NDE development towards prognostics, i.e., from fault surveillance to predicting service life and setting service frequencies. This involves combining select NDE data with other engineering disciplines. Similarly, in manufacturing environments, it is desirable to utilize NDE more effectively to predict and mitigate production issues.

The use of AI as a part of NDE imposes new requirements on the personnel involved. [3] propose a shift in the infrastructure and roles of people performing inspections due to changes induced by NDE 4.0. They argue that the focus of personnel performing manual data acquisition and analysis will change due to partially or fully automatic tools. [4] provides similar proposals of roles for inspectors faced with new HMIs.

As automatic methods become more effective, it is increasingly important to study the deployment of such methods in real scenarios. It is of interest to map out the requirements for hardware, installation, and integration into workflows; the need to change the roles of workers; and considerations in human-centered design, user experience and explainable AI. First steps also need to be taken in achieving better connectivity between the data produced by NDE and the rest of the manufacturing process – a key promise of NDE 4.0.

1.1 Weld Porosity

Weld defects, described and classified in e.g., ISO 6520-1:2007 [5], degrade the structural properties of welded components. Porosity is especially common, and its severity is related to pore size and frequency. This is reflected in standardised acceptance levels like ISO 5817:2014 [6]. Thus, there is wide interest in understanding the formation of porosity and reducing it. [7] studied porosity distribution and its effects on tensile strength in laser hybrid welding. They found welding speed and heat source distance to impact porosity significantly, and reported lower tensile and fatigue strength in welds with more porosity.

[8] studied the effects of shielding gas on porosity in laser welds. To measure porosity, they combined qualitative assessment by radiography and a quantitative analysis by digital segmentation of porosity in weld samples by thresholding computed tomography data. Studying porosity in production welds is usually limited to qualitative analysis, because a threshold cannot be used to segment a highly variable dataset, and counting, measuring and reporting the locations of all pores manually from radiographs is not economically feasible outside of small-scale experiments.

1.2 Machine Learning Methods for Radiography

The application of automated analysis tools for industrial radiography has been studied for a wide variety of applications. These include image enhancement [9], defect recognition [10,11,12], classification [13], the application of acceptance criteria [14] and image quality assessment [15]. Computer vision methods have been successfully utilized in cases that involve controlled environments and repetitive tasks [9]. Machine learning has broadened the range of applications available for automatic analysis. The most notable application areas for machine learning in radiography are castings [10, 11, 16] and weld inspections [12, 17,18,19,20,21]. Other areas include solid propellant [22], airport baggage [23], and food products [24].

Machine learning methods encompass a variety of algorithms that use a dataset to optimize to a desired behaviour. Deep learning, a subset of machine learning, uses computations with several processing steps and integrates feature extraction in the optimization process, i.e., the features used in the decisions are not manually programmed beforehand [25]. This strategy has generalized well across a multitude of image recognition tasks, especially after the first successful implementations of convolutional neural networks (CNNs) [26], which have proved to be effective at extracting relevant image features in various domains, as shown, e.g., by [27].

Machine learning techniques for regular image recognition are largely compatible with radiographs. Examples of architectures for classification are variations of VGG [28], used by [10] to detect defective areas in castings, and EfficientNet [29], used by [18] to classify weld defects into categories. For object detection, [21] modified a Faster-RCNN [30] to localize weld defects, [11] found YOLO-networks [31] to be best at localizing porosity in castings, while [12] used a feature pyramid network [32] for the same application. To perform semantic segmentation, U-net [33] was used by [14] for weld flaws and Mask-RCNN [34] was used by [22] for solid propellant defects. Because of the similarity of radiographs to natural images, transfer learning [35] has been proposed to leverage the large-scale datasets and compute available from more general image recognition tasks. [10] reached a modest improvement by pre-training a small CNN on ImageNet, while fine-tuning large pre-trained models like VGG-19 [28] did not lead to satisfactory results, which they attributed to overfitting.

Challenges specific to radiographs include low contrast and blurriness and small indications with low signal-to-noise ratio. [13] argued that background noise in radiographs may be overly amplified in the common max-pool operation, and proposed an adaptive pooling strategy to treat the pooling of background and defect edge areas differently. They reported better accuracy in comparison to max-pooling and average pooling. Defects can occur at a multitude of different scales, which is generally difficult for segmentation. To take this into account, [16] proposed an approach with an adaptive receptive field block, and reported an improvement in accuracy.

Some aspects of the field of industrial radiography make the adoption of AI tools challenging. The objectives and specifications in radiography are at least partially specific to components and use cases, making results difficult to generalize. Models trained on the public GDXray database [36], for instance, are unlikely to be directly applicable to other datasets. Individual inspections rarely produce very large amounts of data, which is problematic for training neural networks that often require thousands or millions of labeled examples. Although guided by standardisation like ISO 10675 [37], inspection procedures and evaluation metrics vary between industries and organizations, as well as individual operators. Because of this, the ground truth labeling for dataset generation is difficult. Although inspection results are archived, they are usually not recorded at a level of detail, or in a format, suitable for direct use as annotated training data for AI.

Because individual inspection tasks are relatively narrow in relation to more general image recognition tasks, strategies specific to radiography have been proposed to improve results. Simulation and data-augmentation [38] strategies have been used extensively to expand limited datasets. [22] added simulated artificial defects using computer-aided design models for improved accuracy. [11] simulated porosity in castings by superimposing ellipsoidal shapes onto radiographs. [14] extracted defect signals and implanted them onto new locations to augment the dataset. To build robustness against unclear ground truth labeling, [16] used small perturbations to ground truth masks as a data augmentation method, reporting improvements in accuracy.
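As an illustration of the mask-perturbation idea reported by [16], the following is a minimal sketch assuming a NumPy/SciPy implementation; the perturbation magnitudes and probabilities are hypothetical choices, not values taken from the cited work.

```python
# Hedged sketch of ground-truth mask perturbation as a data augmentation step,
# in the spirit of [16]. Magnitudes and probabilities are illustrative assumptions.
import numpy as np
from scipy import ndimage

def perturb_mask(mask: np.ndarray, max_iterations: int = 2, rng=None) -> np.ndarray:
    """Randomly grow or shrink a binary ground-truth mask by a few pixels."""
    rng = rng or np.random.default_rng()
    n = int(rng.integers(0, max_iterations + 1))
    if n == 0:
        return mask
    if rng.random() < 0.5:
        out = ndimage.binary_dilation(mask, iterations=n)
    else:
        out = ndimage.binary_erosion(mask, iterations=n)
    return out.astype(mask.dtype)
```

Applied on the fly during training, such perturbations discourage the model from over-fitting to the exact, partly subjective, boundary placement of the annotations.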

Ensemble learning [39] is a technique where multiple machine learning models solving the same objective are combined, e.g., by averaging, to increase performance. In contrast, some architectures, like the HydraNets proposed by [40], consist of branches specialized in different sub-tasks that are combined for a final result. Inspections often consist of several steps, giving an opportunity to use a combination of multiple models trained for different objectives instead of a single model with multiple outputs. In this work, we call this a multi-model approach. [18] used a multi-model approach for defect recognition. They trained a model to perform a weld segmentation step, to disregard unnecessary parts of radiographs like rulers and image quality indicators. They then filtered out welds containing no defects with a separate classification network before using a third model for defect segmentation, as sketched below. While using multiple consecutive models requires more computational resources, it allows process steps to be trained and validated separately. When training multi-output models (as opposed to multiple independent single-task models), additional challenges arise, as described by [41]. These include the need for multi-variate loss functions and the imbalance of training samples between the outputs.
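A minimal sketch of such a multi-model pipeline is given below. The models are passed in as placeholder callables and the probability threshold is a hypothetical value; this illustrates the general strategy, not the implementation of [18] or of this work.

```python
# Illustrative multi-model inference pipeline: weld segmentation, a defect/no-defect
# classifier as a filter, then defect segmentation. All models are placeholders.
from typing import Callable
import numpy as np

def multi_model_inference(
    image: np.ndarray,
    segment_weld: Callable[[np.ndarray], np.ndarray],     # returns a binary weld mask
    defect_probability: Callable[[np.ndarray], float],    # returns P(any defect present)
    segment_defects: Callable[[np.ndarray], np.ndarray],  # returns a defect mask
    threshold: float = 0.5,                                # hypothetical cut-off
) -> np.ndarray:
    weld_mask = segment_weld(image)
    roi = image * weld_mask                                # discard rulers, IQIs, background
    if defect_probability(roi) < threshold:
        return np.zeros_like(weld_mask)                    # early exit: no defects suggested
    return segment_defects(roi) * weld_mask                # restrict defect calls to the weld
```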

1.3 Hardware Considerations

Using AI for computer vision requires sufficient computational resources, determined by the size of the input data and the capacity of the AI model. Some architectures like YOLO-nets [31] and U-nets [33] have been designed to be quick for inference, achieving for instance real-time processing of video using only a central processing unit. Radiographs, however, are larger than most images, sometimes surpassing \(10\,000\times 10\,000\) pixels, with indications that may only be a few pixels wide. Because of this, radiographs cannot be downsampled as considerably as regular images for fast processing, making their analysis particularly resource-intensive.

The prevalent method for analysing large amounts of data is cloud computing [42], where data, along with the request, is sent over the internet to a network of computational services that perform the computation remotely and send the results back. The ability to scale resources according to current needs, as well as ease of use, since no specialized hardware is needed on the client side, are the core benefits of cloud computing. Improving data security, on the other hand, is mentioned as an open research issue by [42].

[43] provide a review of edge computing [44] for industry 4.0. The term edge computing means computation on location, near the sources of data, as opposed to in the cloud. They attribute the desire for edge computing to the need for reduced latency and energy usage, better security, and more efficient analysis of data produced by distributed internet of things devices. Edge computing applications in industrial settings include processing sensor data, augmented reality tools, and maintenance management [43]. Several aspects of NDE make edge computing a favourable option: NDE locations are distributed and sometimes moving, produce large data that need to be analysed without delay, and often impose strict security requirements. With edge computing, NDE data can be analysed quickly and securely before sending results forward.

The use of edge computing to run AI models is often referred to as edge AI [45]. This often involves the use of graphics processing units, which dramatically increase computational efficiency for both the training and inference phases of machine learning models. Compute modules for performing edge AI are commercially available, for instance the NVIDIA Jetson series [46], the Google Edge TPU [47] and field programmable gate arrays [48].

1.4 Practical Experience of Tools Based on AI

As AI tools for radiography are just beginning to be commercialized, the studies regarding their use in production environments are limited. To expand the available material, we can make observations from radiography in other industries, or NDE in other modalities.

Some studies on the use of commercialized AI tools exist in medical X-ray. In a recent survey conducted by the European Society of Radiology [49], European radiologists were asked about experiences using AI in clinical settings. They found that while the tools were mostly considered to perform reliably, only 22.7 % saw a reduction in workload. This highlights that well performing machine learning models, on their own, are not sufficient to improve inspection workflows. Rather, a suitable use case and deployment method are necessary to benefit from AI tools. Successful implementations exist, however – [50] measured an improvement in bone fracture detection accuracy when radiologists were using an AI tool.

[51] studied potential adverse effects of using automation in NDE. Participants exhibited both over-reliant behaviour (trusting incorrect annotations) and overly sceptical behaviour (changing correct markings to incorrect) when told to control the results of an automated system on eddy current testing data. Clearly, the user interface (UI) affects the usefulness of an automatic analysis component greatly.

1.5 Basis and Goals of the Study

This study is based on an earlier machine learning system proposed by [14]. The system was developed to perform automatic defect recognition for welds in an aerospace component in production, and reached a sufficient probability of detection (POD) for being deployed as an aid for the inspector, with an \(a_{90/95}\) of 0.60 mm. It was identified as future work that such a system could be used for statistical analysis to improve the manufacturing process. A preliminary field trial was performed, indicating good potential for adoption into the inspection process. A number of areas of improvement to the HMI were identified: running the analysis took a long time on the chosen hardware, the model exhibited excessive false calls outside the regions of interest, which some participants deemed distracting, and the markings made by the AI were sometimes found to be difficult to explain.

The contributions of this work are as follows. First, we adopt a multi-model approach by adding a weld segmentation model for faster inference speed and false call rate reduction, to meet the needs of field use (Sect. 2.2). Secondly, a UI for using the tool is proposed, with added functionality towards explainable AI (Sect. 2.4). Thirdly, we perform statistical analysis of the AI indications to gain insights into weld defect formation, and propose methods to use these statistics for process monitoring and improvement (Sect. 2.5).

2 Materials and Methods

The study was separated into the following phases. First, welds in 222 radiographs were manually annotated to gather a dataset for training a model for segmenting the welds. Secondly, a windowing strategy for fast inference using the segmentation information was developed. Next, the improved model was deployed on an edge AI device, and a UI was designed to facilitate use in a practical inspection setting. The UI was qualitatively evaluated by expert inspectors. Finally, a new dataset was collected to perform statistical analysis of the AI indications. A part of this dataset was manually annotated to measure the performance of the model and to evaluate the accuracy of the analysis.

2.1 Use Case

The inspection in this study is the same as in our earlier work [14]. Welds in an aerospace component are inspected using X-ray computed radiography. The inspection is a significant workload in the manufacturing process, requiring a large amount of resources. In addition to providing quality control for each part, the X-ray NDE results are used to study and improve the manufacturing process. For instance, all porosity is occasionally manually counted and measured to discover changes introduced by new process parameters, while normally only the unacceptable indications are noted.

2.2 Weld Segmentation

The motivation behind adding a weld segmentation step is two-fold. First, limiting predictions to the areas of interest eliminates a majority of the false calls which occurred in the previous study [14]. Secondly, less data needs to be processed by the defect recognition model, which improves inference speed significantly.

Similarly to defect recognition, the nature of radiography imposes some challenges for weld detection, illustrated in Fig. 1. Low contrast and soft boundaries make the segmentation task more difficult and introduce some ambiguity in ground truth annotation. Geometric features that resemble welds make some simple pattern recognition algorithms impractical. Additionally, specific to this inspection, the material includes areas of interest near weld junctions where the weld boundaries are completely invisible, and the inspection area is recognised by inspectors by observing nearby weld boundaries. In comparison to the RGB images common in image recognition tasks, the radiographs have a bit depth of 16 bits (as opposed to 8 bits per color channel), and the lack of color makes different features of the image less distinct.

Fig. 1

Challenging elements for weld segmentation in the material. a Part geometry that resembles a weld. Weld shown by a white arrow, geometry shown by a black arrow. b Invisible weld boundaries, shown by a dashed rectangle. The area is recognized by observing visible weld boundaries in the upper right corner

A modified U-net [33] architecture, similar to the defect recognition model, is used for the weld segmentation, for its fast inference speed, simple implementation, and good performance on similar objectives. We pre-process the images with a Gaussian filter (\(\sigma = 15\) px) and normalize each image by subtracting the mean gray-value and dividing by the standard deviation.
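For reference, the pre-processing step can be expressed as follows; a minimal sketch assuming a NumPy/SciPy implementation and 32-bit float conversion of the 16-bit radiographs.

```python
# Sketch of the described pre-processing: Gaussian filtering (sigma = 15 px)
# followed by per-image standardisation. Dtype handling is an assumption.
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess(radiograph: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    img = gaussian_filter(radiograph.astype(np.float32), sigma=sigma)
    return (img - img.mean()) / (img.std() + 1e-8)  # zero mean, unit variance per image
```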

The radiographs in question are very large images, sometimes exceeding \(10\,000\times 10\,000\) pixels, and inspectors evaluate them by zooming to small portions of the image at a time. However, CNNs have limited capacity for input size (e.g., \(416\times 416\) pixels being common for models trained on ImageNet [52]). This forces a trade-off between pixel accuracy, contextual information and model size, as shown in Fig. 2. Pixel accuracy is lost if the images are downsampled, while the entire image or a large portion of it can be used as input, giving more contextual information to the model. If the image is divided into smaller windows before inference, the resolution (and thus the accuracy) can be preserved, while the available information from the areas surrounding the region of interest (ROI) is weakened. Accuracy and contextual information can both be kept if the model input size is increased, however this requires more memory and slows down inference. Input size can be increased more if model depth and number of convolution filters are reduced, however, this lowers the model’s segmentation capacity.

Fig. 2

Three strategies for performing machine learning inference on very large images. a Downsampling the image, losing pixel accuracy. b Converting the image to a batch of smaller sections, losing contextual information. c Resizing the model input to a larger size, limited by available memory

A working trade-off was found via the following considerations. Since no accurate sizing performance is required for weld detection, the input images can be downsampled more than for defect recognition. Because inference speed is essential, and the contextual information relevant to the task is at the scale of the entire image, we prioritize input size over model capacity. Thus, the images are resized to one fourth of the original size, and the number of filters in each convolution layer is halved in comparison to the defect recognition model. Finally, \(1\,024\times 1\,024\) pixel windows of the entire images are extracted as input for the model. The resulting architecture is shown in Fig. 3.
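A minimal sketch of this input preparation is shown below, with plain decimation standing in for the resizing and edge padding to full windows; the interpolation method, overlap and padding strategy are assumptions rather than the production implementation.

```python
# Illustrative preparation of weld-segmentation inputs: downsample by a factor of
# four and tile into 1024 x 1024 windows. Decimation and edge padding are stand-ins
# for whatever resizing/padding the actual pipeline uses.
import numpy as np

def make_windows(image: np.ndarray, factor: int = 4, win: int = 1024):
    small = image[::factor, ::factor].astype(np.float32)
    pad_h = (-small.shape[0]) % win
    pad_w = (-small.shape[1]) % win
    small = np.pad(small, ((0, pad_h), (0, pad_w)), mode="edge")
    windows, coords = [], []
    for y in range(0, small.shape[0], win):
        for x in range(0, small.shape[1], win):
            windows.append(small[y:y + win, x:x + win])
            coords.append((y, x))          # keep positions for mapping masks back
    return np.stack(windows), coords
```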

Fig. 3

Modified U-net [33] architecture for weld segmentation. Rectangles represent the data dimensions – width and height are shown inside each rectangle and depth at the top or bottom. In total, the model has 433 474 trainable parameters

Training data for the segmentation model was gathered by manually annotating 222 images, each divided into 9 windows on average, yielding a total of about \(2\,000\) image patches for training and validation. The weld sections with invisible boundaries shown in Fig. 1 were included in the annotated masks. We used the Adam optimizer with an initial learning rate of 0.0005, and a batch size of 8. We halved the learning rate after every 300 steps if the validation loss did not improve, and used early stopping to halt training after the loss did not improve for \(1\,000\) consecutive steps. We used random flips and crops for data augmentation. The model was trained for 7300 steps on an NVIDIA RTX 3090 graphics processing unit, which took 35 min.
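The training configuration can be sketched as follows, assuming a Keras/TensorFlow implementation. The stand-in network is far shallower than the architecture of Fig. 3, and the callbacks below operate per epoch rather than per step, so the patience values are illustrative only.

```python
# Hedged sketch of the training set-up: Adam (lr 5e-4), batch size 8, learning-rate
# halving and early stopping on validation loss, as described above. The model is a
# minimal stand-in, not the modified U-net of Fig. 3.
import tensorflow as tf

def tiny_unet(input_shape=(1024, 1024, 1)) -> tf.keras.Model:
    inp = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inp)
    d = tf.keras.layers.MaxPool2D()(x)
    d = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(d)
    u = tf.keras.layers.UpSampling2D()(d)
    u = tf.keras.layers.Concatenate()([u, x])                    # skip connection
    out = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(u)  # weld / background
    return tf.keras.Model(inp, out)

model = tiny_unet()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss="binary_crossentropy")
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# model.fit(train_windows, train_masks, batch_size=8,
#           validation_data=(val_windows, val_masks), callbacks=callbacks)
```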

At inference time, the output of the weld segmentation model is used for determining the windows that get predicted by the defect recognition model. The process is shown in Fig. 4, which demonstrates how the amount of data sent to the defect recognition model is reduced. Inferred ROIs with a portion of the weld missing can lead to missed defects. This is why the estimate needs to be conservatively large. We achieve this by max-pooling the weld masks, as shown in Fig. 4c.

Fig. 4

Inference strategy utilizing the weld segmentation model. a Original image with welds as the ROI. b Binary mask output of the weld segmentation model, with the predicted weld areas in white and background in black. Very small segmented areas are discarded. c Max-pooled output, which determines the positions of the windows for the defect recognition model. d The resulting windows extracted from the image as input for the defect recognition model, shown as yellow overlapping rectangles. Areas outside the ROI are not processed, making inference faster (Color figure online)
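The post-processing of the weld mask (Fig. 4b–d) can be sketched as follows; the minimum-area threshold and window size are illustrative assumptions, not the values used in production.

```python
# Sketch of the windowing logic of Fig. 4: discard very small segments, then use a
# max-pool-like test (any weld pixel in a window activates it) so the predicted ROI
# stays conservatively large. Area threshold and window size are assumptions.
import numpy as np
from scipy import ndimage

def roi_windows(weld_mask: np.ndarray, win: int = 256, min_area: int = 500):
    labels, n = ndimage.label(weld_mask)                           # connected components
    sizes = np.asarray(ndimage.sum(weld_mask, labels, index=range(1, n + 1)))
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_area))  # drop tiny segments
    coords = []
    for y in range(0, keep.shape[0], win):
        for x in range(0, keep.shape[1], win):
            if keep[y:y + win, x:x + win].any():                   # max-pool style activation
                coords.append((y, x))                              # window for defect model
    return coords
```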

2.3 Hardware

In addition to weld segmentation, we chose suitable hardware to facilitate fast inference. An NVIDIA Jetson AGX Xavier [53] edge AI device was equipped with the defect recognition model and the weld segmentation model, both converted to TensorRT with layer and tensor fusion, kernel auto-tuning and dynamic tensor memory enabled. The device was configured to be connected by ethernet cable to either a local area network (LAN) or directly to a single computer. The LAN allows many users to use the service, while the direct connection provides a secure standalone environment. The browser-based UI can be used by all computers connected to the device either via LAN or cable. For this study, a direct connection was used.
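The text does not state which conversion path was used; one common route is to export each network to ONNX and build a TensorRT engine with the Python API, as in the hedged sketch below (layer/tensor fusion, kernel auto-tuning and dynamic tensor memory are applied automatically during engine building). The precision flag and file name are assumptions.

```python
# Hedged sketch of building a TensorRT engine from an ONNX export (TensorRT 8.x API).
# This is one possible conversion route, not necessarily the one used in this work.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, fp16: bool = True):
    builder = trt.Builder(TRT_LOGGER)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)                  # optional reduced precision
    return builder.build_serialized_network(network, config)  # serialized engine

# engine = build_engine("weld_segmentation.onnx")  # hypothetical file name
```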

2.4 User Interface

As discussed in Sect. 1.4, an unsuitable user interface can have a detrimental effect on even a well-performing automatic analysis system. To facilitate rapid iterative development, a browser-based viewing application was built. Developments were made based on the user feedback from the previous study. Specifically, there was a need to toggle annotations off, and a view displaying both indications and acceptance criteria was deemed too cluttered.

The software displays two options for automatic annotations: all indications as separate markings, and suggestions of defects or defect groups that may be unacceptable, using a set of conditions described in the previous work [14]. The display options are shown in Fig. 5. The option of showing all indications is intended to provide a transparent, explainable view of the AI indications for evaluation, while the suggestions are intended as the primary visualization during actual inspection.

To gain insight into how the proposed system works in practice, we performed a field trial. The system was installed on location. Three participants, all experts in the inspection case, evaluated the AI system and user interface qualitatively.

Fig. 5

An example of visualisations which can be toggled on or off in the user interface. a Area of a radiograph without annotations. b Green circles depicting every indication made by the AI. The annotation is simple to understand, but obstructs the view and also marks non-essential small indications. c Red area around the group of indications, warning of clustered porosity that requires the inspector’s attention. While conceptually more opaque, this style reduces annotations significantly by omitting acceptable indications, which is less distracting to the inspector

2.5 Indication Statistics

After completing model training, a new dataset of 451 images was collected for statistical analysis. The dataset was gathered by randomly choosing inspections of 15 components from a timeframe of eight months of past inspections. We collected image sets spanning a period of time to detect trends. Complete sets were picked to allow comparisons between different sections of the component. Three components (93 of the 451 radiographs) were manually annotated for model performance testing and to provide a point of comparison for the statistics produced by the AI indications. Aside from annotating the test set, the analysis was performed without any additional manual annotation. The same statistics can be calculated for an arbitrary time frame or monitored continuously.

First, we studied the size distribution of all the defects to gain insights into defect formation. Next, we plotted the average number of indications over time in different size ranges, as well as the indication size distribution over time. This was done to reveal trends in the formation of defects, and to detect events with a large deviation from the process average. Finally, we studied the average number of indications for particular welds, to find connections between defects and specific locations in the component. We used the same dataset of 451 images for all the analyses.

3 Results

3.1 Model Performance

The weld segmentation model achieved a mean intersection over union of 0.55 on the test set of 93 manually annotated images. The main error modes were false indications on features resembling welds, and short missed sections on the areas with invisible weld boundaries. Because the windowing step enlarges the predicted area, no weld sections were missed after post-processing.

For the defect detection model with weld detection pre-processing, we calculated the true positive rate (TPR) and false positive rate (FPR) on the test set, using any overlap between the ground truth and the predicted segmentation mask as the criterion for a hit, and the number of predicted radiograph patches as the number of opportunities for a false call. The model achieved a TPR of 83 % and an FPR of 0.13 false calls per patch. As discussed in the earlier work [14], the model performance is low for very small indications. For the larger half of the indications, the TPR was 96 % and the FPR was 0.02. The largest 8 % of indications had a TPR of 99 % and an FPR of 0.005. The model accuracy increased with defect size, and most of the false calls were small.
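The hit/miss bookkeeping described above can be sketched as follows, assuming per-patch binary masks and connected-component labelling; the exact matching rules used in this work may differ in detail.

```python
# Sketch of per-patch TPR/FPR bookkeeping: any overlap between a ground-truth
# defect and the prediction counts as a hit; a predicted component with no
# ground-truth overlap counts as a false call. FPR is false calls per patch.
import numpy as np
from scipy import ndimage

def patch_counts(gt_mask: np.ndarray, pred_mask: np.ndarray):
    gt_labels, n_gt = ndimage.label(gt_mask)
    pr_labels, n_pr = ndimage.label(pred_mask)
    hits = sum(1 for i in range(1, n_gt + 1) if pred_mask[gt_labels == i].any())
    false_calls = sum(1 for j in range(1, n_pr + 1) if not gt_mask[pr_labels == j].any())
    return hits, n_gt, false_calls

# Over all patches: TPR = sum(hits) / sum(n_gt); FPR = sum(false_calls) / number of patches.
```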

By adding weld segmentation, the number of image patches sent to the defect recognition model decreased from about 900 per radiograph to 5–180 depending on weld size, significantly reducing the inference times. Depending on image size, the weld segmentation takes 1–1.5 s, the defect recognition takes 1.5–3.5 s, and the total processing time (including reads, writes, and pre/post-processing) is 4–8 s on an NVIDIA Jetson AGX Xavier. Without the weld segmentation, the processing times were 15–30 s.

3.2 Field Testing

The AI indications were found to be accurate in terms of sizing. The accuracy of indications in the small to medium size range was found to be satisfactory. Some larger missed indications were noted, which were reported as the most important area for improvement. The reduced false call rate was found to be a significant improvement; however, further reduction of false calls would be required, as they remained a distraction to the inspector.

The model was found to be fairly explainable. Markings for porosity were easy to understand. Markings for linear indications were found to be confusing at small sizes – a reliable call on linearity could not be made at scales close to the limit of detectability. The false linear indications were caused by noise, geometry, and poorly segmented regular pores. On non-weld areas erroneously marked to be inspected, the model exhibited unstable behaviour, which was found to be difficult to explain. Participants were in favour of training or education on the system, though a specific area requiring training was not indicated.

Potential benefits attributed to using the AI indications were focused on improved accuracy. The tool was not found to have an effect on inspection speed, as inspectors are required to go through each area manually. In addition to improved model performance, a more comprehensive integration into the display set-up used and the specific needs for image presentation would be necessary for an in-depth benefit assessment of the proposed user interface. Significant reductions to inspection time would likely only result from more automated use, e.g., manually inspecting only those radiographs where the AI suggests unacceptable defects.

3.3 Indication Statistics

Flaw distribution. The indication size distribution is shown in Fig. 6a. The number of observed indications first increases, then decreases with increasing flaw size. This can be explained as follows. The number of observed indications depends on two competing factors: the number of observable indications and the POD. The POD is expected to be close to zero for small defects and to increase rapidly to near 1 after some detectability threshold. Thus, the initial increase in the observed indications can be attributed to the increasing POD. At some indication size after the number of observations peaks, the POD is close to 1 and the distribution mainly displays the existing distribution of the flaws of interest. This exhibits a commonly observed behaviour, where the frequency of occurrences decreases with increasing flaw size. The POD, as described in MIL-HDBK-1823A [54], can be modeled using the logistic functional form:

$$\begin{aligned} POD(a) = \frac{1}{1 + e^{\alpha + \beta a}} \end{aligned}$$
(1)

where a is the defect size, and \(\alpha \) and \(\beta \) are parameters solved numerically. It is commonly assumed that the POD is zero for very small indications and reaches 1 for flaws of sufficient size. These assumptions are used here. We use a power-law distribution to model the observed decrease of indication counts with increasing flaw size:

$$\begin{aligned} P(a)=ka^{b} \end{aligned}$$
(2)

where a is the flaw size, and k and \(b < 0\) are experimentally determined parameters. Finally, to translate from a number of observations to the probability of observing a flaw or the expected number of flaws, we need to normalize the number of observations by the characteristic number of opportunities for an observation. For simplicity, the counts here were normalized by the number of images (451), thus yielding the expected number of flaws per image versus the flaw size.

With this, we can model the observed indication count I as a product of the flaw probability (modeled by the power law) and the POD (modeled by the logistic POD function), both being functions of flaw size: \({I(a) = POD(a)\times P(a)}\). This yields four parameters, which can be fitted to the observations. There are a number of ways to perform such a fit; for the present case, a simple gradient descent with a squared-error loss was used. It is noteworthy that the resulting distribution function is highly non-linear and may exhibit instability during fitting, which may result in, e.g., unreasonable POD values. Thus, initial values for the fit were carefully chosen near the expected range to avoid such issues. The resulting power-law curve, the POD curve, and the product of the two are shown in Fig. 6b.
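A minimal sketch of such a fit is given below, using SciPy's curve_fit in place of the hand-written gradient descent; the size grid, synthetic counts and initial values are illustrative assumptions only.

```python
# Hedged sketch of fitting I(a) = POD(a) * P(a) (Eqs. 1 and 2) to a histogram of
# indication sizes. The data are synthetic stand-ins; initial values matter because
# the product model is highly non-linear.
import numpy as np
from scipy.optimize import curve_fit

def indication_model(a, alpha, beta, k, b):
    pod = 1.0 / (1.0 + np.exp(alpha + beta * a))  # Eq. (1); beta < 0 so POD rises with a
    return pod * k * np.power(a, b)               # Eq. (2); b < 0, power-law decrease

sizes = np.linspace(0.1, 2.0, 40)                                   # bin centres, mm (illustrative)
counts = indication_model(sizes, 8.0, -25.0, 0.5, -2.0)             # synthetic observations
counts += 0.01 * np.random.default_rng(0).normal(size=sizes.size)   # a little noise
params, _ = curve_fit(indication_model, sizes, counts,
                      p0=[8.0, -25.0, 0.5, -2.0], maxfev=20000)     # careful initial values
```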

The indication distribution of the manually annotated test set is plotted in Fig. 6c. The large end of the distribution is similar to the AI distribution. On the small end, the high false positive rate is visible. The distribution is less smooth, which can be attributed to the smaller dataset. The combined POD and power-law fit is repeated for the test set and shown in Fig. 6d, where a similar curve is achieved. Notably, the manually annotated ground truth does not perfectly represent the true defect distribution, since there are errors introduced by the data acquisition and both errors and bias introduced by the labeling.

Fig. 6

AI indication size distribution is a product of POD and true defect distribution. a Histogram of AI indication sizes. In total, there were 10,037 indications. b A power-law distribution (dash-dotted line), describing the true defect distribution, and a POD curve (dashed line), describing the AI model accuracy, are combined to define a model of the AI indication distribution (gray solid curve). The underlying histogram is shown in light gray. c Histogram of manually annotated indication sizes, 1752 indications in total. d Combined POD and power-law fit for the manually annotated indications

For subsequent analyses, we extract a flaw size with a high POD. At sufficiently large flaw sizes, few indications are missed, giving a more accurate view of the true distribution of flaws. A natural choice would be the \(a_{90/95}\), i.e., the flaw size detected with a \(90\,\%\) POD at \(95\,\%\) confidence. Since our POD estimation method does not yield confidence bounds, we use a higher POD threshold of \(98\,\%\), yielding a corresponding flaw size of \(a_{98}\). On the test set, this corresponds to a TPR of 96 % and an FPR of 0.02. We also extract a threshold value to describe the largest defects, because they are of special interest. By observing the distribution, this value was chosen to be \(2a_{98}\), which is at the 92nd percentile of indication sizes, and has a test set performance of 99 % TPR and 0.005 FPR.
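For reference, Eq. (1) can be inverted to obtain the flaw size at a chosen POD level p; with the fitted \(\beta < 0\), this yields a positive size, and for p = 0.98:

$$\begin{aligned} a_{p} = \frac{\ln \!\left( \tfrac{1}{p} - 1\right) - \alpha }{\beta }, \qquad a_{98} = \frac{\ln \!\left( \tfrac{1}{0.98} - 1\right) - \alpha }{\beta } \approx \frac{-3.89 - \alpha }{\beta }. \end{aligned}$$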

The accuracy of the POD estimation was analysed by calculating the hit/miss POD of the model for the test set of 93 radiographs, which yielded an \(a_{98}\) value 13 % larger than the estimate. This indicates that the modeled \(a_{98}\) is a valid approximation for statistical analysis, but does not give a conservative estimate of reliability.

Indications over time. The quantities of indications per image over time are shown in Fig. 7a. Quantities at three flaw size ranges are shown separately: all indications, those larger than \(a_{98}\), and indications larger than \(2a_{98}\). A step-like reduction of indications can be observed between the first and second halves of the studied period, shown as a change in average values in Fig. 7a. We study the significance of the difference for indications larger than \(a_{98}\), for which the model is more accurate. A t-test yields a p-value of approximately \(1.3\times 10^{-10}\) for the difference between the quantities of indications per image for the first half (237 images) and the second half (213 images), confirming statistical significance. This shows that large-scale trends in porosity can be detected automatically. The indications larger than \(2a_{98}\) show a somewhat similar trend, but exhibit more variance between individual inspections. Two factors contribute to this: firstly, larger flaws are rarer and thus occur more sporadically within the sample size, and secondly, the percentage of false indications, which may occur irregularly, is higher.
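A minimal sketch of the significance test is shown below; the per-image counts are hypothetical Poisson stand-ins, and the exact t-test variant used above is not specified in the text.

```python
# Sketch of the two-sample t-test on indications-per-image (> a_98) between the
# first and second halves of the period. The counts are synthetic stand-ins.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
first_half = rng.poisson(6.0, size=237)    # indications per image, first 237 images
second_half = rng.poisson(4.0, size=213)   # indications per image, last 213 images
t_stat, p_value = ttest_ind(first_half, second_half)  # small p-value => significant step change
```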

The indications in the three manually annotated inspections are compared with the AI indications in Fig. 7b. There are consistently more AI indications; however, they follow a similar trend.

Fig. 7

a The number of AI indications over the studied period of 8 months. Each bar represents one inspection, consisting of 30 radiographs on average. Gaps between bars represent time between individual inspections. In total there are 451 radiographs. The numbers of all indications (light gray bars), those larger than \(a_{98}\) (dark gray bars), and those larger than \(2a_{98}\) (black bars) are shown separately. The average numbers of indications larger than \(a_{98}\) before and after 2022–05 are shown as dashed lines. b The numbers of manually annotated indications (yellow, orange and red bars) and AI indications (gray bars) on the test set of three inspections, with 93 radiographs in total (Color figure online)

Fig. 8

Indication diameter over the studied period of 8 months. Gaps between boxes represent time between individual inspections. Each box represents one inspection, consisting of 30 images on average. The averages are depicted by orange lines, the box boundaries show the upper and lower quartiles, and the whiskers extend to the furthest points within 1.5 times the interquartile range

The size distribution of indications over time is shown in Fig. 8. In contrast to the quantity of indications, the average size does not differ between the first and second halves of the studied period. The largest average diameter occurs on the third inspection (close to 2022–03), which also had the largest number of indications as shown in Fig. 7a. Since more frequent and larger porosity increase the risk of failing the acceptance criteria, this component can be considered closer to rejection. The analysis indicates capability to find such components automatically.

Indications per weld section. We studied the differences between specific locations in the component to detect problematic areas. The inspection of the studied component type consists of 5 welds, further divided into a total of 31 sections, each shown in one radiograph. We refer to the welds by the numbers 1–5 and to the sections within each weld by the letters A, B, C, and so on. These sections show varying lengths of weld. To compare the frequency of the indications, we normalize the number of indications by the number of windows processed (shown as yellow squares in Fig. 4d). The number of indications per window for the different weld sections is shown in Fig. 9a. There are considerable differences between welds: weld 1 shows the highest frequency of indications, about six times more than welds 4 and 5. These differences may be attributed to the type of weld, component geometry, welding order, other differences in manufacturing parameters, or the radiographic set-up.

To study the effect of model performance, we plotted the FPR on the test set for each weld section (for indications larger than \(a_{98}\)) in Fig. 9b. Welds 1–3 show an acceptable FPR of mostly around 10 % of the indication rate (the number of indications per window), suggesting the model has the capability to detect the areas of the component that have a higher risk of failing the acceptance criteria. In contrast, welds 4 and 5 have excessive FPRs of mostly 40 % to 100 % of the indication rate. Studying the results on welds 4 and 5 showed a repeated false indication on a geometry feature, which is likely caused by an insufficient number of examples of the specific location in the training data.

Fig. 9

a Indication frequencies, i.e., number of AI indications per processed window, for different weld sections. The welds are labeled with a number, and the sections of each weld are labeled with a letter. Frequencies of all indications (light gray), those larger than \(a_{98}\) (dark gray), and those larger than \(2 a_{98}\) (black) are shown separately. b The FPR on the test set for each weld section (indications larger than \(a_{98}\)). Sections 2D, 4B and 4I had zero AI indications on the test set

4 Discussion

The weld detection model achieved a satisfactory performance, since no weld areas were missed after post-processing in the studied material. Some false weld area indications were present in the images, leading to unnecessary windows occasionally being sent to the defect recognition model. A larger dataset for training or further data augmentation (e.g., by manipulating image contrast) may improve results. The adopted weld segmentation approach differs from the one used by [18]. They resized all weld images to the same specific size after segmentation, while we used a windowing approach to handle welds in varying shapes and sizes. Our approach is more general, but also requires more computational resources. The manual annotation of weld areas introduces some subjective bias to the ground truths of the weld masks. The effects can be expected to be small, because the physical locations of the weld areas are known (in contrast to weld defects which require more interpretation).

Using the weld model, false call rates outside of weld areas were reduced significantly. This was found to improve the impression of model reliability. A majority of the radiographs still contained some false indications. We expect the FPR would be acceptable at a level where less than half of the radiographs contained any false positives. Splitting the objectives of weld segmentation and defect recognition into a multi-model approach was favorable, because the defect recognition model from the previous work [14] could be used directly without re-training.

Field testing showed similar results to studies of AI tools in medical X-ray [49, 50] – inspectors reported improved accuracy as a future benefit, but not a reduced workload. It is likely that a more comprehensive change to the NDE procedure would be required to better utilize automatic analysis. Limitations in the test set-up made a comprehensive assessment difficult.

The fit of the combined POD curve and power-law distribution to the AI indications is not exact: the curve under-predicts quantities for small sizes and mid-sizes (after the peak), and over-predicts large indications. As discussed in Sect. 3.3 the AI model makes more false calls in the small size range. This is because X-ray noise begins to resemble very small porosity, and because uncertain indications tend to result in small segmentations. Additionally, the defects are unlikely to follow a power-law distribution exactly.

Modeling the indication distribution opens a possibility for making predictions. Most notably, the probability of rarer large defects can be predicted using observations of small defects, which may be used to guide the maintenance frequency or choice of other process parameters. Modeling the POD based on observations may be used as an estimate to monitor model performance after deployment, at least to the extent that large changes in the predicted POD can be used as a signal for performing a re-evaluation of the data acquisition and model performance.

It was demonstrated that high-level trends in porosity over a long period of time can be detected by the proposed system (Fig. 7a). On the other hand, significant differences in indication frequency were found between different welds (Fig. 9a), highlighting the ability to focus on smaller-scale features.

False calls affected the indication distribution and subsequent analyses. Most notably, repeated false calls were found on welds 4 and 5 (Fig. 9b). This highlights the importance of understanding the limitations of the AI tools used. Similar over-reliance as reported by [51] for inspection tools may occur when studying automatically generated statistics. A more accurate model would naturally yield more dependable statistics, and better performance may be required before the statistics are utilized as guidance for critical process parameters. On the other hand, by studying results from an imperfect model, an area of improvement for the AI model could be determined.

Automatic defect recognition provided access to a material for data analysis that is impractical to attain via manual inspection. In comparison to previous studies on weld porosity [7, 8], the proposed technique is more suitable for large-scale statistical studies and can be integrated to existing NDE processes in manufacturing. 10,037 indications from 451 images were counted and measured for the present study, and a dataset of any size can be analysed without additional manual work. In addition to analyses on historical data, the proposed system is suitable for providing real-time statistical information to monitor and improve the welding process. This serves as a practical implementation of NDE 4.0, where data from NDE is processed in a smart way to provide value beyond traditional quality control.

5 Conclusion

We developed an AI tool to perform improved automatic data analysis of radiographs of aerospace welds, considering deployment in real industrial use as an inspection aid and as an enabling technology for statistical quality monitoring. The adoption of a multi-model approach, in the form of an added weld segmentation step, improved the inference speed and reduced false calls. Field testing of the inspection aid indicated some potential for improving accuracy, but more comprehensive changes to the inspection process would be required to lower the inspector’s workload. Automatic segmentation of weld porosity was shown to open the possibility to study the size distribution, as well as the occurrence of porosity over time and between different sections of the component. The AI indication size distribution was explained as a product of the POD and the underlying defect distribution, which means the indications can provide information on model reliability and predictive capabilities on defect formation, provided that there is sufficient validation of the robustness of the data acquisition and the model performance. We demonstrated that AI indications can be used to detect trends and anomalies over time, and that the analysis can be focused on specific locations to find potential problem areas. Both large-scale trends and specific details were detected. The provided statistics were extracted from an existing industrial NDE process, taking practical steps towards NDE 4.0.