1 Introduction

Wire arc additive manufacturing (WAAM) is a directed energy deposition (DED) process suitable for the fabrication of large metal components due to its high deposition rate and relatively low manufacturing cost [1,2,3]. WAAM is becoming a popular alternative to traditional manufacturing processes for materials such as the titanium alloys used in aerospace [4, 5] and the nickel aluminium bronzes used in marine applications [6]. In the research domain, work has been published across a number of different aspects of the technology, such as process planning, material property improvements and process control. In pursuit of a fully automated WAAM system, Ding et al. [7] proposed a systematic strategy featuring a series of algorithms, including 2D path planning [8], bead modelling [9] and multi-directional slicing algorithms [10]. The majority of materials research focuses on material properties for various metals such as titanium alloys [11], aluminium alloys [12], steels [13] and nickel-based alloys [14]. Comprehensive reviews of material characteristics are provided by Wu et al. [15], Ngo et al. [16] and Liu et al. [17]. Other research efforts investigate methods to improve geometric accuracy [18,19,20] and surface roughness [21] and to mitigate internal stress build-up [22] in deposited WAAM components. Nevertheless, many challenges remain to be addressed due to the overall complexity of the multifaceted WAAM process [16], such as defect detection, manufacturing process optimization and automated manufacturing. Recently, several studies have addressed these research areas: Le et al. [23] optimized WAAM process parameters, achieving mechanical properties similar to those of wrought 308L stainless steel; Ramalho et al. [24] characterized different contaminations via their acoustic spectra, enabling quality monitoring; and Lopez et al. [25] applied phased array ultrasonic testing to detect 2- to 5-mm defects in aluminium WAAM components.

Interior defect monitoring is an essential component in ensuring the quality of deposited WAAM components. Although non-destructive methods such as X-ray inspection [26, 27], computed tomography [28], ultrasonic testing [29] and eddy current testing [30] have been developed to detect interior defects, high equipment costs, along with complex detection and analysis processes, restrict their utility. Visible-light imaging provides explicit photographs interpretable by the naked eye while requiring lower device costs than thermal or infrared cameras. Inspection of visible images also provides clues to the deposition quality of a given layer, since many internal defects also exhibit visual characteristics at the surface interface. For example, interior voids and lack-of-fusion defects within the component (two of the most commonly encountered defects associated with WAAM) are often related to the surface quality (smoothness, voids and small areas of lack of fusion) of the previous layer. These types of defects limit the commercial applications of WAAM, particularly for critical components in the naval and aerospace sectors. It is therefore necessary to develop a reliable and intelligent monitoring system for the WAAM process to predict potential anomalies during manufacturing. Such systems will be necessary for the eventual qualification of WAAM components for use in their relevant industries.

In the last decade, deep convolutional neural network (CNN) algorithms have pushed to the forefront for image classification and object detection tasks due to their accuracy and speed in extracting static 2D image features. In most scenarios, object detection can be considered more useful than image classification, though it is a more challenging task to compute. Computer vision research communities have invested significant effort into the development of these methods, and they have improved rapidly as a result. Table 1 compares common object detectors from recent years. Object detectors such as region CNN (R-CNN) [31], single shot detector (SSD) [34], YOLO [36], and their derivatives Faster R-CNN [33], Cascade R-CNN [39], deconvolutional single shot detector (DSSD) [35] and YOLOv3 [38] have all demonstrated high-accuracy object detection capabilities. Researchers in the welding domain have begun applying various algorithms and object detectors to assist technicians in localization tasks and the identification of weld defects. Kim et al. [40] applied extraction and noise removal algorithms to identify locations of occluded weld seams from arbitrary directions. Zou et al. [41] proposed a mixed object detector combining a sequence multi-feature combination network (SMFCN) and a recurrent neural network (RNN) to detect continuous weld seams in images featuring a significant amount of noise. A simplified YOLOv3 model combined with image augmentation was selected for that task after evaluating its performance against Faster R-CNN, YOLOv2 and YOLOv3 [42]. Another interesting application of YOLOv3 in the welding domain is presented by Dai et al. [43], where the positions and quality of spot welds on an automobile body were accurately detected. The lightweight network MobileNetV3 replaced the backbone of YOLOv3, achieving a fast detection speed.
Based on previous research, popular object detectors have attained reliable detection results and fast detection speeds in both weld bead localization and defect detection over the last several years. Nevertheless, it can be argued that training a defect detector provides more utility than a position detector, since defect boundary boxes also indicate the location of weld seams [43, 44]. A good defect detector can thus be used to identify the locations of weld seams with some straightforward optimizations.

Table 1 Comparisons of different object detectors

The limited size of available training datasets is a major constraint on the development of intelligent systems for WAAM. Specifically, the large number of images required to train a deep neural network model is not obtainable in real-world manufacturing scenarios due to incomplete sensor setups and complicated manufacturing environments. To address these challenges, small datasets are widely applied in welding science. Feng et al. [45] used 487 data points to train a deep neural network (DNN) to predict solidification defects, demonstrating the feasibility of training a DNN model to predict defects with a small dataset. A small X-ray image dataset for weld defect diagnosis achieved an F1-score of more than 90% with a deep CNN [46]. Therefore, models trained on small datasets are good choices when large datasets are not available in real WAAM manufacturing.

This paper introduces an intelligent model based on the YOLOv3 algorithm. The goal is to utilize a small dataset to identify surface anomalies which occur during the WAAM process. Following this introduction, the paper will introduce the methodology, related important parameters, and algorithms of the YOLOv3. Then three WAAM components featuring a series of defects are fabricated to provide datasets for the training of the model. To test the model, three additional components featuring fewer defects are fabricated and then analysed by the model. To improve the performance of the defect prediction model, some alterations in anchor settings are attempted in order to develop a robustly trained model. The model with the best performance is identified and discussed in the result section, followed by conclusions and a conceptual discussion on future work.

2 Methodology

In the WAAM process, a fabricated component's quality is dependent on the quality of each individual layer. Deposited weld beads are a product of carefully selected process parameters and task-based parameters such as the infill pattern used and the overlapping distance between weld beads. With recent developments in computer vision, a number of object detectors have demonstrated capabilities which are likely applicable to the WAAM process. In this section, a brief introduction to the classic object detectors is first provided. Then, as a representative of the newer variety of fast and accurate object detectors, YOLOv3 is discussed in the context of defect detection in the WAAM process.

2.1 Related work

R-CNN [31] is a typical two-stage region-based convolution network detector. R-CNN uses a selective search to extract around 2000 region proposals. Each region proposal is warped into a fixed pixel size (227 × 227) to be compatible with the CNN architecture, which then extracts features for each region proposal. These features are fed into a linear support vector machine (SVM) to identify a specific class, followed by a linear regression model to predict and modify detection windows. R-CNN achieves excellent object detection accuracy based on region proposal. However, each region proposal’s features need to be extracted by CNN and saved to disk, resulting in slow detection speed and expensive training cost.

Fast R-CNN [32] was developed to address these drawbacks by sharing computation across proposals. Fast R-CNN also makes use of selective search for region proposals; however, it feeds the whole image and a set of object proposals into the CNN to obtain a convolutional feature map. Then, a pooling layer processes each proposal to extract a corresponding fixed-length feature vector from the feature map. The fixed-size feature map is transformed into a feature vector and finally branched into two output layers to produce probability estimates and boundary box positions. These changes share the convolution result from the CNN and avoid repeated computation for region proposals, attaining a speed approximately 25 times faster than that of R-CNN. Faster R-CNN [33] focuses on optimizing the method of selecting region proposals. Both R-CNN and Fast R-CNN determine region proposals by selective search, while Faster R-CNN trains a neural network, termed the region proposal network (RPN), to find all region proposals. Firstly, Faster R-CNN sends input images to the CNN and obtains feature maps. Then, a pre-trained RPN extracts proposals from the feature maps and unites the proposal information with the feature maps to obtain useful feature information. Linear classifiers and linear regression are then used to predict the presence of objects. Based on Faster R-CNN, many region-based algorithms have been developed to improve the performance and speed of object detection, for instance, Libra R-CNN [47] and Cascade R-CNN [39].

The YOLO series are powerful one-stage object detection algorithms which attempt to balance accuracy and speed. YOLO [36], first published in 2016, divides a 224 × 224 image into a set of 7 × 7 grid cells and then uses a single convolutional network to output a 7 × 7 × 30 tensor. This tensor comprises the details of two boundary boxes and the probabilities for different object classes (20 classes in the PASCAL VOC dataset). Box details include the x, y coordinates and w, h. The x and y coordinates represent the centre of the box relative to the bounds of the grid cell, while w and h denote the box dimensions relative to the dimensions of the whole image. Finally, YOLO predicts whether an object exists in each individual grid cell. The limitations of YOLO are well known: larger localization error compared with Fast R-CNN, lower recall compared with two-stage algorithms and difficulty in detecting small objects. YOLOv2 [37] was developed to address these issues via (amongst other things) batch normalization, a high-resolution classifier and anchor box convolution. In addition, YOLOv2 proposed a new network architecture called Darknet-19, which has 19 convolutional layers and five max-pooling layers. Darknet-19 achieved 91.2% accuracy on ImageNet, higher than the 88% achieved by the original YOLO backbone network. The output tensor of YOLOv2 has a size of 13 × 13 × 125 (five predicted boundary boxes, each with four box parameters, one confidence score and 20 class probabilities). YOLOv3 [38] further improved prediction performance and small-object detection capabilities. The backbone of YOLOv3 was changed to Darknet-53, which features 53 convolution layers and 23 residual blocks. Another innovation of YOLOv3 is the use of three output tensors at different scales (13 × 13, 26 × 26 and 52 × 52) to predict objects of different sizes, with three anchor boxes for each scaled output tensor.
YOLOv3 has significant advantages in both accuracy and speed when the intersection over union (IOU, see Sect. 2.2.4) threshold is set to 0.5.

Figure 1 displays a performance comparison between popular object detectors. YOLOv3 has a clear advantage in detection speed while maintaining detection accuracy, which is an essential feature for online real-time detection. Thus, in this paper, we selected YOLOv3 for object detection tasks.

Fig. 1
figure 1

Comparison of the performance of popular object detectors

2.2 YOLOv3 algorithm

2.2.1 Backbone

Different from its two predecessors (YOLO and YOLOv2), YOLOv3 implements a new network architecture for feature extraction. The backbone of YOLOv3, Darknet-53, is a hybrid architecture composed of Darknet-19 [37] and residual blocks [48]. In Darknet-53, 53 convolutional layers and 23 residual blocks reduce the input image (416 × 416) to three feature maps: 13 × 13, 26 × 26 and 52 × 52. YOLOv3 then uses a feature pyramid network to merge the smaller feature maps with the larger ones. Specifically, the 13 × 13 feature map produced by Darknet-53 is passed through five convolutional layers, upsampled and concatenated with the 26 × 26 feature map. A 26 × 26 × 255 output, including all class information and boundary box information, is obtained through a further series of convolution layers. In a similar fashion, the 26 × 26 feature map is merged with the 52 × 52 feature map. Through these merging operations, three outputs of different sizes (13 × 13 × 255, 26 × 26 × 255 and 52 × 52 × 255) are obtained. Figure 2 provides an overview of the complete YOLOv3 architecture, in which Darknet-53 serves as the backbone for extracting image feature maps.

Fig. 2
figure 2

Architecture of YOLOv3
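The shape relationships above can be checked with a short sketch (our own illustration, not the authors' code): each output grid is the input size divided by the stride of its branch (32, 16 or 8), and each cell carries B × (5 + C) channels.

```python
def yolo_output_shapes(input_size=416, num_classes=80, boxes_per_cell=3):
    """Return the three (S, S, B*(5+C)) YOLOv3 output tensor shapes.

    Each of the B boxes per cell carries 4 box parameters, 1 confidence
    score and C class probabilities, hence 255 channels for C = 80.
    """
    channels = boxes_per_cell * (5 + num_classes)
    return [(input_size // stride, input_size // stride, channels)
            for stride in (32, 16, 8)]
```

For the COCO setting (80 classes) this reproduces the 13 × 13 × 255, 26 × 26 × 255 and 52 × 52 × 255 shapes quoted in the text; for the two-class welding dataset used later, the channel count would shrink accordingly.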

2.2.2 Loss function

YOLOv3 uses mean-squared error and binary cross-entropy to evaluate errors between targets and outputs. Error values for each boundary box width and height use mean-squared error, while other errors including the (x, y) coordinates of the bounding box, confidence scores for objects and non-objects and the probability on class identification all use binary cross-entropy error [38]. In this paper, in order to facilitate algorithm optimization, we use mean-squared error for box information (x, y, and w, h) and binary cross-entropy error for other parameters. Equation (1) is the loss function used in our training:

$$\begin{aligned}loss & = {\lambda }_{\text{coord}}{\sum}_{i=0}^{{S}^{2}} {\sum}_{j=0}^{B} {1}_{ij}^{\text{obj}}\left[{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}+{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right]\\ & + {\lambda }_{\text{coord}}{\sum}_{i=0}^{{S}^{2}} {\sum}_{j=0}^{B} {1}_{ij}^{\text{obj}}\left[{\left({w}_{i}-{\widehat{w}}_{i}\right)}^{2}+{\left({h}_{i}-{\widehat{h}}_{i}\right)}^{2}\right]\\ & - {\sum}_{i=0}^{{S}^{2}} {\sum}_{j=0}^{B} {1}_{ij}^{\text{obj}}\left[{\widehat{C}}_{i}\cdot \mathrm{log}\,{C}_{i}+\left(1-{\widehat{C}}_{i}\right)\cdot \mathrm{log}\left(1-{C}_{i}\right)\right]\\ & - {\lambda }_{\text{noobj}}{\sum}_{i=0}^{{S}^{2}} {\sum}_{j=0}^{B} {1}_{ij}^{\text{noobj}}\left[{\widehat{C}}_{i}\cdot \mathrm{log}\,{C}_{i}+\left(1-{\widehat{C}}_{i}\right)\cdot \mathrm{log}\left(1-{C}_{i}\right)\right]\\ & - {\sum}_{i=0}^{{S}^{2}} {1}_{i}^{\text{obj}}{\sum}_{c\in \text{classes}} \left[{\widehat{p}}_{i}\left(c\right)\cdot \mathrm{log}\,{p}_{i}\left(c\right)+\left(1-{\widehat{p}}_{i}\left(c\right)\right)\cdot \mathrm{log}\left(1-{p}_{i}\left(c\right)\right)\right]\end{aligned}$$
(1)

The first two terms denote the localization loss, terms 3 and 4 represent the confidence losses, and the last term is the classification loss. \({\lambda }_{\text{coord}}\) and \({\lambda }_{\text{noobj}}\) are constants weighting the loss terms; we set \({\lambda }_{\text{coord}} = 1\) and \({\lambda }_{\text{noobj}} = 100\) in this paper. In YOLOv3, every grid cell receives three prediction boxes based on the sizes of the anchor boxes, so \(B\) is three. \(S\) is the number of grid cells along each side, i.e. the feature map's shape (for the 13 × 13 feature map, \(S\) is 13). \({1}_{ij}^{\text{obj}}\) and \({1}_{ij}^{\text{noobj}}\) indicate whether a grid cell is responsible for an object: \({1}_{ij}^{\text{obj}}\) equals 1 if cell (\(i\), \(j\)) contains and is responsible for the object, and 0 otherwise, while \({1}_{ij}^{\text{noobj}}\) is 1 if no object appears in grid cell (\(i\), \(j\)) or the cell is not responsible for the object. The parameters \({x}_{i}\), \({y}_{i}\), \({w}_{i}\), \({h}_{i}\), \({C}_{i}\) and \({p}_{i}\left(c\right)\) correspond to the x, y coordinates, width and height of the predicted boundary box, the prediction confidence and the class probabilities. \({\widehat{x}}_{i}\), \({\widehat{y}}_{i}\), \({\widehat{w}}_{i}\) and \({\widehat{h}}_{i}\) are the ground truth of the labelled boxes, and \({\widehat{C}}_{i}\) and \({\widehat{p}}_{i}\left(c\right)\) are the confidence of the cell's responsibility (which is always 1) and the ground-truth class probability. This loss function combines all losses into a single equation, simplifying the individual optimization processes associated with separate loss functions.
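As a rough numerical sketch of Eq. (1) (our own illustration under simplifying assumptions, not the authors' implementation; in particular, the classification term is applied per responsible box rather than per cell, and all names are ours):

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=1.0, lambda_noobj=100.0):
    """pred/target: (S, S, B, 5 + C) arrays laid out as [x, y, w, h, conf, classes...].
    obj_mask: (S, S, B) booleans marking the box responsible for an object."""
    eps = 1e-9
    noobj_mask = ~obj_mask

    def bce(p, t):
        # binary cross-entropy, elementwise
        return -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    # Localization loss (mean-squared error), responsible boxes only
    loc = ((pred[..., :4] - target[..., :4]) ** 2).sum(-1)
    loss = lambda_coord * loc[obj_mask].sum()
    # Confidence loss (binary cross-entropy), up-weighted for non-object boxes
    conf = bce(pred[..., 4], target[..., 4])
    loss += conf[obj_mask].sum() + lambda_noobj * conf[noobj_mask].sum()
    # Classification loss (binary cross-entropy), responsible boxes only
    cls = bce(pred[..., 5:], target[..., 5:]).sum(-1)
    loss += cls[obj_mask].sum()
    return loss
```

A perfect prediction drives all three terms to (numerically) zero, while any deviation in box geometry, confidence or class probability raises the corresponding term.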

2.2.3 Anchors

Anchor boxes [33] provide prior knowledge derived from the ground-truth boxes of the training dataset. Their dimensions are deduced from the truth boxes through a K-means algorithm; however, instead of Euclidean distance, the intersection over union (IOU) between boundary boxes and centroids is used as the optimization criterion. With this prior knowledge, the trained model predicts the sizes of boundary boxes, improving the model's training convergence rate. Pre-setting anchor boxes also addresses the issue of multi-scaled objects. In YOLOv3, nine anchor boxes of different dimensions are pre-set for training and sorted into three groups by size. For instance, the three largest anchor boxes are used as the prior knowledge to predict three boxes for every grid cell in the 13 × 13 feature map. The predicted boundary box having the largest IOU with the ground-truth box is regarded as the output boundary box. Similar predictions and calculations apply to the 26 × 26 and 52 × 52 feature maps, which are designed to detect medium- and small-sized objects, respectively. The nine anchors of YOLOv3 on the COCO dataset [49] are (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198) and (373 × 326).
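The anchor clustering step can be sketched as follows (a minimal illustration of K-means with 1 − IOU as the distance; function names are ours, and the IOU assumes boxes share a common centre, as in the original anchor derivation [37]):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming all boxes share the same centre."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) boxes; maximizing IOU to a centroid is
    equivalent to minimizing the 1 - IOU distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # sort by area, mirroring YOLOv3's small-to-large anchor ordering
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]
```

Run on a labelled training set with k = 9, this yields the kind of dataset-specific anchor list modified in Sect. 3.3.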

2.2.4 Boundary box

Through the pre-setting of anchor boxes, every prediction box acquires an appropriate size from the prior knowledge of the anchor boxes. YOLOv3 links the anchor box sizes and prediction box sizes via Eqs. (2) and (3):

$${b}_{w}={A}_{w}{e}^{{t}_{w}}$$
(2)
$${b}_{h}={A}_{h}{e}^{{t}_{h}}$$
(3)

where \({b}_{w}\) and \({b}_{h}\) are the width and height of the final rescaled boundary box, \({t}_{w}\) and \({t}_{h}\) are the original width and height predictions, and \({A}_{w}\) and \({A}_{h}\) are the scaled anchor box dimensions. The parameters \({A}_{w}\) and \({A}_{h}\) are scaled in proportion to the ratio of the feature map (1/32, 1/16 and 1/8 for the 13 × 13, 26 × 26 and 52 × 52 feature maps, respectively). As shown in Fig. 3, the final prediction box's dimensions are tied to the anchor box's sizes and scaled to the object area. The centre of the boundary box is calculated by Eqs. (4) and (5):

Fig. 3
figure 3

Illustration of boundary box parameters

$${b}_{x}=\sigma \left({t}_{x}\right)+{c}_{x}$$
(4)
$${b}_{y}=\sigma \left({t}_{y}\right)+{c}_{y}$$
(5)

where (\({t}_{x}\), \({t}_{y}\)) is the original prediction of the boundary box centre coordinates, and (\({c}_{x}\), \({c}_{y}\)) is the top-left corner coordinate of the grid cell. \({b}_{x}\) and \({b}_{y}\) are the adjusted coordinates of the boundary box centre. For the red point displayed in Fig. 3, the coordinate of the grid cell is (\({c}_{x}\), \({c}_{y}\)) = (3, 2). \(\sigma\) is the sigmoid function, which limits the predicted offsets to the range (0, 1), guaranteeing that the centre of the boundary box stays within its grid cell.
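Eqs. (2) to (5) can be combined into a single decoding step. The sketch below (our own illustration with hypothetical names) works in feature-map (grid) units; the anchor is assumed to be pre-scaled by the feature-map ratio as described above.

```python
import math

def decode_box(t, cell, anchor):
    """Decode raw predictions into a boundary box, per Eqs. (2)-(5).

    t      = (tx, ty, tw, th) raw network predictions
    cell   = (cx, cy) top-left corner of the grid cell
    anchor = (Aw, Ah) anchor dimensions in feature-map units
    """
    tx, ty, tw, th = t
    cx, cy = cell
    aw, ah = anchor
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sig(tx) + cx        # Eq. (4): centre stays inside cell (cx, cy)
    by = sig(ty) + cy        # Eq. (5)
    bw = aw * math.exp(tw)   # Eq. (2)
    bh = ah * math.exp(th)   # Eq. (3)
    return bx, by, bw, bh
```

Multiplying the decoded values by the feature-map stride (32, 16 or 8) would recover pixel coordinates on the 416 × 416 input image.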

As described in Sect. 2.2.3, every grid cell in a feature map receives three boundary boxes of different sizes. After the calculations in Eqs. (2) to (5), YOLOv3 outputs four coordinate parameters (\({b}_{x}\), \({b}_{y}\), \({b}_{w}\), \({b}_{h}\)) and one confidence parameter for each boundary box. To select the best and most reliable boundary box, two criteria are applied:

  1. Confidence threshold: the confidence score indicates whether an object exists within a boundary box. Boundary boxes with a confidence lower than the pre-set threshold are discarded.

  2. Non-maximum suppression (NMS) threshold: redundant boundary boxes are removed based on the IOU (Eq. (6)). Boundary boxes whose IOU with a higher-confidence box exceeds the NMS threshold are removed, avoiding multiple boundary boxes indicating the same object.

    $$\mathrm{IOU}=\frac{intersection\;area\;of\;prediction\;box\;and\;ground\;truth\;box}{union\;area\;of\;prediction\;box\;and\;ground\;truth\;box}$$
    (6)
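The two selection criteria and Eq. (6) can be sketched as follows (a minimal illustration with hypothetical names; boxes are given as corner coordinates, and the thresholds are typical defaults rather than the values used in this study):

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2), per Eq. (6)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_boxes(boxes, scores, conf_thresh=0.5, nms_thresh=0.45):
    """Criterion 1: drop low-confidence boxes.
    Criterion 2: greedy NMS keeps the highest-confidence box and removes
    any remaining box whose IOU with a kept box exceeds the threshold."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping boxes on the same defect thus collapse to the single higher-confidence detection, while well-separated boxes all survive.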

3 Experimental result of YOLOv3 on WAAM dataset

3.1 Experimental set-up

A conventional WAAM system is used to deposit three components for training and then three additional components for testing. The set-up of the WAAM system is shown in Fig. 4. It consists of a control PC (Fig. 4a), an ABB robotic system (Fig. 4b, d), a Fronius welder (Fig. 4c, e), a weld process current and voltage collection system (Fig. 4g), and a National Instruments USB-6008 board (Fig. 4h).

Fig. 4
figure 4

The set-up of the WAAM system in this experiment. (a) PC controller. (b), (d) ABB robot system: IRC5 controller and IRB2600 robot. (c), (e) Fronius welder and weld torch. (f) Actual workpiece. (g) Self-built high-frequency current and voltage collection box. (h) NI USB-6008 board

The control PC has software that generates code files to control robot movements and synchronize welding parameter settings. The shielding gas consists of 100% Argon. BOC 4043 aluminium MIG wire is used in these tests. Synergic welding control generates weld parameter sets, meaning that the welder automatically determines current and voltage values through pre-stored programs in the Fronius weld controller. In order to get enough defects for the training dataset, the Fronius proprietary cold metal transfer (CMT) welding process is used to fabricate the training components because CMT mode provides lower heat input than the CMT-pulse mode at the same wire feed speed setting. The CMT-pulse process is used in the test components and case study to demonstrate realistic manufacturing scenarios.

3.2 Dataset construction

Three components of different shapes (square, hexagon and star), featuring a variety of geometric features, were designed for the training experiments so as to generate potential defects when processing right-angled, obtuse-angled and acutely angled features. Figure 5 provides an illustration of the required welding paths. During the manufacturing process, neighbouring contours were welded in clockwise and anticlockwise directions, as displayed in Fig. 5. Figure 5 also highlights the positions where defects are likely to appear. The primary defects arising at these positions are surface voids, because the distance between two vertices is too large to be covered by the width of the welding beads. Lack of fusion between weld beads is another defect encountered in the WAAM process, often generated when the overlap distance between weld paths is too wide.

Fig. 5
figure 5

Path planning for three components of different shapes: square, hexagon and star

Figure 6 shows the WAAM components after fabrication. During the manufacturing process, after the deposition of each layer, images of the deposited surface are taken with a camera mounted on the welding robot. Three components, each consisting of 30 deposited layers, were built (Fig. 6a–c), and images of each deposited layer were captured and cropped for the training dataset. In total, 51 images are designated for the training dataset, whilst the remaining 17 images are used for the validation dataset. Since each image contains a different number of anomalies, 1646 anomaly boxes are included in the training dataset and 602 anomaly boxes in the validation dataset. The spacing between welding beads was purposefully widened to increase the number of defects in the training components. In contrast, appropriate wire feed speeds and travel speeds were used for the fabrication of the components in the test dataset (Fig. 6d–f). Each component in the test dataset consists of 12 welded layers, each with several accompanying images; in total, 146 images are stored in the test dataset. The welding parameters, wire feed speed (WFS) and travel speed (TS), for each test component are shown in Table 2. Furthermore, each image includes two types of labels: component and anomaly. In general, each image features only one component label.

Fig. 6
figure 6

The fabricated welding components. (a)–(c) and (d)–(f) are the square, hexagon and star components, respectively

Table 2 Components for test dataset’s welding parameters

3.3 Anchors

Anchors provide prior knowledge about the dimensions of objects. Redmon and Farhadi [38] determined that nine different anchor box dimensions obtain the best performance on the COCO dataset with its 80 object classes. However, our welding dataset contains only two classes (anomaly and component). Furthermore, the dimensions of the anomaly and component boxes are not similar to those of the objects in the COCO dataset. Thus, new anchor settings are required for our dataset.

The K-means algorithm is used to discover clusters, along with size information, from the ground-truth boundary boxes. The IOU between the ground-truth boxes and the centroids is used to optimize the cluster centroids. Similar to the anchor sets for the COCO dataset, we keep nine anchors for the training model.

We modified the original COCO anchor settings. As the relative size of an anomaly inside an image is small, only the small-object anchors and the largest anchor (used for component detection) are modified, as displayed in Table 3.

Table 3 The modified anchor box settings

As shown in Table 3, “1-anchor-change” indicates that only the largest anchor, used for component prediction, was changed. 2-anchor-change to 4-anchor-change progressively modify the small-object anchors to find the best anchor setting for our dataset. Finally, 9-anchor-change means that all the ground-truth boundary boxes used during training are clustered into nine classes which completely replace the original anchor settings. To describe the modifications succinctly, we name the six different anchor settings Group ‘A’ to ‘F’, as outlined in Table 3. It is worth mentioning that, to determine the modified anchor box sizes, the boxes in the training dataset are first clustered into nine clusters. The anchors excluding the largest one are then re-clustered into one to three clusters, respectively, to obtain the changed anchor data for Groups C to E.

3.4 Training

The training process is performed in two steps: Firstly, models are trained with only a small number of epochs and the best optimizer is then selected; secondly, the training epoch number is increased so as to compare the models obtained with different anchor settings. The model with the best performance is then selected.

We compared model performance across different optimizers, including Adadelta [50], RMSprop [51], SGD [52] and Adam [53]. To select the optimizer with the most rapid convergence, a large learning rate and a small epoch number are set in advance, namely 1e-3 and 64, respectively. The best optimizer is then trained with a series of small learning rates: several random learning rates between 1e-5 and 1e-3 are compared. With the best optimizer and learning rate, we increase the iteration number to 128 and compare the performance of the different anchor settings. Lastly, we identify the best model by comparing performance on the test dataset. Mean average precision (mAP), calculated by averaging the average precision (AP) over all classes, is used to evaluate the performance of each model. In this study, both the component and anomaly classes are included. It is defined as:

$$\mathrm{mAP}=\frac{\Sigma\;AP\;for\;all\;classes}{number\;of\;classes}$$
(8)
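As an illustration of how Eq. (8) is evaluated in practice (our own sketch; a common all-point interpolation of the precision-recall curve is assumed here, as the interpolation scheme is not specified above):

```python
import numpy as np

def average_precision(recall, precision):
    """AP for one class: area under the interpolated precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing, right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Eq. (8): mAP is the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

With the two classes in this study, mAP is simply the mean of the AP values for the component and anomaly classes.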

4 Results

4.1 Training results

To find the best optimizer for our training dataset, the number of epochs is first set to a small value (128), while the learning rate is set to a large value (1e-3). The optimizers tested are the aforementioned Adadelta, RMSprop, SGD and Adam methods. The training processes are shown in Fig. 7. SGD performs poorly under these hyperparameter settings, with poor precision and recall on our dataset. Adadelta outperforms RMSprop, although both show unstable precision and recall during training. Figure 8 shows the model performance on the validation dataset after each epoch. Based on Fig. 8, the Adam optimizer generally performs better than the other three optimizers; therefore, we select Adam as the optimizer in the subsequent experiments. After further tuning of the learning rate, we found that the model performs best on the validation dataset with a learning rate of 1.26e-4.

Fig. 7
figure 7

Training performances: (a) total loss; (b) precision; (c) recall with 0.5 IOU; (d) recall with 0.75 IOU on different optimizers: Adadelta, RMSprop, SGD and Adam

Fig. 8
figure 8

Validation performances: (a) F1 score; (b) mAP; (c) precision with 0.5 IOU; (d) recall with 0.5 IOU on different optimizers: Adadelta, RMSprop, SGD, and Adam

As shown in Fig. 8, the results on the validation dataset improve with increasing epoch number, which suggests that training beyond 128 epochs will further improve the prediction model. For the subsequent model comparisons, the number of epochs is set to 500 to achieve reliable results. To determine the best anchor setting, we designed a series of experiments and trained models to identify the best one for anomaly detection.

Firstly, a model with all nine anchors replaced (Group F, as described in Table 3) is trained. For comparison, a model with the original anchor setting (Group A in Table 3) is also trained. The performance on the validation dataset is shown in Fig. 9. The performance of Group A is consistently better than that of Group F, especially on the F1 score and mAP criteria. This is mainly because the feature maps from YOLOv3 have three different sizes (13 × 13, 26 × 26 and 52 × 52), designed to identify objects of various sizes, whereas the anomaly boundary boxes are all of similar size, falling within the small-object range. Therefore, if all anchors are replaced with small-object sizes, the YOLOv3 feature maps become confused when detecting medium- and large-sized objects, performing poorly on our training dataset. Furthermore, the erratic precision of Group F indicates that the model is neither stable nor reliable for defect prediction. Thus, the anchor setting of Group F cannot be used to detect anomalies in our dataset.

Fig. 9
figure 9

The model performances with anchor setting on Group A and Group F: (a) F1 score; (b) mAP; (c) precision with 0.5 IOU; (d) recall with 0.5 IOU

Secondly, the performance of Group A is compared with that of Groups B, C, D and E. After 500 epochs of training, as shown in Fig. 10, Group E does not perform as well as Group A, while Groups B, C and D all outperform Group A. The comparison of Groups A, E and F implies that too many anchor changes reduce the performance and stability of the model. Although Groups B, C and D all outperform Group A (based on F1 score and mAP), Group C predicts anomalies more accurately than the other two groups within 500 epochs. On the other hand, Groups C and D are not as stable as Group B, as evident fluctuations for Groups C and D are visible in Fig. 10. Due to the fluctuation in the performance of Group D, it is hard to compare its performance with that of Group B.

Fig. 10
figure 10

Model performances on validation dataset of Groups A, B, C, D, E on (a) F1 score; (b) mAP; (c) precision with 0.5 IOU; (d) recall with 0.5 IOU

In conclusion, the model with Group B's anchor setting demonstrates the most stable performance, whereas the anchor settings of Groups C and D show obvious fluctuations. However, within 500 epochs, the performance of Group C is the best compared with Groups B and D. In addition, as shown in Fig. 10, with the epoch number set to 500, the model's performance still improves as the epoch number increases, especially on the F1 score and mAP criteria. Groups B and D show a clearer tendency to keep improving toward the end of the 500 epochs, while Group C converges faster.
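For reference, the evaluation criteria used throughout Figs. 8–10 can be made concrete: a prediction counts as a true positive when its IoU with a ground-truth box reaches the threshold (0.5 in most panels), and the F1 score combines precision and recall. A minimal sketch with hypothetical boxes, not values from the paper:

```python
# Sketch of the evaluation criteria: IoU between axis-aligned boxes in
# (x1, y1, x2, y2) format, and F1 from precision and recall. The example
# boxes at the bottom are hypothetical.

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # zero if no overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1(precision, recall):
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

# Two 10x10 boxes shifted by half a width overlap by 50 of 150 units,
# so IoU = 1/3 -- below the 0.5 threshold, hence not a true positive.
example = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

The 0.5 and 0.75 IoU thresholds in the figure captions simply tighten or loosen this matching criterion before precision and recall are counted.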

As discussed earlier, the best model performance cannot be obtained within 500 epochs, especially for Groups B and D. Thus, the epoch number is increased to 1000 for training. The models with anchor settings of Groups B, C and D are then compared directly on images in the test dataset to determine the best anchor setting. The performance of the models with anchor settings of Groups A–D is displayed and compared in the next section.

4.2 Model performance on test images

With the best learning rate identified above (1.26e-4), the four best models with anchor settings of Groups A, B, C and D within 1000 epochs are selected and compared on the test dataset. To find suitable intervals for the confidence threshold and the NMS threshold, performance on the validation dataset is compared in advance. Preliminary values for the confidence threshold vary from 0.3 to 1, and any box with confidence smaller than 0.3 is regarded as a negative sample. In our comparison, high confidence thresholds (larger than 0.7) fail to detect the existence of anomalies. On the other hand, an overly large NMS threshold keeps boundary boxes with high intersection areas, resulting in redundant boundary boxes in the predictions. Therefore, in this paper, the confidence threshold is set between 0.4 and 0.7, and the NMS threshold is set to 0.05.
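The post-processing described above, discarding low-confidence boxes and then suppressing overlapping ones, can be sketched as follows. This is an illustrative sketch: the box values are hypothetical, and the IoU helper follows the standard definition rather than any particular YOLOv3 implementation.

```python
# Sketch: keep boxes above the confidence threshold, then apply greedy NMS
# so that any box overlapping a higher-confidence kept box (IoU > threshold)
# is removed. Box format: (x1, y1, x2, y2, confidence); values are made up.

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_boxes(boxes, conf_thresh=0.4, nms_thresh=0.05):
    boxes = [b for b in boxes if b[4] >= conf_thresh]  # confidence filter
    boxes.sort(key=lambda b: b[4], reverse=True)       # highest confidence first
    kept = []
    for b in boxes:
        if all(iou(b, k) <= nms_thresh for k in kept):
            kept.append(b)
    return kept

preds = [(0, 0, 10, 10, 0.9),    # kept: highest confidence
         (1, 1, 11, 11, 0.8),    # suppressed: overlaps the first box
         (50, 50, 60, 60, 0.7),  # kept: no overlap with kept boxes
         (0, 0, 5, 5, 0.2)]      # dropped: below the confidence threshold
kept = filter_boxes(preds)
```

With a very low NMS threshold such as 0.05, almost any overlap between two boxes causes the lower-confidence one to be removed, which is why the paper's setting eliminates nearly all redundant boxes.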

Table 4 reports the best models' performance when the confidence thresholds and the NMS threshold are set within the determined intervals. An NMS threshold of 0.05 means that almost all redundant boundary boxes with overlapping areas are removed by the NMS algorithm. The original anchor setting (Group A) does not perform as well as the other three groups, which is consistent with the results shown in Fig. 10. Based on the results in Table 4, Groups B, C and D improve the mAP by at least 5% with respect to the original anchor setting. With the increase in the number of iterations (after 500 epochs), Groups B and D both achieve better results than Group C. Consistent with the previous discussion, Group C converges quickly and therefore improves little with additional epochs, while Groups B and D improve their predictions on both component and anomaly areas with further training. Compared with Groups B and C, Group D performs better on both component and anomaly identification. Group D shows a 5.8% improvement in component identification over the original anchor setting (Group A) and an 8.9% improvement in anomaly prediction when the confidence threshold is set to 0.4. Group D's mAP improves by about 7.5% over Group A at thresholds of 0.4 and 0.5, and by about 2% relative to Groups B and C (Table 4).

Table 4 The performance of the four best models (at epochs 988, 795, 942 and 947, respectively) on the validation dataset

The four best models are evaluated on the test dataset. Figure 11 displays the results of the best models of Groups A–D at a 0.5 confidence threshold, with some obvious undetected anomalies highlighted. The best model with the original anchor setting (Group A) performs adequately on both anomaly and component detection. However, the accuracy of anomaly detection and the size of the boundary boxes are the main issues with the original anchor setting. Some anomalies are not detected in Fig. 11a–c, lowering the mAP, and because no prior knowledge of anchor size is used in this group, the anomaly boundary box sizes are inconsistent (as shown in Fig. 11a–c). The visual results are consistent with the evaluation data in Table 4: the models of Groups B, C and D identify component boxes in the right positions, and Group D's anomaly detection is the best, followed by Group B and then Group C. However, it is worth mentioning that the boundary boxes of Groups B and C (Fig. 11d–i) have acceptable dimensions for the component prediction, while the boundary boxes of Group D are too large for the components. An abrupt and significant drop (23.5%, from Table 4) in component identification accuracy occurs when the confidence threshold increases from 0.5 to 0.6, indicating that the use of larger box sizes is improper.

Fig. 11
figure 11

Best model’s performance on three test components images at 0.5 confidence threshold, respectively, (a)–(c): original anchor setting (Group A); (d)–(f): 1-anchor-change (Group B); (g)–(i): 2-anchor-change (Group C); (j)–(l): 3-anchor-change (Group D)

Consequently, the models of Groups B and C are the candidate models for future applications. Further identification results on the test dataset are shown in Fig. 12. For the model with only one anchor change, the prediction of anomalies is acceptable at a confidence threshold of 0.4 (Fig. 12a–c): most of the unfused parts are identified and marked within boundary boxes (around 53% accuracy). At a 0.5 confidence threshold (Fig. 12d–f), the model removes some low-confidence boundary boxes and consequently misses some real anomalies. Similar results for the 2-anchor-change model are shown in Fig. 12g–l, where the higher confidence threshold deletes some boxes containing true anomalies. Comparing the two models, the component boxes of the 2-anchor-change model are consistently larger than those of the 1-anchor-change model, even exceeding the image dimensions.

Fig. 12
figure 12

Performance comparisons of the 1-anchor-change and 2-anchor-change models: (a)–(c): 1-anchor-change at confidence threshold 0.4; (d)–(f): 1-anchor-change at confidence threshold 0.5; (g)–(i): 2-anchor-change at confidence threshold 0.4; (j)–(l): 2-anchor-change at confidence threshold 0.5

Therefore, the 1-anchor-change model with a 0.4 confidence threshold is ultimately selected as the best prediction model for anomaly and component identification. Based on the data in Table 4, it achieves 100% precision on component prediction and 53% precision on anomaly prediction, with an overall mAP of 76.5%.
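The mAP figure quoted above is the mean of the per-class average precision over the two classes (component and anomaly). A hedged sketch of the standard computation on hypothetical detections, not the paper's actual data:

```python
# Sketch: average precision (AP) per class as the area under the
# precision-recall curve, then mAP as the mean over classes.
# The detections below are hypothetical (confidence, is_true_positive) pairs.

def average_precision(scored, n_gt):
    # scored: list of (confidence, is_true_positive); n_gt: ground-truth count
    scored = sorted(scored, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for conf, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_gt
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap

aps = {
    "component": average_precision([(0.9, True), (0.8, True)], n_gt=2),
    "anomaly":   average_precision([(0.9, True), (0.6, False), (0.5, True)],
                                   n_gt=2),
}
m_ap = sum(aps.values()) / len(aps)
```

Real evaluations often interpolate the precision-recall curve (e.g. COCO-style 101-point interpolation) before integrating, but the rectangle rule above captures the idea.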

5 Case study

A component featuring right, obtuse and acute angles was fabricated to test the model performance in a case study replicating a real-world manufacturing scenario. The component, shown in Fig. 13, comprises eight layers, and the welding parameters were kept constant throughout the manufacturing process: a wire feed rate of 5.5 m/min and a robot travel speed of 0.5 m/min. The shielding gas used in this case study is 100% argon, and the welding process is Fronius' CMT-pulse. Figure 13 also provides illustrations of the component structure, the path-planning layout and the prediction results. As with the previously fabricated components, this component is manufactured in alternating clockwise and anticlockwise directions, as shown in Fig. 13a, b.

Fig. 13
figure 13

Illustrations of the component and the detection results: (a)–(b): Path planning and component dimensions. (c) Real photograph after eight-layer manufacture. (d)–(f): The detection results using the pre-trained best 1-anchor-change model

As discussed in Sect. 4, the 1-anchor-change model performs best for defect detection on our dataset and is therefore used to evaluate performance on different shapes at a 0.4 confidence threshold. Figure 13d–f shows the prediction results of the model. Although the shape of the component differs from those in our training dataset, most of the lack-of-fusion defects are detected accurately. The model is not flawless, however: in Fig. 13e, f it misidentifies component edges as lack-of-fusion defects on a number of occasions, and some real defects still go undetected. For our application, a model that occasionally predicts a normal part as defective is preferable to one that misses possible defects, which reduces the impact of these errors. For component detection, the model produces a number of inaccurate predictions: it only recognizes the component in Fig. 13d, indicating that the model can identify components of different shapes only with low accuracy. The main reason for this low accuracy in component identification is the small training dataset, especially for component samples (fewer than 1000).

6 Conclusion and future work

In this work, the YOLOv3 algorithm is applied to a WAAM process to detect (a) the position of the welded component and (b) the existence of welding defects on the welded surface. This paper investigates YOLOv3's performance with different anchor settings on a small training dataset, and the best-performing model on the validation dataset is selected for component position identification and defect detection. It was found that the original anchor setting for the COCO dataset is unsuitable for the WAAM defect dataset. Moderate changes to the anchor settings (one to two changes) bring prior knowledge of the component and anomaly sizes into the prediction models, improving the mAP by about 5%. The 1-anchor-change model achieves 100% precision on component identification and 53% precision on anomaly prediction. This demonstrates the feasibility of training a YOLOv3 model on a small dataset to identify defects in a WAAM process. Furthermore, the accurate identification of component locations provides a prerequisite for high-precision assessment of WAAM component quality.

The work presented in this paper forms an essential component of a wider intelligent monitoring system for WAAM. It provides image-based defect detection capabilities that can be integrated with other defect detection subsystems, such as the electrical signal monitoring system presented in [54]. Future work will focus on embedding the model within a functional graphical user interface (GUI), which will assist technicians in the practical task of detecting anomalies. An interactive GUI will also provide a more streamlined means of integrating these models into the wider defect detection monitoring system.