1 Introduction

Laser beam welding is one of the most modern processes in the manufacturing industry for joining metal materials. Solidification cracking is among the most serious faults in this process. In order to avoid unnecessary costs and ensure high quality, the long-term objective is to establish an intelligent computer-based control system capable of automatically detecting the formation of cracks, replacing manual detection, which is time-consuming and laborious. Furthermore, real-time crack detection will make it possible to adjust the welding process on the fly through automatic control of the welding parameters. The formation of solidification cracks during welding depends mainly on metallurgical and thermo-mechanical factors. According to established theories, from a thermo-mechanical point of view, the strain and strain rate arising near the solidification front during welding are responsible for solidification cracks. A solidification crack occurs in the so-called mushy zone, i.e. the zone immediately behind the weld pool where solidification is still incomplete. If the strains exceed the ductility of the material within the mushy zone, solidification cracks will appear. The formation of cracks can therefore be determined by measuring the strain state in the vicinity of the weld pool. Optical flow estimation can be used to calculate displacement and strain fields [1, 2]. Its main disadvantage is the high computational cost, which makes real-time monitoring difficult in practical applications.

In recent years, machine learning methods based on neural networks have been widely researched and applied in industrial fields. Among them, the convolutional neural network (CNN) [3] was first proposed for the recognition of handwritten digits and then achieved remarkable success in large-scale classification on the ImageNet data set [4]. Nowadays, the CNN is widely employed in the computer vision field due to its good performance, e.g. in image classification, object detection, and action recognition. Crack detection can be regarded as a binary classification problem in image classification, and by applying classic networks, cracks or defects in static images can be successfully detected. However, these networks cannot capture motion information. In the welding process, cracks can occur at any time and the duration of the process is unknown, so locating the start and end time of crack formation in untrimmed videos is more complex. If frames are analysed individually, the moment at which tiny cracks form is hard to identify, leading to a delayed alarm. In this situation, it is beneficial to consider temporal information.

Action recognition is an extension of image classification. It has received much attention in recent years and has been widely applied in video analysis tasks, such as monitoring abnormal events in surveillance cameras. The commonly applied models can be divided into three types: (1) The two-stream model [5] contains two CNNs, one taking the optical flow as input to extract motion information and the other taking RGB images as input. Each stream is followed by a softmax layer, and the streams can be fused by averaging or by an SVM. (2) Long short-term memory (LSTM) [6] has achieved outstanding progress in processing data sequences, and LRCN [7] was proposed to use a CNN model to extract spatial features and then apply an LSTM to the CNN output to extract temporal features. (3) The three-dimensional convolutional network (3D-CNN) [8] is the third popular method; it replaces the 2D convolutional kernel with a 3D kernel that contains an additional time dimension. The optimal configuration of a 3D-CNN was explored systematically, resulting in the C3D model [9]. Its successful development makes it possible to learn crack generation features from welding videos. To the best of our knowledge, there is no work on capturing the dynamic crack generation process. However, LSTM [6] and 3D-CNN [8] models have shown promising results in capturing motion features in videos. In this study, we propose two networks based on CNN-LSTM and 3D-CNN for automatic crack detection. A two-stream model has not been chosen because of its low efficiency caused by the optical flow calculation. The contributions of this work are as follows.

  • A controlled tensile weldability test (CTW test) is set up to generate digital images containing solidification cracks. With this process 14 high-quality welding videos totaling over 50,000 frames are collected.

  • Two different models based on CNN-LSTM and 3D-CNN are developed and compared. To the best of our knowledge, this is the first work to detect temporal boundaries of crack formation during the laser beam welding process.

  • The proposed models are evaluated with several previously unseen test sets, and compared with the ResNet-18 model [10] in accuracy and calculation cost. Extensive experiments demonstrate that the boundaries of cracks located by our models are more accurate than those found using ResNet-18.

2 Related work

The rapid development of deep learning makes it a tempting candidate to replace human visual inspection in fault and defect detection. Two fault diagnosis methods are proposed in [11] and [12] to address the problem of scarce faulty samples and a deficit in labelled data. In static crack detection, trained deep learning models can effectively detect cracks in different structures. A well-known CNN model has successfully found concrete cracks in civil infrastructures; combined with a sliding window technique, it can detect cracks in images of any resolution, outperforming the traditional Canny and Sobel methods [13]. A fusion framework, NB-CNN [14], has been proposed to detect cracks on metallic surfaces in nuclear power plants, applying a Naive Bayes decision scheme to reduce the false positive rate. A shallow CNN architecture based on LeNet-5 [3] was proposed for concrete surface crack detection [15]. CrackViT [16] combined the advantages of CNN and transformer networks, exploring different fusion methods for the two models and implementing pixel-level crack extraction. In addition to crack detection, [17] developed and compared four CNNs with different receptive fields and performed the classification of different types of pavement cracks.

The use of transfer learning can solve the challenge of insufficient annotated data. For example, in [18], based on the VGG-16 [19] architecture pre-trained on the ImageNet data set, the influence of the model parameters was investigated using a small data set. In [20], the AlexNet [4] architecture was trained in fully trained, transfer learning, and classifier modes and compared with six common edge detection schemes. [21] detected two common defects (crack and corrosion) with two pre-trained networks, VGG-16 and ResNet-18. Moreover, [22] compared the performance of 15 state-of-the-art convolutional neural networks for crack detection in terms of the number of parameters, the area under the curve (AUC), and the inference time.

In the welding area, the application of machine learning in quality prediction and classification has also seen a rapid increase. By learning from digital images collected by high-speed cameras, many CNN-based models have been established to predict different laser welding defects, including porosity and level misalignment [23]; blowout, humping, and undercut [24]; and slag inclusions, cracks, floating holes, and lack of fusion (false friends) [25]. In [26], welds were classified into five categories (conduction welding, stable keyhole, unstable keyhole, blowout, and pores) based on X-ray images. The work in [27] extracted features from infrared image sequences and designed an ensemble deep neural network based on CNN and gated recurrent units (GRU), enabling the detection of four critical welding defects (sagging, lack of penetration, lack of fusion, and geometric deviations of the weld seam). The authors of [28] designed a quality monitoring method for micro-plasma arc welding (MPAW), which extracts the contour of the molten pool region using Otsu's method and then applies a support vector machine (SVM) to identify lack of fusion, humping, and sound weld states.

3 Methodology

A high-level description of our system architecture is shown in Fig. 1. Data generation and pre-processing are introduced in the next section; this section describes the learning methods we applied. Two machine learning models for analysing videos of the welding process are introduced in Sect. 3.1, the architecture and hyperparameter details are provided in Sect. 3.2, and the complexity analysis is given in Sect. 3.3.

Fig. 1 Overall architecture including the two different models for crack detection

3.1 Learning temporal features

Before exploring the model architectures, we concentrate on temporal features. Including temporal aspects in the models is the main novelty of this paper.

3.1.1 CNN-LSTM model

Since the formation and propagation of cracks change over time, a CNN-based model [29] aims to learn features of previous frames by stacking multiple images together as the input. The drawback of this approach is that the temporal information collapses after each convolution operation. In order to capture both spatial and temporal information, it has been proposed to combine an LSTM with a CNN [7], exploiting the LSTM's good performance in learning from sequential data. The architecture of the combined CNN-LSTM is shown in the grey box in the middle of Fig. 1. It is a simple sequential integration of CNN and LSTM. The input consists of several consecutive frames. First, each frame is sent individually to the CNN to extract visual features; then the feature maps obtained by the CNN are flattened into a one-dimensional vector and fed into the following LSTM in sequence.
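To make this integration concrete, the following is a minimal PyTorch sketch of such a sequential CNN-LSTM. The two-layer, 128-unit LSTM and the \(2\times 4\) final feature-map size follow Sect. 3.2, but the truncated feature extractor and its channel widths are illustrative placeholders, not the exact configuration of Tables 1 and 3.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sequential CNN-LSTM: a shared 2D CNN encodes each frame,
    an LSTM aggregates the per-frame features over time."""

    def __init__(self, num_classes=2, hidden_size=128, num_layers=2):
        super().__init__()
        # Illustrative (truncated) feature extractor for grayscale frames.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((2, 4)),  # fixed-size 2x4 feature map
        )
        self.lstm = nn.LSTM(input_size=64 * 2 * 4, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                        # x: (B, T, 1, 64, 128)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))        # (B*T, 64, 2, 4)
        feats = feats.flatten(1).view(b, t, -1)  # (B, T, 512)
        out, _ = self.lstm(feats)                # (B, T, hidden)
        return self.fc(out[:, -1])               # classify the last timestep
```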

Fig. 2 Internal structure of CNN-LSTM

Figure 2 shows the internal implementation of the LSTM. The input is \(x_{t}\), the hidden state of the previous timestep is \(h_{t-1}\), and the memory cell state is \(c_{t-1}\). The outputs are \(h_{t}\) and \(c_{t}\); both retain the previous information and affect the present output. In addition, the LSTM incorporates three gates. The forget gate \(f_{t}\) determines how much of the previous hidden state should be forgotten, the input gate \(i_{t}\) determines how much of the current input should be used for the update, and the output gate \(o_{t}\) determines how much of \(c_{t}\) should be transferred to \(h_{t}\). These modules enable the LSTM to integrate previous and current data. The formulas of the LSTM are given in Eq. (1), where \(x_{t}\) denotes the input at time t, the \(W\) are weight matrices, the \(b\) are bias vectors, ‘\(\odot\)’ denotes the element-wise (Hadamard) product, and ‘\(\sigma\)’ denotes the sigmoid function, which ensures that the output of each gate lies between 0 and 1.

$$\begin{aligned}&f_{t}=\sigma (W_{xf}x_{t}+W_{hf}h_{t-1}+b_{f})\\&i_{t}=\sigma (W_{xi}x_{t}+W_{hi}h_{t-1}+b_{i})\\&\hat{c}_{t}=\tanh (W_{xc} x_{t}+W_{hc}h_{t-1}+b_{c})\\&c_{t}=f_{t} \odot c_{t-1}+ i_{t}\odot \hat{c}_{t} \\&o_{t}=\sigma (W_{xo}x_{t}+W_{ho}h_{t-1}+b_{o}) \\&h_{t}=o_{t} \odot \tanh (c_{t}) \end{aligned}$$
(1)
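Written out in code, one LSTM timestep of Eq. (1) looks as follows. This is a didactic sketch with the weight and bias tensors held in plain dictionaries (hypothetical names), not the fused implementation used by PyTorch's nn.LSTM.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep following Eq. (1); '*' is the element-wise product."""
    f_t = torch.sigmoid(x_t @ W['xf'] + h_prev @ W['hf'] + b['f'])  # forget gate
    i_t = torch.sigmoid(x_t @ W['xi'] + h_prev @ W['hi'] + b['i'])  # input gate
    c_hat = torch.tanh(x_t @ W['xc'] + h_prev @ W['hc'] + b['c'])   # candidate cell
    c_t = f_t * c_prev + i_t * c_hat                                # cell update
    o_t = torch.sigmoid(x_t @ W['xo'] + h_prev @ W['ho'] + b['o'])  # output gate
    h_t = o_t * torch.tanh(c_t)                                     # new hidden state
    return h_t, c_t
```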

3.1.2 3D-CNN model

Our second architecture is the 3D-CNN, another network widely used in video analysis. Unlike the CNN-LSTM, it can extract appearance and motion features simultaneously. Compared with 2D convolution, 3D convolution adds a depth dimension. Figure 3 illustrates the difference between multichannel 2D convolution and 3D convolution applied to images. The upper structure is a multichannel convolution, whose input is a multichannel feature map. The feature map of each channel is convolved with a kernel and the results are added; thus, the output is a single feature map. In [29], multiple images were stacked as the input to a CNN model, with each image treated as a channel. As can be seen from the figure, after the first convolution layer the temporal information is lost. The lower structure is a 3D convolution, where L represents the depth instead of the number of channels. The kernel size of the 3D convolution is \(k\times k\times d\), generally with \(d<L\), and its output feature map remains three-dimensional thanks to the 3D convolution and 3D pooling operations. A 2D kernel applied to a C-channel input has \(k\times k\times C\) weights. A 3D kernel has \(k\times k\times d\) weights, but since its output retains the depth dimension of length L, the computational cost per spatial location scales as \(k\times k\times d\times L\), considerably more than for the 2D convolution.

Fig. 3 Comparison of 2D (top) and 3D (bottom) convolution

The structure of the 3D-CNN is also shown in Fig. 1. It contains five convolution layers, each followed by a pooling layer. According to the findings for the C3D model [9], the best kernel size is \(3\times 3\times 3\), so all filters are of dimension \(3\times 3\times 3\) with stride \(1\times 1\times 1\) in the convolution layers. The kernel size in the pooling layers is \(2\times 2\times 2\) with stride \(2\times 2\times 2\), except for the first pooling layer, which uses a \(1\times 2\times 2\) kernel in order not to merge the temporal information at an early stage.
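A sketch of this 3D-CNN in PyTorch is given below. The five \(3\times 3\times 3\) convolution blocks and the \(1\times 2\times 2\) first pooling kernel follow the description above; the channel widths and the fully connected sizes are placeholders, the actual values being those of Table 3.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool):
    # 3x3x3 convolution with stride 1, followed by 3D max pooling
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.ReLU(),
        nn.MaxPool3d(kernel_size=pool, stride=pool),
    )

class Crack3DCNN(nn.Module):
    def __init__(self, num_classes=2, widths=(32, 64, 128, 196, 196)):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, widths[0], pool=(1, 2, 2)),  # keep the time dim early
            conv_block(widths[0], widths[1], pool=(2, 2, 2)),
            conv_block(widths[1], widths[2], pool=(2, 2, 2)),
            conv_block(widths[2], widths[3], pool=(2, 2, 2)),
            conv_block(widths[3], widths[4], pool=(2, 2, 2)),
        )
        # after five poolings a 64x128 frame is reduced to 2x4 (cf. Sect. 3.2)
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.LazyLinear(256), nn.ReLU(),   # two fully connected layers
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                    # x: (B, 1, T, 64, 128)
        return self.classifier(self.features(x))
```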

3.2 Model designs

The feature learning module consists of three main layer types: convolutional, pooling, and fully connected layers. The network can be built in various ways by setting different hyperparameters and choosing different sequences of layers. Model performance can improve as the model becomes more complex, and the mainstream models (LeNet, AlexNet, VGG, GoogLeNet [30], and ResNet) are designed increasingly deep. However, time-consuming models are inappropriate here, as an on-line crack detection system requires an immediate response. In this section, we systematically build networks of different depth (number of layers) and width (number of filters) to investigate the resulting gain in accuracy.

Compared with the images in the ImageNet database, welding crack data contain fewer categories and simpler features, so we empirically design models with six to eight layers (models A, B, C in Table 1); the pooling operation is not counted as a layer because it contains no parameters. In general, performance improves with increasing CNN depth, because each layer can focus on different features. Next, we fix the network depth and change the width (models D, E). Following [31], the last column of Table 1 gives the theoretical time complexity (relative to model A), calculated by Eq. (2):

$$\begin{aligned} \begin{aligned} O\left( \sum _{l=1}^{L} D_{k} ^{2}\cdot N_{l-1} \cdot N_{l} \cdot D_{w} \cdot D_{h}\right) \end{aligned} \end{aligned}$$
(2)

where L is the depth of the CNN, \(N_{l-1}\) and \(N_{l}\) are the numbers of input and output feature maps of layer l, \(D_{k}\) is the spatial size of the convolutional kernel, and \(D_{w}\), \(D_{h}\) are the spatial dimensions of the output feature map.
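Eq. (2) is straightforward to evaluate per architecture. The helper below sums the per-layer terms; the layer tuples in the example are hypothetical, not the actual configurations of Table 1.

```python
def conv2d_stack_complexity(layers):
    """Theoretical time complexity of a stack of 2D conv layers, Eq. (2).
    Each entry: (k, n_in, n_out, out_w, out_h) for kernel size D_k,
    feature-map counts N_{l-1}, N_l, and output size D_w x D_h."""
    return sum(k * k * n_in * n_out * w * h
               for (k, n_in, n_out, w, h) in layers)

# Relative cost of a hypothetical wider model against a baseline:
base = conv2d_stack_complexity([(3, 1, 32, 64, 128), (3, 32, 64, 32, 64)])
wide = conv2d_stack_complexity([(3, 1, 64, 64, 128), (3, 64, 128, 32, 64)])
print(wide / base)  # analogous to the last column of Table 1
```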

Table 1 Configuration of different models
Fig. 4 Effects of network depth and width

Figure 4 shows the training accuracy of the different models. The results illustrate that increasing the depth can improve accuracy: the accuracy of models B and C is higher than that of A. But after reaching a critical value, the accuracy saturates; model C is not significantly more accurate than model B. The effect of width is less pronounced than that of depth: compared to B, the accuracy of models D and E is only slightly improved, while their complexity is higher. Hence, for our type of data, further increasing the depth and width does not improve accuracy but raises the complexity, which is unaffordable and unnecessary. To strike a good balance between cost and accuracy, based on Table 1 and Fig. 4, we ultimately chose model D as the feature extraction network.

In the LSTM module, two hyperparameters affect accuracy and computational cost: the number of LSTM layers and the number of features in the hidden units. The best validation accuracies for different configurations are compared in Table 2. The accuracy reaches \(95.56\%\) with two LSTM layers and 128 hidden nodes and cannot be significantly improved with larger configurations.

Table 2 Maximum validation accuracy for different LSTM configurations

For the 3D-CNN, the structure is consistent with the convolutional layers of the CNN-LSTM, but inflated from 2D to 3D. The configurations of CNN-LSTM and 3D-CNN are listed in Table 3. The input size is \(T \times 64\times 128\), i.e. the number of input images times their height and width. The selection of T is discussed in Sect. 5.2. Each pooling layer halves the size of the feature map, so that at L5 the spatial size is reduced to \(2\times 4\). Then the feature maps are flattened to vectors and fed into the LSTM module in the CNN-LSTM model or into two fully connected layers in the 3D-CNN model. Finally, the softmax function scales the output into a probability distribution over [0, 1], and the class with the maximum probability is taken as the final classification. The other hyperparameters are standard settings: kernel size \(=3\), number of epochs \(=20\), batch size \(=64\), learning rate \(=10^{-4}\), dropout \(=0.5\), the adaptive moment estimation (Adam) optimizer, and the cross-entropy loss function.
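Under these settings, a minimal training loop would look as follows. This is a sketch assuming the model classes from Sect. 3.1 and a prepared train_loader; note that PyTorch's CrossEntropyLoss applies the softmax internally.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Crack3DCNN().to(device)          # or CNNLSTM(); see the sketches above
criterion = torch.nn.CrossEntropyLoss()  # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(20):                          # number of epochs = 20
    for clips, labels in train_loader:           # assumed DataLoader, batch size 64
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)   # cross-entropy loss
        loss.backward()
        optimizer.step()                         # Adam update
```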

Table 3 Configuration of the models

3.3 Complexity analysis

Finally, we calculate the time complexity of the proposed models. The complexity of the convolutional layers has been given in the previous section. The trainable parameters of the LSTM follow from Eq. (1): \(W_{xf}, W_{xi}, W_{xc}, W_{xo}\) are matrices of size \(m\times n\), where m and n are the dimensions of the hidden state and the input; \(W_{hf}, W_{hi}, W_{hc}, W_{ho}\) are matrices of size \(m\times m\); and \(b_{f}, b_{i}, b_{c}, b_{o}\) are bias vectors of size \(m\times 1\). The total number of parameters in an LSTM block is therefore \(W=4m\left( m+n+1 \right)\) [32], and the computational complexity is \(O(W)=O\left( m^{2} +mn \right)\). The CNN-LSTM complexity is thus the sum of the complexities of the convolutional layers and the LSTM layers (Eq. 3).

The 3D-CNN complexity is modified to Eq. (4), as its output feature maps have size \(D_{w} \times D_{h} \times D_{d}\). The complexity of the CNN-LSTM model increases linearly with the input length T, whereas in the 3D-CNN, with \(N_{0} = T\), T only affects the complexity of the first convolutional layer. Compared to CNN-LSTM, its computational cost therefore grows more slowly with T. In Sect. 5.3, we measure and compare the time complexity through the number of floating point operations (FLOPs) of the model implementations.

$$O\left( T\cdot \left( \sum _{l=1}^{L} D_{k}^{2}\cdot N_{l-1} \cdot N_{l} \cdot D_{w} \cdot D_{h}+m^{2}+mn \right) \right)$$
(3)
$$O\left( \sum _{l=1}^{L} D_{k}^{3}\cdot N_{l-1} \cdot N_{l} \cdot D_{w} \cdot D_{h} \cdot D_{d}\right)$$
(4)
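The different scaling with T can be illustrated numerically. In the sketch below, all cost constants are hypothetical placeholders chosen only to show that Eq. (3) grows linearly in T, while in Eq. (4) T enters only via the first layer.

```python
# Hypothetical cost constants (FLOPs); only the scaling behaviour matters.
CONV2D_COST = 1.0e8                  # per-frame cost of the 2D CNN, Eq. (2)
LSTM_COST = 128**2 + 128 * 512       # m^2 + m*n per timestep (m=128, n=512)
FIRST_3D_PER_FRAME = 2.0e7           # first 3D conv layer, cost per input frame
REST_3D = 5.0e8                      # layers 2..L, taken as independent of T

for T in (16, 32, 48):
    cnn_lstm = T * (CONV2D_COST + LSTM_COST)   # Eq. (3): linear in T
    cnn_3d = T * FIRST_3D_PER_FRAME + REST_3D  # Eq. (4): T only in layer 1
    print(f"T={T}: CNN-LSTM ~{cnn_lstm:.2e}, 3D-CNN ~{cnn_3d:.2e} FLOPs")
```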

4 Data generation

In this section, we describe the data collected in experiments performed in the welding laboratory at BAM.

4.1 Material and welding parameters

The welding experiments were carried out with the TruDisk 16002 disc laser from TRUMPF, with a maximum output power of 16 kW, a wavelength of 1030 nm, and a beam parameter product of 8 mm·mrad. The experiments were performed on sheets of the austenitic steel grades 1.4301 (AISI 304) and H400 (EN 1.4376) with a thickness of 1.5 mm. The welding parameters were 2 kW laser power and a constant welding speed of 1.2 m/min at a focus position of +5 mm. Argon with a flow rate of 20 l/min was used as the shielding gas.

Fig. 5 Schematic representation of the experimental procedure

4.2 Solidification crack generation

The solidification cracks were generated during laser beam welding using an externally restrained hot cracking test. This type of hot cracking test was developed to force cracking in the specimen by external stress or strain. In these experiments, the controlled tensile weldability test (CTW test) was employed, in which the weld specimen is subjected to a predefined strain and strain rate perpendicular to the welding direction during welding, as shown schematically in Fig. 5. The externally applied strain increases the local strains in the hot-crack-critical region (mushy zone) and generates cracks during welding. During the CTW test, strains of 5\(\%\) and 7\(\%\) were applied for steel grade 304 and a strain of 5\(\%\) for steel grade H400 while welding. These strain parameters cause cracks with a length between 12 mm and 18 mm in the weld; the strain variation affects only the generated crack length. Figure 6 shows the experimental setup in combination with the CTW test.

Fig. 6 Experimental setup for laser beam welding in combination with the hot cracking test (CTW test) and the applied coaxial digital camera

4.3 Data acquisition

The welding process was recorded using an sCMOS camera installed coaxially with the laser beam path. In order to obtain a valid recording of the welding process and the re-solidified material, an external laser illumination with a wavelength of 808 nm and an interference filter of the same wavelength with a bandwidth of 20 nm were required, as shown schematically in Fig. 5. The filter was placed in front of the camera, allowing only the wavelength of the illumination to pass and suppressing all other spectral ranges, thus eliminating optical disturbances during recording. Figure 7 shows two frames before and after crack initiation. Camera recordings were made for 15 cracked welded joints at a recording rate of 800 fps. The images were stored in tif format with a resolution of \(640\times 240\) pixels.

Fig. 7 The weld seam during welding: (top) before crack formation, (bottom) after crack formation

Table 4 Experimental settings

The experiments are carried out on two different steel grades with different applied strains. Table 4 lists the experimental settings. The data sets with a strain of 5\(\%\) and a strain rate of 6 s\(^{-1}\) are used as training data. To estimate the models and tune parameters during training, 20% of the frame sequences in the training data are randomly split off as a validation set. To verify the performance of the models, the test sets are generated with a different strain of 7\(\%\) or with different strain rates of 4 s\(^{-1}\) and 8 s\(^{-1}\). In total, 14 different sets of welding data with over 50,000 frames are obtained.

4.4 Data labelling

The welded specimens were examined by X-ray after welding to determine the location of the crack in each specimen. The X-ray images serve as the ground truth, used to precisely label the image area in the videos where the crack started and propagated. They allow the precise identification of the frames in which crack formation, propagation, and termination are visible. Figure 8 shows an example of an X-ray image of a cracked weld specimen made of steel grade 304.

Fig. 8 X-ray image of a cracked weld specimen

4.5 Data pre-processing

The data need to be pre-processed before training. First, a region of interest (ROI) near the weld pool is selected to reduce the calculation cost. A resolution of \(64\times 128\) pixels is sufficient, as further enlarging the area no longer improves the accuracy but increases the latency. Figure 9 shows the clipped images at different stages: (a) the normal state; (b) the beginning of crack initiation, which at this point is difficult to distinguish from the normal state; (c) crack propagation; and (d) interference data that looks like a crack and may trigger a false alarm.

Fig. 9 Frames in different states

To expand the training samples and increase the diversity of the data set, data augmentation is utilized. The images are randomly transformed, each transform with a 50% probability, in the following ways (Fig. 10). First, since the cracks in the data set are horizontal with an inclination of no more than \(\pm 15^{\circ }\), all frames of a sequence can be rotated by the same angle, as shown in (b). (c) simulates brightness variations, and (d) shows an image processed with Gaussian blur. A code sketch of this augmentation follows below.
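The sketch below implements such clip-consistent augmentation with torchvision. Only the \(\pm 15^{\circ}\) rotation limit and the 50% probability are taken from above; the brightness range and blur kernel size are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def augment_clip(frames):
    """Apply the same random transform to every frame of a clip so that
    the temporal consistency of the sequence is preserved."""
    if random.random() < 0.5:                      # rotation within +/-15 deg
        angle = random.uniform(-15.0, 15.0)
        frames = [TF.rotate(f, angle) for f in frames]
    if random.random() < 0.5:                      # brightness variation (assumed range)
        factor = random.uniform(0.7, 1.3)
        frames = [TF.adjust_brightness(f, factor) for f in frames]
    if random.random() < 0.5:                      # Gaussian blur (assumed kernel)
        frames = [TF.gaussian_blur(f, kernel_size=5) for f in frames]
    return frames
```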

Fig. 10 Types of data augmentation

5 Results analysis

In this section, the performance of the two models is evaluated and compared. Both networks are implemented in Python using the PyTorch package [33]. They are trained on an NVIDIA Tesla P100 graphics processing unit (GPU) with 25 GB RAM. The accuracy and loss during training are presented first. Then, the accuracy and calculation cost are compared on the test sets, and finally a visualization of the optimal model is given.

5.1 Training results

Figure 11 shows the results of training and validation with respect to accuracy and loss.

Fig. 11 Comparison of accuracy and loss on training and validation sets

To train the temporal models, the input is a video clip composed of consecutive frames; thus, a window of width T frames is slid over the videos to construct the input. To study the influence of the sequence length, models with different window sizes T are built. The accuracy and loss in Fig. 11 show that the models converge faster with increasing window size T, because longer motion and temporal information can be captured. For the CNN-LSTM with \(T=16\), the maximum accuracy on the validation set is \(95.56\%\), reached at the 11th epoch; the models with \(T=32\) and \(T=48\) achieve the same accuracy already at the 9th and 7th epochs, respectively. For the 3D-CNN, the maximum accuracy is \(97.13\%\) \((T=16)\), \(99.12\%\) \((T=32)\), and \(99.52\%\) \((T=48)\), reached at the 16th, 18th, and 19th epochs, respectively. Overall, as the window size T increases, the 3D-CNN achieves higher accuracy on the validation set, and both its accuracy and loss are better than those of the CNN-LSTM.
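For reference, constructing the input clips amounts to a simple sliding window over each video (a sketch; stride 1 is used for training, larger strides are examined in Sect. 5.3):

```python
def make_clips(frames, T, stride=1):
    """Slide a window of T consecutive frames over a video."""
    return [frames[i:i + T] for i in range(0, len(frames) - T + 1, stride)]

# e.g. a 3000-frame video with T = 48 and stride 1 yields 2953 clips
```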

5.2 Evaluation results

Next, we will evaluate the performance of the trained models through the test sets introduced in Table 4.

We note that accuracy is not a robust measure when dealing with unbalanced classes, as found in the welding data sets. To improve the model evaluation and better compare the models, application-driven metrics such as sensitivity and specificity are introduced in addition to accuracy. The metrics are defined in Eq. (5). True positives (TP) are crack frames correctly predicted, false positives (FP) are normal frames mispredicted as showing cracks, true negatives (TN) are normal frames correctly predicted, and false negatives (FN) are crack frames mispredicted as normal. Sensitivity indicates the proportion of correctly identified samples among the actual positives, and specificity the corresponding proportion among the negatives; together they characterize the ability of the model to correctly identify positive and negative samples. In our crack formation detection application, early detection of cracks is paramount, so some false alarms can be accepted. Sensitivity is therefore the most important metric.

$$\begin{aligned} \begin{aligned} \text {Accuracy}&=(\textrm{TP}+\textrm{TN})/(\textrm{TN}+\textrm{FP}+\textrm{TP}+\textrm{FN})\\ \text {Sensitivity}&=\textrm{TP}/(\textrm{TP}+\textrm{FN})\\ \text {Specificity}&=\textrm{TN}/(\textrm{TN}+\textrm{FP}) \end{aligned} \end{aligned}$$
(5)
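These metrics follow directly from the confusion-matrix counts, e.g.:

```python
def metrics(tp, fp, tn, fn):
    """Direct transcription of Eq. (5)."""
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),   # share of crack frames detected
        'specificity': tn / (tn + fp),   # share of normal frames kept alarm-free
    }
```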
Fig. 12 Results of ResNet-18 on test sets

As mentioned in Sect. 3.2, very deep and complex networks may show degraded performance, as the welding videos only involve simple features. Therefore, ResNet-18 is used as the comparison model, with the maximum number of channels reduced to 196, which improves accuracy and inference time and is consistent with our proposed networks. Figure 12 shows the overall accuracy, sensitivity, and specificity of ResNet-18 on the four test sets. The sensitivity is not very high because the frames at the beginning of crack formation are not detected successfully.

Figures 13 and 14 show the corresponding results for CNN-LSTM and 3D-CNN. Compared to ResNet-18, the sensitivity of CNN-LSTM and 3D-CNN is significantly improved. Consistent with the training results, the accuracy gradually improves with increasing T. On the test set h400_5_4s, although the accuracy of the 3D-CNN decreases, the sensitivity is greatly improved to values well above \(90\%\).

Fig. 13 Results of CNN-LSTM on test sets

Fig. 14 Results of 3D-CNN on test sets

The precision–recall (PR) curve shows the ability to detect positive samples and is more sensitive to imbalanced data. Precision is the proportion of true positives among the samples predicted as positive by the classifier; it reflects whether the model can detect many positive samples with few FP results. Recall is identical to sensitivity (Eq. 6). The PR curve is produced by calculating precision and recall at different thresholds on the prediction probability. The average precision (AP) equals the area under the PR curve; a model with higher AP is better because precision and recall are then both high. From Fig. 15, it can be seen that for T set to 16 and 32, the APs of the two models are essentially equal. For \(T=48\), the 3D-CNN performs slightly better than the CNN-LSTM.

$$\begin{aligned} \begin{aligned}&\text {Precision} =\textrm{TP}/(\textrm{TP}+\textrm{FP}) \\&\text {Recall} =\textrm{TP}/(\textrm{TP}+\textrm{FN}) \end{aligned} \end{aligned}$$
(6)
Fig. 15 Precision–recall curves for different input lengths

Figure 16 displays the prediction results on the test set AISI304_5_8s; the ordinate is the predicted probability, and the frame numbers at which the crack appears and ends are given. After frame 1005, a crack starts to appear. ResNet-18 issues an alarm at frame 1041, when the crack has already formed and propagated. The frames immediately before and after crack formation are very similar, as are those around the end of the crack region, so accurately predicting the exact frame number is challenging, and raising an alarm shortly before a crack occurs is actually preferable. With an input length of 16, the CNN-LSTM and 3D-CNN models give predictions at frames 1000 and 1001. When the input window length increases to 48, they make predictions 5 and 21 frames in advance, as marked in the figure. The frames from the end of the crack until it disappears are not detected, because the motion features there are not pronounced and cannot be distinguished from interference data. In our scenario, detecting the moment of crack formation is more important than detecting its end, and therefore missing the end of the crack does not count as an error for us. Taking more images as input can predict the formation of cracks earlier, but it also requires more memory and processing time. The next section discusses the computational efficiency of different input lengths.
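The alarm logic itself reduces to finding the first sliding window whose crack probability exceeds a threshold; a sketch follows (the 0.5 threshold is an assumption, not a value reported above):

```python
def first_alarm_frame(window_probs, stride, threshold=0.5):
    """Index of the first frame covered by a window whose predicted crack
    probability exceeds the threshold; None if no window fires."""
    for w, p in enumerate(window_probs):   # one probability per sliding window
        if p > threshold:
            return w * stride              # first frame of the alarming window
    return None
```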

Fig. 16 Prediction results of the three models

5.3 Computation cost

In this section, the models are evaluated with respect to their computational cost, which is particularly important for real-time detection. Table 5 lists the processing times of the two models for different sequence lengths on the test set AISI304_5_8s, which contains over 3000 frames, as well as the per-frame processing time of ResNet-18. The processing times of CNN-LSTM and 3D-CNN are very similar and increase linearly with the sequence length.

Table 5 Calculation cost

The processing time of ResNet-18 is 0.39 ms, much faster than that of both of our models. However, these times are not directly comparable, because ResNet-18 processes only one image at a time, while the temporal models process multiple images in each step.

However, the processing of a larger sequence of frames can be sped up. It is computationally very inefficient if the sliding window moves by only one frame at a time, because more than 90% of the frames are then processed repeatedly. The displacement between two consecutive frames is very small, so increasing the sliding step can improve efficiency while maintaining accuracy. Figure 17 shows the prediction results for \(T=48\) and sliding step sizes of 1, 12, 24, and 36 frames, corresponding to sequence overlaps of \(98\%, 75\%, 50\%,\) and \(25\%\). The inset in Fig. 17a shows that increasing the sliding step size greatly reduces the inference time. With only \(25\%\) overlap, the detected crack initiation frame is delayed by more than 20 frames, but with \(50\%\) or \(75\%\) overlap there is only a difference of a few frames. In terms of efficiency, a \(75\%\) overlap achieves the same processing time as ResNet-18, while a \(50\%\) overlap is twice as fast. In summary, when a quick alarm is required so that the welding process can be reset or adjusted promptly, T can be set to 48 with a sliding step of one quarter or half of T, which meets both accuracy and efficiency requirements.
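The relationship between overlap, stride, and relative inference load can be made explicit in a few lines (reproducing the stride values of 1, 12, 24, and 36 frames used above):

```python
T = 48
for overlap in (0.98, 0.75, 0.50, 0.25):
    stride = max(1, round(T * (1 - overlap)))  # -> 1, 12, 24, 36 frames
    load = 1 / stride                          # inference passes per frame
    print(f"overlap {overlap:.0%}: stride {stride:2d}, relative load x{load:.3f}")
```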

Fig. 17 Prediction results of different overlapping sequences

5.4 Visualization

Since deep learning models are often treated as black boxes, two visualization methods are applied to the 3D-CNN model in order to interpret its output: guided backpropagation [34] and gradient-weighted class activation mapping (Grad-CAM) [35]. Figure 18 illustrates the visualization results. Figure 18a shows the original frames input to the network, covering the normal, crack formation, propagation, and end stages. Figure 18b shows the reconstructed images obtained by inverting the feature map of the last layer through guided backpropagation; the crack is clearly the most discriminative region. Figure 18c shows heat maps generated by Grad-CAM, which reflect the importance of spatial locations computed from the feature maps of the last convolutional layer and the predicted class. The crack regions are highlighted, as they are considered important for the final crack prediction.
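For reference, Grad-CAM extends naturally to the 3D case. The sketch below (assuming a model exposing its last 3D convolution module, e.g. the Crack3DCNN sketch from Sect. 3.1.2) weights the feature maps of that layer by their pooled gradients:

```python
import torch

def grad_cam_3d(model, clip, target_class, last_conv):
    """Grad-CAM [35] on the last 3D conv layer: weight its feature maps by
    the spatially averaged gradients of the class score, then apply ReLU."""
    acts, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(clip)[0, target_class]   # clip: (1, 1, T, 64, 128)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads['g'].mean(dim=(2, 3, 4), keepdim=True)  # GAP over D, H, W
    cam = torch.relu((weights * acts['a']).sum(dim=1))      # (1, D', H', W')
    return cam / (cam.max() + 1e-8)                         # normalized heat map
```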

Fig. 18 Results of the different visualization algorithms

6 Conclusions

In this paper, we have developed and studied two machine learning models, based on CNN-LSTM and 3D-CNN, to detect a common defect in welding processes, the solidification crack. To the best of our knowledge, we are the first to analyse welding videos instead of static images of welded specimens. While the method is applied here to laser beam welding, it should also be applicable to other industrial welding applications monitored by weld pool videos. The evaluation on high-quality images collected by a high-speed camera shows that both methods can learn temporal features from image sequences and accurately detect the period of crack formation. The 3D-CNN model is slightly better than the CNN-LSTM. The results are very promising and encourage us to explore this field further. The accuracy can be improved by stacking more images as input, but this also increases the processing time linearly; this can be mitigated by increasing the step size of the sliding window.

In real-time monitoring systems, time efficiency is very important. In the future, we plan to further optimize the models with respect to speed while maintaining high accuracy, studying lightweight networks such as the depthwise separable convolutions in MobileNet [36] and the pointwise group convolutions in ShuffleNet [37]. In addition, a new kind of welding experiment will be conducted to generate a challenging data set completely different from the one used in this paper. The new data set will contain cracks that lie inside the material and are invisible from the surface, so they can only be located through the strain fields. To address the high computational overhead of the strain calculation, we will explore predicting strain fields with a machine learning algorithm and then detecting cracks based on the predicted strain values.