Introduction

Product inspection is a vital phase in manufacturing to ensure product quality. Manufacturers endeavor to execute meticulous inspections, so the inspection process frequently accounts for the most substantial portion of production expenses (Chin & Harlow, 1982). Visual inspection, which scrutinizes functional and cosmetic imperfections, is known as a versatile, simple, cost-efficient, and contactless approach (Babic et al., 2021; Chin & Harlow, 1982). However, it relies primarily on the capabilities of human inspectors (Harris, 1969), and its performance is susceptible to human factors such as the inspector’s expertise, the intricacy of the inspection task, product defect rates, and repeatability (Chin & Harlow, 1982; Harris, 1969; Megaw, 1979). In other words, human factors may cause unreliable and costly outcomes despite the merits of visual inspection.

These issues are particularly evident in coating inspection, where they translate directly into production loss: coated surfaces are highly vulnerable to flaws, defects, or discrepancies, so both precision and speed are essential (Doering et al., 2004). Rapid and accurate inspection is therefore paramount, alongside the need to mitigate reliance on human factors and avoid those risks. As one potential solution, machine vision reduces human error through the automation of data acquisition, analysis, and image assessment (Alonso et al., 2019). Nonetheless, its application is often restricted to specific situations with limited adaptability (Psarommatis et al., 2019), necessitating extensive feature engineering and sophisticated algorithms to function effectively across diverse scenarios.

Deep learning has enhanced visual inspection by automatically extracting features from data even in intricate and high-dimensional scenarios (LeCun et al., 2015, p. 2; Ren et al., 2022; Rusk, 2016). Deep learning can be classified into supervised and unsupervised approaches, each with inherent challenges. Supervised learning necessitates extensive manually annotated datasets for model training, which can be labor-intensive and impede the efficiency of deep learning applications (Shinde et al., 2022; Wang & Shang, 2014). Unsupervised learning, on the other hand, does not require annotated data but relies on manually set thresholds, often resulting in unacceptable performance in product inspection tasks (Chow et al., 2020; Kozamernik & Bračun, 2020). Minimizing human intervention in data labeling is thus important to enhance the efficacy of deep learning-based inspection methodologies.

This study focuses on advancing the capabilities of automated visual inspection (AVI) by developing a novel framework that autonomously annotates and inspects products with high accuracy. Given the cost-effectiveness and high precision the manufacturing sector requires, this research targeted pinpointing coating interfaces on fuel injection nozzles, which demands pixel-level accuracy. The novelty of this research lies in the application of self-supervised learning and autonomous data annotation algorithms, eliminating the need for human labeling while precisely locating the coating interfaces. The resulting AVI framework is capable of pinpointing coating interfaces from scratch via autonomous data annotation and is expected to improve the efficiency and reliability of the inspection process in real-world practical applications.

Moreover, most deep learning applications do not explain their autonomous decision-making processes, which hinders understanding of those processes and undermines the reliability of the models (Gunning et al., 2019). Explainable artificial intelligence (XAI) offers transparency in decision-making and aids the robustness and applicability of deep learning models (Cooper et al., 2023; Lee et al., 2022; Liu et al., 2022; Wang et al., 2019, 2023). Consequently, this study investigated interpretability and explainability to validate the reliability of the deep learning models based on t-distributed stochastic neighbor embedding (t-SNE) and integrated gradient techniques.

The major contributions are outlined as follows:

  • The autonomous deep learning-based coating inspection framework is proposed. It identifies coated nozzles, generates self-labeled datasets, trains models, and pinpoints coating interfaces without human intervention.

  • The interpretation and explanation of the autoencoder (AE) and convolutional neural network (CNN)-based detection models are provided based on the concept of XAI to validate the reliability of the developed models.

The remainder of this paper is organized as follows. “Related work” Section provides an overview of the related literature in machine vision, computer vision, deep learning, and visual inspection applications. “Methodology” Section explains the proposed framework. The detailed analysis and discussion follow in “Results and discussions” Section. Lastly, “Conclusion” Section presents the concluding remarks of the research.

Related work

Industry 4.0 catalyzes the application of data-driven approaches in quality management, which plays a critical role in productivity and reliability in the manufacturing domain (Psarommatis et al., 2022). The surge in customer demand for diverse products increases the complexity of products and production systems and necessitates flexible and versatile automated inspection solutions for effective quality control (Jacob et al., 2018; Psarommatis et al., 2019). The emergence of contemporary technologies such as deep learning, the internet of things, sensing, and computer vision initiates a paradigm shift in product inspection techniques (Oztemel & Gursev, 2020).

Katırcı et al. (2021) introduced a novel automated inspection technique based on electrical and thermal conductivities. Their approach, however, necessitates immersing an object in a solution or scattering powders over it, potentially causing damage and deformation in some cases. To prevent damage to products, nondestructive and contactless inspection solutions have been investigated. A laser displacement sensor is one such nondestructive and contactless automated inspection method; its high resolution and sampling rate make it well suited to precisely measuring surface thickness in real time without contact (Gryzagoridis, 2012; Nasimi & Moreu, 2021). However, it is not applicable to identifying the diverse textures of fuel injection nozzle interfaces. Additionally, such approaches demand complicated and finely calibrated measurement systems compared to machine vision-based AVI using Charge-Coupled Device (CCD) or Complementary Metal–Oxide–Semiconductor (CMOS) cameras. Owing to their versatility and cost-effectiveness, machine vision systems are implemented as autonomous pipelines that capture and process images and detect geometric or texture features for decision-making against defined criteria (Golnabi & Asadpour, 2007; Noble, 1995; Ren et al., 2022).

As captured images often suffer from noise and uneven contrast and brightness, the Fourier transform (Brigham & Morrow, 1967) and the wavelet transform (Graps, 1995) help adjust images, eliminate unnecessary data, and emphasize useful information. The two-dimensional discrete Fourier transform (Gonzalez & Faisal, 2019) is widely utilized for image processing, and the wavelet transform is a useful tool for denoising and finding a sparse representation of an image (Khatami et al., 2017). The Stationary Wavelet Transform (SWT) is known for offering an approximation to investigate singularities at the interface while maintaining the consistency of image size (Nason & Silverman, 1995; Wang et al., 2010). Despite the benefits of denoising and highlighting features (Bai & Feng, 2007), Ren et al. (2022) stated that transform techniques require high computing costs. Their survey also covered detailed wavelet transform research including denoising (Jain & Tyagi, 2015; Luisier et al., 2007), image fusion (Daniel, 2018; Xu et al., 2016), and image enhancement (Jung et al., 2017; Yang et al., 2010) and categorized visual inspection approaches into classification, localization, and segmentation problems.

Traditionally, visual inspection methods such as the histogram technique, histogram of oriented gradients, scale-invariant feature transform, and speeded-up robust features have been used (Ren et al., 2022). However, they have drawbacks such as computational load, high requirements for image quality, and disregard of spatial features. The support vector machine (SVM) is a prevalent machine learning approach and has proven advantageous for classification problems. Nonetheless, the SVM is not suitable for complex multi-class problems and requires manual hyperparameter tuning. The k-means clustering algorithm is an unsupervised machine learning approach that can be applied to multi-class problems. For instance, Park et al. (2008) demonstrated an AVI framework implementing computer vision and the k-means clustering algorithm to find defects in cigarette packing by segmenting regions in a package. The k-means clustering algorithm, however, is also ill-suited to large and complex datasets (Guan et al., 2003).

Recent advances in deep learning circumvent the shortcomings of these traditional approaches. Deep learning techniques automatically extract features from datasets and help develop improved AVI frameworks. The CNN is the most popular supervised learning architecture in computer vision, with applications including self-driving, facial identification, robotic manufacturing, and remote space exploration that have yielded substantial advancements (Gu et al., 2018; LeCun et al., 1998; Lee et al., 2022; Li et al., 2022; Park et al., 2023; Yun et al., 2023a, 2023b), and CNNs have been broadly employed in AVI applications (Alonso et al., 2019; Park et al., 2016; Singh & Desai, 2022; Wang et al., 2019; Yun et al., 2020).

In terms of coating inspection, Ficzere et al. (2022) presented an AVI method using RGB images from a CCD/CMOS camera that could be readily applied in the industrial field and showed the feasibility of accurately inspecting tablet coating quality in real time using machine vision and deep learning techniques. Despite its advantages, their study focused on detection and classification rather than pinpointing defects and required manually labeled datasets. As few references pinpoint a defect within a few pixels of error, this paper aimed to develop a framework that precisely locates the interface of a fuel injection nozzle while minimizing manual data labeling and feature engineering with a simple data acquisition configuration. In short, the shortcoming of supervised learning architectures is the necessity of labeled data (Wang & Shang, 2014), implying that human intervention in labeling is indispensable for assembling training datasets.

The AE is an unsupervised or semi-supervised learning model composed of encoder and decoder layers bridged by a latent space (LS) that compresses data and extracts distinct features (Yun, 2023a, 2023b). It enables unsupervised feature learning, assesses data similarity based on the reconstruction error from the loss function, and can be a useful tool to improve deep learning performance at the initial stage of developing another model (Bengio et al., 2014). Erhan et al. (2010) introduced the concept of unsupervised pre-training to improve deep learning performance, and Feng et al. (2020) proposed a self-taught learning technique using unlabeled data to enhance the detection performance on target samples. Both studies emphasized that unsupervised feature learning can effectively substitute for manual data annotation.

In the visual inspection domain, Chow et al. (2020) implemented a convolutional AE model to detect defects in concrete structures based on an anomaly detection method comparing reconstruction errors of surface images. Their model was trained on defect-free images, which avoided a labeling process. Nevertheless, a threshold had to be defined to distinguish defects, and additional analysis was required to obtain their exact locations. Moreover, recall and precision ranged from 30.1% to 91.3%, limiting implementation in real-world inspection applications in the manufacturing domain. Kozamernik and Bračun (2020) introduced a method to automatically detect defects on the surface of an automotive part. The model, however, did not provide the position of the defects within a few pixels of error, and a specific accuracy value was not reported. Studies on pinpointing a particular surface within a few pixels of error based on texture differences for practical industrial applications remain limited.

In short, although deep learning resolves many issues in the visual inspection domain, supervised CNN approaches still require human effort to annotate data, and unsupervised AE methods necessitate defining a threshold and do not attain sufficient accuracy for practical applications. This prompted the development of an advanced AVI framework in the present study, which combines the benefits of automated data labeling via the AE’s unsupervised feature learning and the precise pinpointing capabilities of the CNN’s supervised feature learning. Meanwhile, many deep learning models act as black boxes and do not clearly reveal their decision-making processes, which hinders understanding and undermines the credibility of the models. To address this, XAI is receiving significant attention for its ability to enhance the reliability of deep learning models. The XAI approach is advantageous for visualizing data structure and interpreting the prediction basis to provide intuitive insights into a model’s decision-making. Recent studies have leveraged these advantages to substantially improve the explainability of AVI models. Al Hasan et al. (2023) introduced an interpretable and explainable AVI system to detect hardware Trojans and defects in semiconductor chips. Gunraj et al. (2023) developed SolderNet, an explainable deep learning system aimed at improving the inspection of solder joints in electronics manufacturing by providing a more transparent approach.

As a prominent XAI method for data visualization, t-SNE (Maaten & Hinton, 2008) is an unsupervised algorithm that visualizes high-dimensional data by reducing its dimensionality to levels, typically two or three dimensions, that are visually perceptible by humans. The t-SNE is particularly effective in preserving probabilistic similarities among samples when translating high-dimensional data into a lower-dimensional space, thereby revealing complex data structures and patterns. In contrast, principal component analysis (PCA) (Wold et al., 1987), while a standard approach for data reduction and visualization, is limited by its linear transformation focused on maximizing data variance and thus struggles with non-linear data relationships. Uniform manifold approximation and projection (UMAP) (McInnes et al., 2020), while efficient and adaptable for large datasets, occasionally merges distinct clusters, potentially obscuring the finer local structures that are more effectively preserved by t-SNE.

Additionally, integrated gradients, which attribute a network’s output to its input features, are computed by accumulating gradients along a path from a baseline to the input (Lundstrom et al., 2022). The heatmap technique, a popular visualization tool in the computer vision domain, uses the calculated integrated gradients to identify the most important parts of an input image for a deep learning model’s classification decision (Qi et al., 2020; Selvaraju et al., 2020). The deletion metric quantitatively measures the impact of removing these significant parts; if they are identified correctly, their removal should hinder accurate detection by the model (Petsiuk et al., 2018).

Based on the surveyed literature, this study combined AE and CNN architectures to conjugate their advantages and built a feasible AVI framework for coating inspection based on self-supervised learning and autonomous data annotation, mitigating the resources required to establish the models. The explainability of the developed models was investigated to ensure their credibility. The following section describes the detailed methodology.

Methodology

This study proposes an AVI framework pinpointing coating interfaces with an autonomous data annotation technique to mitigate the necessity of human intervention. As Fig. 1 illustrates, the proposed framework comprises six parts: (1) Classifying nozzle types, (2) Cropping uncoated surface images, (3) Training autoencoder model 1 (AE1), (4) Generating datasets automatically annotated as coated and uncoated surface classes, (5) Training a CNN-based detection model, and (6) Improving performance and validating the trustworthiness of the models. First, the framework identifies two uncoated nozzles with the highest GLCM energy values to train AE1. AE1 extracts features from uncoated nozzles and classifies mixed nozzle images into coated and uncoated classes. Subsequently, autoencoder model 2 (AE2) is trained with cropped surface images from the classified uncoated nozzles. Coated and uncoated partial surface images are autonomously extracted and annotated according to the region exhibiting the highest reconstruction error on each nozzle. Thirdly, an initial CNN-based detection model is constructed using the automatically collected and labeled images. This model is refined through an iterative training strategy with transfer learning, using datasets generated by a previously trained CNN model to improve accuracy. Lastly, the YOLOv8 (Jocher et al., 2023) algorithm accelerates localization by narrowing the detection area.

Fig. 1 Pipeline of the proposed framework

This framework addressed not only the manual data annotation issue but also the imbalance of the initial dataset. As an unsupervised learning approach, AE is capable of training a model with unclassified data. With just the two uncoated nozzle images identified by the GLCM method, AE1 was effectively trained, categorizing the imbalanced dataset into distinct coated and uncoated nozzle classes. AE2, in turn, was developed using exclusively uncoated surface images, thereby removing the dependency on dataset balancing. For the construction of the CNN models, cropped surface images were utilized; since coated nozzle images contain both coated and uncoated surfaces, balanced datasets could be generated from them. While the exact number of images in each class varied during the iterative learning process, approximately 120,000 coated and 100,000 uncoated surface images were collected.

In this section, the proposed framework is explained in detail. First, the image acquisition process is demonstrated. The second and third sections explain the texture analysis process and the two AE models that classified the types of nozzles and surfaces and generated an initial training dataset. Subsequently, how CNN-based detection models were built is illustrated. Lastly, the final deep learning model fabricated by integrating the YOLOv8 algorithm and the best CNN-based detection model is described.

Procedure of data collection

Figure 2 depicts the setup used to collect raw fuel injection nozzle images. The image acquisition system comprised an industrial monochrome camera (Cognex In-Sight 9000) connected to a laptop with an AMD Ryzen 5 5500U, 8 GB RAM, a 256 GB SSD, and Windows 11. Every side of a part was captured by placing a nozzle on a turntable rotating at 1.6 rpm. Direct lighting was applied by an LED desk lamp to provide constant brightness. Eventually, 15 images per nozzle were acquired at 4096 × 3000 pixel resolution, yielding a dataset of 75 images from 5 uncoated and 525 images from 35 coated nozzles. Since the collected images included unnecessary background, the region of interest (ROI) of each image was cropped to 512 × 1024 pixels by a computer vision algorithm developed to detect boundaries as significant differences between adjacent pixels in both the vertical and horizontal directions. The image was cut further inward horizontally to exclude the curved interface region. The resultant images were mixed to begin this investigation from scratch. After the classification of the mixed image set into uncoated and coated nozzle categories by the gray level co-occurrence matrix (GLCM) and AE methods, the set was further divided into training and test sets as per the requirements of the deep learning models. Table 1 details the configuration of the dataset.
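As a rough illustration of the boundary-based cropping described above, the sketch below is a simplified stand-in, not the exact production algorithm; the threshold value and helper name are assumptions. It marks the ROI where mean absolute differences between adjacent rows and columns are large.

```python
import numpy as np

def crop_roi(gray: np.ndarray, thresh: float = 10.0) -> np.ndarray:
    """Crop a grayscale image to the region bounded by strong intensity
    transitions between adjacent rows and columns (illustrative only)."""
    img = gray.astype(np.float32)
    # Mean absolute difference between adjacent columns / rows
    col_diff = np.abs(np.diff(img, axis=1)).mean(axis=0)
    row_diff = np.abs(np.diff(img, axis=0)).mean(axis=1)

    cols = np.where(col_diff > thresh)[0]
    rows = np.where(row_diff > thresh)[0]
    if cols.size == 0 or rows.size == 0:
        return gray  # no clear boundary found; keep the full frame

    return gray[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```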

Fig. 2 Experimental setup for image acquisition and processing steps

Table 1 Configuration of dataset

GLCM & autoencoder model 1—classification between coated and uncoated parts

The AE1 model was developed under the hypothesis that it would produce different reconstruction errors for coated and uncoated parts owing to their distinct features, and that the reconstruction errors would serve as classification metrics. However, training an AE model requires a pre-categorized set of images, which was not available at the early stage of this investigation. To address this issue, a GLCM method was employed to obtain an initial categorized image set for AE training. Proposed by Haralick (1979), GLCM is a statistical texture analysis technique that quantifies a two-dimensional histogram of paired pixels within a specific spatial distance.

The GLCM of an image presents the frequency of pairs of pixels with specific intensities based on its parameters (Gadkari, 2004), as shown in Fig. 3. \(x\) and \(y\) are the horizontal and vertical coordinates of a point in an image, and the size of the image is represented by \(w\) and \(h\). The angle \(\theta \) is measured as the rotation from the horizontal line passing through a reference point to the line connecting the reference point and another point; it is typically quantized to 0, 45, 90, and 135 degrees. Furthermore, \(i\) and \(j\) are the indices of the co-occurrence matrix and indicate pixel intensities, and \(d\) is the distance between the reference point and the other point. In consequence, the \((i,j)\)th entry of the GLCM \(g\) is defined as Eq. (1).

$$ \begin{aligned} g_{ij,\theta = 0^{\circ}} &= \sum_{x = 0}^{w} \sum_{y = 0}^{h} \begin{cases} 1, & \text{if } I(x,y) = i \text{ and } I(x + d,\, y) = j \\ 0, & \text{otherwise} \end{cases} \\ g_{ij,\theta = 45^{\circ}} &= \sum_{x = 0}^{w} \sum_{y = 0}^{h} \begin{cases} 1, & \text{if } I(x,y) = i \text{ and } I(x + d,\, y - d) = j \\ 0, & \text{otherwise} \end{cases} \\ g_{ij,\theta = 90^{\circ}} &= \sum_{x = 0}^{w} \sum_{y = 0}^{h} \begin{cases} 1, & \text{if } I(x,y) = i \text{ and } I(x,\, y - d) = j \\ 0, & \text{otherwise} \end{cases} \\ g_{ij,\theta = 135^{\circ}} &= \sum_{x = 0}^{w} \sum_{y = 0}^{h} \begin{cases} 1, & \text{if } I(x,y) = i \text{ and } I(x - d,\, y - d) = j \\ 0, & \text{otherwise} \end{cases} \end{aligned} $$
(1)

With a GLCM distance parameter of 1 and angles of 0, 45, 90, and 135 degrees, GLCM energy values were calculated for the whole surface of each nozzle part by Eq. (2). The two nozzle parts with the highest GLCM energy values were selected for training AE1, since a higher GLCM energy value indicates a more uniform texture, which is characteristic of uncoated parts.

Fig. 3 Parameters of GLCM

$$Energy= \sum_{i}\sum_{j}{g}_{ij}^{2}$$
(2)
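For reference, the GLCM and its energy can be computed with scikit-image; the sketch below mirrors the distance and angle settings reported above and uses the library's 'ASM' property, which equals the sum of squared GLCM entries in Eq. (2). The helper name is an assumption.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_energy(gray: np.ndarray, levels: int = 256) -> float:
    """Mean GLCM energy over the four standard angles at distance d = 1.

    gray: uint8 grayscale image with values in [0, levels).
    """
    glcm = graycomatrix(
        gray,
        distances=[1],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
        levels=levels,
        symmetric=False,
        normed=True,
    )
    # 'ASM' (angular second moment) is the sum of squared GLCM entries,
    # matching Eq. (2); skimage's 'energy' property is its square root.
    return float(graycoprops(glcm, "ASM").mean())
```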

Subsequently, autoencoder model 1 (AE1) was developed to fully separate the mixed images into coated and uncoated nozzle images. AE1 comprised two convolutional layers with a two-pixel stride and three-pixel zero padding, connected to fully connected layers and a latent space with 64 neurons, as depicted in Fig. 4. This architecture was determined through a grid search. Among the gathered training images, one uncoated part was used to train the model, and another was used to obtain the maximum reconstruction error. The Adam optimizer was employed with a learning rate of 0.001, the mean squared error (MSE) loss function was applied, and the model was trained for 100 epochs. Training was performed with PyTorch 1.13.1 with GPU acceleration (Paszke et al., 2019) on Ubuntu 20.04.5 LTS with an Intel i7-11700K CPU, 64 GB RAM, and an Nvidia RTX A5000 GPU. The same system was also used to develop the other deep learning models of this investigation.
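A minimal PyTorch sketch of an AE1-like model is shown below. Only the stride, padding, latent width, optimizer, and loss are specified above, so the kernel sizes, channel counts, and input resolution are illustrative assumptions rather than the exact published architecture.

```python
import torch
import torch.nn as nn

class AE1(nn.Module):
    """Illustrative autoencoder: two stride-2 convolutions with 3-pixel zero
    padding, a 64-unit latent space, and a mirrored decoder (widths assumed)."""
    def __init__(self, in_shape=(1, 512, 1024), latent_dim=64):
        super().__init__()
        c, h, w = in_shape
        self.encoder_conv = nn.Sequential(
            nn.Conv2d(c, 8, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        )
        feat = 16 * (h // 4) * (w // 4)          # feature size after two stride-2 convs
        self.to_latent = nn.Linear(feat, latent_dim)
        self.from_latent = nn.Linear(latent_dim, feat)
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, c, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )
        self._feat_shape = (16, h // 4, w // 4)

    def forward(self, x):
        z = self.to_latent(self.encoder_conv(x).flatten(1))
        y = self.from_latent(z).view(-1, *self._feat_shape)
        return self.decoder_conv(y)

# Training setup matching the text: Adam, lr = 0.001, MSE loss
model = AE1()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```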

Fig. 4 AE Model 1 configuration

The performance evaluation of AE1 was anchored on the maximum reconstruction error, which was automatically established at 1.529 during prediction by AE1 on the validation set. Notably, the minimum reconstruction error recorded for the coated nozzle parts was 1.565, consistently exceeding the maximum reconstruction error of 0.812 recorded for the uncoated nozzle parts. This led to a flawless 100% classification accuracy by AE1 in distinguishing between coated and uncoated nozzle parts. The findings were further supported by Fig. 5, which presents the GLCM energy analysis. While GLCM alone was not fully capable of differentiating between coated and uncoated parts, it did reveal a pattern: uncoated parts predominantly registered higher GLCM energy values than their coated counterparts.

Fig. 5 GLCM energy values of the nozzle images

Autoencoder model 2—generating training dataset without manual data labeling

Figure 6 illustrates the GLCM analysis on cropped coated and uncoated surfaces of 512 × 8 pixels. In this analysis, GLCM correlation values were computed by Eq. (3), where \(\mu \) and \(\sigma \) are the mean and standard deviation of \(g\). The correlation was used to quantify linear gray-level dependencies across each type of surface (Gadkari, 2004).

Fig. 6 GLCM analysis on coated and uncoated surfaces

$${\text{Correlation}}= \frac{\sum_{i}\sum_{j}\left(ij\right){g}_{ij}-{\mu }_{x}{\mu }_{y}}{{\sigma }_{x}{\sigma }_{y}}$$
(3)

While uncoated surfaces generally yielded higher GLCM correlation values, this was insufficient for precisely distinguishing between the two types of surfaces. Specifically, a threshold on the GLCM correlation values, established through k-means clustering, resulted in a classification accuracy of less than 7% within a ± 6-pixel error range. Given these limitations, the study transitioned to deep learning techniques for more precise identification of coating interfaces.
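The thresholding baseline mentioned above can be approximated as follows; taking the midpoint between the two k-means cluster centers as the threshold is an assumption, since the exact derivation is not detailed in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def correlation_threshold(corr_values: np.ndarray) -> float:
    """Derive a two-class threshold from per-crop GLCM correlation values
    via k-means with two clusters (illustrative sketch)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    km.fit(corr_values.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    return float(centers.mean())  # midpoint between the two cluster centers
```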

AE2 was designed to estimate interface locations and autonomously generate datasets for training the CNN-based detection models, eliminating the need for manual labeling. The uncoated nozzle parts had already been classified by AE1, enabling the immediate extraction of segmented images of uncoated surfaces. The input height of the segmented images was set to 8 pixels via a grid search, and 512 × 8 pixel images were cropped in the vertical direction with a 1-pixel stride from the uncoated nozzle parts. As for AE1, the Adam optimizer was employed with a learning rate of 0.001, the MSE loss function was applied, and the model was trained for 100 epochs. Figure 7 outlines the configuration of AE2, which was determined through a grid search.

Fig. 7 AE Model 2 configuration

In a manner analogous to AE-based anomaly detection (Yun et al., 2023a, 2023b), interface locations were estimated at the regions displaying the highest reconstruction errors during prediction on a coated nozzle part. The mean average reconstruction error was 0.515, the mean maximum reconstruction error was 1.972, and the mean standard deviation of the reconstruction error was 0.226. AE2 located the interfaces with an accuracy of 84.38%. This study defined the success criterion as the capability to pinpoint the interface location in an image within a ± 6-pixel margin of error, corresponding to the standard error range of ± 0.127 mm (± 0.005 inches). The reference interface locations were determined by manual visual examination.
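The interface estimation step can be sketched as below, assuming a trained AE2 (`ae2`) and a normalized grayscale ROI tensor: crops of height 8 are slid vertically with a 1-pixel stride, and the crop with the largest reconstruction error marks the estimated interface row. The function name is a placeholder.

```python
import torch

@torch.no_grad()
def estimate_interface(ae2: torch.nn.Module, image: torch.Tensor,
                       crop_h: int = 8, stride: int = 1) -> int:
    """Return the top row index of the crop with the largest MSE
    reconstruction error under a trained AE2 (illustrative sketch).

    image: tensor of shape (1, H, W), e.g. a normalized nozzle ROI.
    """
    ae2.eval()
    _, height, _ = image.shape
    tops = list(range(0, height - crop_h + 1, stride))
    # Stack all crops into one batch of shape (N, 1, crop_h, W)
    crops = torch.stack([image[:, t:t + crop_h, :] for t in tops])
    errors = ((ae2(crops) - crops) ** 2).mean(dim=(1, 2, 3))  # per-crop MSE
    return tops[int(errors.argmax())]
```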

CNN-based detection model—improved interface region detection

To develop a CNN-based detection model, partial images of 512 × 8 pixels were extracted from the coated nozzles with a 1-pixel stride. Based on the interface location estimated by AE2, the upper and lower sides of the interface were categorized autonomously as coated and uncoated image sets, respectively. Figure 8 depicts the configuration of a CNN-based detection model, determined through a grid search. The model used the Adam optimizer with a learning rate of 0.0001 and the cross-entropy loss function. The training process included early stopping: training was terminated when the training loss fell below 0.0001.
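Given an estimated interface row, the self-labeling step can be sketched as follows. The class encoding and the choice to skip crops straddling the interface are assumptions made for illustration; the upper/lower split follows the description above.

```python
import torch

def self_label_crops(image: torch.Tensor, interface_row: int,
                     crop_h: int = 8, stride: int = 1):
    """Split crops of a coated-nozzle image into 'coated' (above the estimated
    interface) and 'uncoated' (below it), per the description above."""
    _, height, _ = image.shape
    coated, uncoated = [], []
    for top in range(0, height - crop_h + 1, stride):
        crop = image[:, top:top + crop_h, :]
        if top + crop_h <= interface_row:       # crop lies entirely above the interface
            coated.append((crop, 1))            # label 1: coated (assumed encoding)
        elif top >= interface_row:              # crop lies entirely below the interface
            uncoated.append((crop, 0))          # label 0: uncoated (assumed encoding)
        # crops straddling the interface are skipped in this sketch
    return coated, uncoated
```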

Fig. 8 Configuration of the reference CNN-based detection model & detection outline

The initial CNN-based detection model achieved 90.00% accuracy within a ± 6-pixel error range. Further training, both with and without transfer learning, was conducted on new datasets generated from the interface locations estimated by the prior CNN-based detection model. This process was repeated, resulting in 13 CNN-based detection models, as described in Fig. 9. No layers were frozen during the transfer learning processes. Overall, the models with transfer learning exhibited superior performance, as shown in Fig. 10. The best model, CNN6_T, achieved 95.33% accuracy within the ± 6-pixel error range.
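The iterative strategy can be summarized as a loop in which each new model is warm-started from the previous model's weights (no layers frozen) and trained on the dataset relabeled by that previous model; `build_dataset` and `train_one_model` are placeholder helpers, not functions from the original implementation.

```python
import copy

def iterative_transfer_learning(model, build_dataset, train_one_model, n_rounds=6):
    """Placeholder sketch of the iterative scheme: each round relabels the data
    with the current model and warm-starts the next model from its weights."""
    models = [model]
    for _ in range(n_rounds):
        dataset = build_dataset(models[-1])        # relabel crops with the latest model
        next_model = copy.deepcopy(models[-1])     # transfer learning: reuse all weights
        # No layers are frozen; every parameter remains trainable
        train_one_model(next_model, dataset, lr=1e-4)
        models.append(next_model)
    return models
```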

Fig. 9 Pipeline of transfer and non-transfer learning

Fig. 10 CNN-based detection model interface detection accuracy & speed comparison (higher is better)

CNN-based detection model + YOLO—improved detection speed

The CNN-based detection models, especially CNN6_T, demonstrated high accuracy, yet their detection speed was approximately 1.5 to 1.8 images per second, as shown in Fig. 10. Because the developed framework executed detections from the bottom of a part image with a 1-pixel stride, numerous detections were needed to ascertain the estimated interface location. This is a typical speed issue in object detection when a sliding stride is used. To address speed constraints in object detection, CNN-based algorithms such as Region-based Convolutional Neural Networks (R-CNN) (Girshick et al., 2014), Fast R-CNN (Girshick, 2015), and Faster R-CNN (Ren et al., 2015) have been developed.

You Only Look Once (YOLO) (Redmon et al., 2016) is a one-stage real-time object detection algorithm that provides a balanced and acceptable combination of speed and accuracy (Terven & Cordova-Esparaza, 2023). A study (Kim et al., 2020) comparing Faster R-CNN, YOLO, and the Single Shot MultiBox Detector (SSD) (Liu et al., 2016) demonstrated that while SSD was the fastest, its F1-score, precision, recall, and mAP were up to 11% lower. In contrast, YOLO exhibited the best performance at a speed only 20% slower than SSD. Consequently, YOLO was chosen for this investigation, which requires such a balance of speed and accuracy.

Although employing YOLO alone as the initial approach yielded 0% accuracy for pinpointing interfaces within a few pixels of error, it showed reasonable accuracy in identifying broader interface regions. To harness this advantage, this study devised a two-step prediction strategy that integrates a YOLO model with the CNN-based prediction model, aiming to accelerate prediction while maintaining high accuracy: YOLO first suggests a probable interface area, followed by a more focused CNN-based prediction within the narrowed-down region.

The YOLOv8n model was used to propose a region containing the interface with the default confidence threshold of 0.25, and the CNN6_T model predicted within the proposed region. The datasets for training the YOLO model were extracted with a height of 128 pixels around the same interface locations used for developing the CNN6_T model. Sample interface regions proposed by the trained YOLO model are depicted in Fig. 11. As Fig. 10 demonstrates, detection with the region proposal from the trained YOLO model was approximately 4 times faster than the other configurations, reaching 7.18 images per second while maintaining the highest accuracy of 95.33%.
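A sketch of the two-stage pipeline using the ultralytics API is given below; the weights path and the `cnn_refine` callback are placeholders, while the 0.25 confidence threshold follows the text.

```python
from ultralytics import YOLO

def two_stage_interface(image_path: str, cnn_refine,
                        yolo_weights: str = "interface_yolov8n.pt"):
    """Stage 1: YOLOv8n proposes a coarse interface region.
    Stage 2: the CNN-based detector scans only within that region."""
    yolo = YOLO(yolo_weights)  # assumed: trained on 128-pixel-high interface strips
    result = yolo.predict(source=image_path, conf=0.25, verbose=False)[0]
    if len(result.boxes) == 0:
        return cnn_refine(image_path)  # fall back to a full-image scan
    # Take the highest-confidence box and refine only inside its vertical extent
    x1, y1, x2, y2 = result.boxes.xyxy[result.boxes.conf.argmax()].tolist()
    return cnn_refine(image_path, row_range=(int(y1), int(y2)))
```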

Fig. 11 Sample results by YOLOv8 detections

Results and discussions

Annotating datasets is inevitable for supervised learning and entails human intervention; hence, additional resources are consumed to build a model for an inspection application. The proposed framework overcame this issue by implementing multi-stage deep learning strategies. Without pre-defined datasets, manual labeling, or extensive feature engineering, it pinpointed the interfaces of the coated nozzles and categorized the types of nozzles starting from a single uncoated nozzle. Consequently, this framework automated the inspection process thoroughly. In this section, the explanation and interpretation of the developed models are described based on t-distributed stochastic neighbor embedding (t-SNE) and integrated gradient methods to validate the reliability of the models. Moreover, the detection results of the CNN-based detection models are analyzed.

Figure 12 presents the latent space of AE2, the most compact representation of the partially cropped images, corresponding to the three areas of uncoated, coated, and interface regions. It interprets the phenomena in the latent space and elucidates how AE2 discriminated between coated and uncoated surfaces and identified the interface. Inputs to AE2 were partially cropped images (512 × 8 pixels) compressed into latent space features via the encoder. Subsequently, each latent space feature, represented by a 16-dimensional vector per the AE2 architecture, was further reduced to a two-dimensional representation using t-SNE (Maaten & Hinton, 2008). The t-SNE algorithm calculates pairwise similarities between data points in the high-dimensional space, assigning higher probabilities to pairs of points that are close together and lower probabilities to those further apart. It then minimizes the Kullback–Leibler (KL) divergence between the high-dimensional and low-dimensional distributions of pairwise similarities, effectively placing similar objects close together and dissimilar objects far apart in the lower-dimensional space; this optimization problem is solved through gradient descent.
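The latent-space visualization can be reproduced with scikit-learn's TSNE applied to the 16-dimensional encoder outputs; the `encode` method name is an assumption about the AE2 implementation, and the perplexity value is the library default.

```python
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def latent_tsne(ae2, crops: torch.Tensor, perplexity: float = 30.0):
    """Embed 16-D AE2 latent features of 512 x 8 crops into 2-D with t-SNE.

    crops: tensor of shape (N, 1, 8, 512); ae2.encode is an assumed method
    returning the latent vectors.
    """
    ae2.eval()
    latent = ae2.encode(crops).cpu().numpy()   # (N, 16) latent features
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0)
    return tsne.fit_transform(latent)          # (N, 2) points ready for plotting
```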

Fig. 12 AE2 latent space visualization result for a sample coated image (cropped) based on t-SNE

This reduction enables each partially cropped image to be represented as a single point in a two-dimensional plane, visualizing the latent space of AE2 for 1017 partial images extracted from a single image. The resulting visualization displays clusters within the latent space of AE2 corresponding to the uncoated, coated, and interface regions. Remarkably, the interface area, despite its adjacency to the uncoated region in the original image, appears farther from the uncoated area in the latent space learned by AE2 during training. This consistent pattern across multiple sample images, as depicted in Fig. 13, affirms the ability of AE2 to estimate interfaces.

Fig. 13 AE2 latent space visualization results for sample coated images (cropped)

The initial CNN-based detection model, CNN1, trained on the AE2-generated dataset, yielded an accuracy of 90.00%. Accuracy increased to 95.33% via iterative learning with transfer learning and ultimately stabilized at 94.67%, as shown in Fig. 10. Conversely, iteration without transfer learning precipitously reduced accuracy from 91.33% to 69.00% by the third cycle. Figure 14 shows precision, recall, and accuracy for surface classification, all surpassing 98.00% with transfer learning, whereas precision and accuracy plummeted without it. With transfer learning, precision, recall, and accuracy became stable from CNN4_T onwards in both the training and testing datasets, suggesting that their consistency contributed to the convergence of interface detection accuracy, as similar training datasets were utilized from CNN4_T. In contrast, without transfer learning, the variation in training datasets, combined with not reusing pre-trained network weights, led to inconsistent results. In particular, there are metric discrepancies between surface classification and interface localization. Given the objective of this study, the detection accuracy of interface locations was deemed the most vital metric.

Fig. 14 Detection precision, recall, and accuracy of surface classification

Figure 15 shows the mean error, defined as the discrepancy between the interface locations estimated by the CNN-based detection models and the actual interface locations on the coated training parts. The blue line represents the mean error for cases in which the predicted interface location was above the actual interface, the red line represents the opposite case, and the black line shows the overall mean error. Notably, with iterative learning without transfer learning from CNN1 to CNN2, the red-line error grows from 6.37 pixels to 8.90 pixels, while the blue and black lines remain consistent. Given that the height of the partially cropped images was 8 pixels, it is inferred that dataset labeling along the interfaces in the red-line case for CNN2 was inaccurate, since the error exceeded 8 pixels. Consequently, numerous coated part images estimated by CNN2, CNN3, and CNN4 included uncoated surfaces, causing a continuous decrease in accuracy. These models, trained without transfer learning and hence without any pre-trained weights or biases, were built with high errors and deteriorating accuracy during training. In contrast, the models utilizing transfer learning displayed convergence in mean error as they were built upon a pre-established network with pre-trained weights and biases, reducing vulnerability to misclassified images.

Fig. 15 Detection mean error from the actual interface location (for coated train part)

To interpret and validate the reliability of the optimal CNN-based detection model, CNN6_T, integrated gradients were used to generate heatmaps, and the deletion metric was applied (Qi et al., 2020; Selvaraju et al., 2020). In the jet colormap, red signifies key pixels for uncoated class detection, and blue represents those crucial for coated areas. Figure 16 depicts heatmap examples of correctly predicted images, and Fig. 17 illustrates heatmaps for correctly predicted images with complex interface geometry. The black and white lines indicate the actual and estimated interface locations, respectively. The highlighted heatmaps confirm the clear separation between the two classes along the interface, even in instances of complex geometry with diverse shapes, as in Fig. 17a, b, or blurred boundaries, as in Fig. 17c, that make interface identification challenging. Furthermore, heatmap examples of incorrectly predicted images are displayed in Fig. 18. Figure 18a suggests excessive light reflection obscuring the interface feature, Fig. 18b demonstrates the failure of the detection model to extract features near the interface, and Fig. 18c shows an incorrect detection due to the existence of multiple candidate interface locations. Therefore, future research will involve deep learning models with additional layers and configurations to resolve these issues and enhance accuracy.
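The attribution heatmaps can be generated, for example, with Captum's IntegratedGradients (one possible implementation; the paper does not name a specific library), using a zero image as the baseline, which is an assumption here.

```python
import torch
from captum.attr import IntegratedGradients

def attribution_heatmap(model: torch.nn.Module, crop: torch.Tensor, target_class: int):
    """Integrated-gradients attribution for one 512 x 8 crop (Captum is one
    possible implementation; a zero baseline and 50 steps are assumed).

    crop: tensor of shape (1, 1, 8, 512); target_class: 0 or 1.
    """
    model.eval()
    ig = IntegratedGradients(model)
    attributions = ig.attribute(crop,
                                baselines=torch.zeros_like(crop),
                                target=target_class,
                                n_steps=50)
    return attributions.squeeze().cpu().detach().numpy()  # heatmap-ready array
```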

Fig. 16 Heatmap examples of correctly predicted images

Fig. 17 Heatmap examples of correctly predicted images having complicated geometry of interface

Fig. 18 Heatmap examples of incorrectly predicted images

The deletion metric was implemented to examine the effect of the weight of each pixel in the CNN6_T detection model. As in Fig. 19, pixels with high weights were deleted first according to a specified deletion ratio, with zero-weighted pixels being disregarded during this operation. Figure 20 illustrates that the detection accuracy declined steeply upon removal of the important pixels. This observation validates the proper construction of the CNN6_T detection model and its weighting of pixels during interface detection. Furthermore, the area under the curve (AUC) was assessed, yielding a score of 0.1135 and confirming the reliability of the developed predictive model according to the reference (Petsiuk et al., 2018).
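The deletion metric can be computed as sketched below, reusing the attribution map from the previous block: pixels are ranked by attribution magnitude, progressively zeroed in fixed fractions, the target-class probability is recorded at each step, and the area under the resulting curve is reported (lower indicates more faithful attributions). The step count and ranking by absolute attribution are assumptions.

```python
import numpy as np
import torch

@torch.no_grad()
def deletion_auc(model, crop: torch.Tensor, attributions: np.ndarray,
                 target_class: int, steps: int = 20) -> float:
    """Deletion metric: zero out the most important pixels first and track the
    drop in the target-class probability; return the area under that curve."""
    model.eval()
    flat_order = np.argsort(-np.abs(attributions).ravel())  # most important first
    scores = []
    for k in range(steps + 1):
        erased = crop.clone().view(-1)
        n_del = int(k / steps * flat_order.size)
        idx = torch.from_numpy(flat_order[:n_del].copy())
        erased[idx] = 0.0                                    # delete top-ranked pixels
        prob = torch.softmax(model(erased.view_as(crop)), dim=1)[0, target_class]
        scores.append(prob.item())
    return float(np.trapz(scores, dx=1.0 / steps))           # AUC over deletion ratio
```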

Fig. 19 Pixel deleted sample images

Fig. 20 Result of deletion metric

Lastly, a YOLO model was employed to improve the detection speed for industrial applications. The developed CNN-based detection model predicted from the bottom of an image with a 1-pixel stride, which resulted in a slow detection speed of 1.5 to 1.8 images per second. Applying YOLO alone at first showed 0% accuracy for detection within the ± 6-pixel error range. Therefore, a two-stage detection pipeline was devised in which a YOLO model proposes a broad interface region and the CNN model predicts from the bottom of the proposed region. Even though YOLO alone could not accurately predict the interface within the error range, it successfully proposed a broad interface region, which led to an up to 4 times faster detection speed of 7.18 images per second without a reduction in accuracy.

Conclusion

This study proposed a framework for distinguishing the types of fuel injection nozzles and locating the interfaces between coated and uncoated surfaces with autonomous data annotation. While computer vision techniques have advanced the visual inspection domain, elaborate algorithms are still necessary to detect complex objects; the features of the fuel injection nozzles varied across parts and even across viewing angles. As a result, this research implemented multi-stage deep learning strategies to develop an automated visual inspection method. By combining autoencoder and CNN configurations, it addressed the challenges of coating inspection, classifying the coated and uncoated nozzles and pinpointing the interface locations. Furthermore, the application of YOLOv8 improved the detection speed of the CNN-based detection model. Finally, the interpretation and explanation of the deep learning models were described to validate their robustness.

Nonetheless, the framework still has limitations. Although the GLCM approach identified coated and uncoated nozzles without pre-defined datasets, it required at least two parts to be distinguished for developing an AE model. Additionally, this approach is not applicable to parts with more than two surfaces. The detection speed still requires optimization to exceed 30 images per second for real-time applications. Lastly, this framework has not been validated on images from different sources and environments. Therefore, future research will focus on the adaptability of the framework to other coatings, materials, various inspection conditions, and intricate geometries, along with further enhancements in detection speed and accuracy. A more affordable image acquisition setup will also be investigated.

This framework is then expected to be implemented in a wide range of industries such as aerospace, automotive, and electronics manufacturing, where surface features play an important role in enhancing the performance, durability, and protection of components. It is anticipated that this investigation provides a practical solution for real-time industrial applications, given the reasonable detection accuracy and speed achieved with less effort in training the deep learning models. In addition, the ability to generate training datasets with autonomous data annotation could lead to cost savings, increased efficiency, and reduced potential for human error. Consequently, this framework is expected to alleviate cost and reliability issues in the inspection process of manufacturing domains.