Introduction

Recently, with the development of electronic, optoelectronic, and power semiconductor devices, the demand for ultra-high-quality silicon wafers has been continuously rising. In particular, insulated gate bipolar transistors (IGBTs) have received significant attention for their promising potential in electric vehicles, owing to their relatively high-speed switching characteristics and excellent compatibility with silicon-based systems (Kajiwara et al., 2019). The silicon wafers used in IGBTs are required to be free of as-grown defects, oxide precipitates, and thermal donors (Hourai et al., 2019). As the first step in wafer fabrication, crystal growth must therefore meet increasingly high requirements on quality, cost, and productivity. Two main techniques are used for producing monocrystalline ingots from polysilicon: Czochralski (CZ) crystal growth and Float-Zone (FZ) crystal growth. In the CZ process, a single-crystal ingot is grown by pulling an oriented single-crystal seed from the polysilicon melt contained in a quartz crucible, whereas in the FZ process, the single crystal is grown by feeding the polysilicon downward through an induction heater to form a molten zone from which a seed crystal grows. Although the CZ process currently holds the largest market share due to its relatively low cost (Prakash et al., 2019), CZ silicon has an unavoidable drawback: a high incorporation of oxygen from the silica crucible into the melt, resulting in the generation of oxide precipitates and thermal donors (Kajiwara et al., 2019). Compared to the CZ process, the crucible-free FZ process makes it possible to produce silicon crystals of higher purity with much lower impurity concentrations, in particular a low oxygen concentration (below \(10^{16}\) \(\hbox {cm}^{-3}\) (Mullins et al., 2018)). Therefore, FZ single-crystal silicon is typically used in high-efficiency solar cells and high-power devices where purity requirements are stringent.

However, an oxide layer, which acts as a contaminant in the FZ process, may occasionally form on the unmelted polysilicon surface, appearing as a brighter region (or a darker region in the case of ghost curtain) from the perspective of the FZ vision system, as shown in Fig. 1. The oxidation may occur at the very beginning of the FZ process; however, it becomes more evident only when a portion of the oxide melts away as it approaches the inductive heater, exposing the substrate polysilicon underneath the oxide, which appears darker in the image. On the one hand, the oxide, acting as a contaminant, may threaten product quality. Considering that the FZ process is a batch process lasting 10–20 h, the formation of the oxide poses a risk of wasting production resources, including time and energy. On the other hand, the formation of oxide indicates a degradation of the FZ machine's ability to keep the FZ chamber sealed. Therefore, it is necessary to optimize both the FZ process and the FZ machine by investigating oxidation in the following steps: identification of the oxide, root cause analysis of the oxide, and automatic responses to the oxide. This study constitutes the first step of this oxidation investigation, aiming to provide automatic identification and classification of the oxides. At present, there is only one study related to the monitoring of the FZ process with respect to the surface characteristics of polysilicon (Chen et al., 2022), but it aims only at identifying anomalies on the polysilicon surface. Therefore, this paper is the first to undertake a comprehensive investigation of FZ process monitoring by identifying and classifying oxide on the polysilicon surface.

Fig. 1

Polysilicon images from different FZ crystal growth runs. Note that in the upper left corner, confidential information regarding process parameters and machine details is concealed with a black rectangle: a polysilicon surface with homogeneous intensity under normal conditions b polysilicon surface with a high-contrast appearance under oxidation conditions. The brighter region is the oxide, while the darker region is the polysilicon underneath the oxide after it melts away as it approaches the inductive heater

The oxide can exhibit different characteristics, including spot, shadow, and ghost curtain, and these characteristics may occur in combination. The reason for the occurrence of the different oxide types is currently unknown, and manual identification of oxide is still the norm, which is time-consuming and inefficient. In this study, we took advantage of FZ images captured from an FZ vision system built for controlling the FZ process and developed an automated oxide classification solution. Since different oxides may form simultaneously, the classification task becomes a multi-label oxide classification problem.

Multi-label classification (MLC) is an extension of single-label classification, in which only one class is associated with each image (Zhang and Zhou, 2014). Recently, MLC has gained increasing importance since a single label is generally not sufficient for real-world applications (Cevikalp et al., 2020). MLC data typically pose several challenges, such as data imbalance and label dependency (Braytee et al., 2019). Existing methods for multi-label classification fall into two prevailing categories: problem transformation methods and algorithm adaptation methods (Zhang and Zhou, 2014). Problem transformation methods fit the data to the algorithm by transforming a multi-label problem into a single-label or regression problem, such as Binary Relevance (Boutell et al., 2004) and Label Powerset (Tsoumakas et al., 2011). Algorithm adaptation methods instead fit the algorithm to the data by adapting learning techniques to handle multi-label data, for instance, multi-label K-nearest neighbor (ML-KNN) (Zhang and Zhou, 2007). Conventionally, image classification starts with feature extraction, which involves finding a suitable image descriptor to transform an array-like image into a vector, thus enabling subsequent analysis by a classifier. Recently, Convolutional Neural Networks (CNNs) have gained significant attention in image classification due to their powerful capability for feature extraction and pattern learning. A CNN is a type of artificial neural network consisting of multiple convolutional layers and pooling layers for downsampling, which allows it to automatically capture patterns from grid-like data. However, due to the large number of model parameters to be optimized, CNNs require a large amount of training samples, which poses difficulties for multi-label datasets because of the high annotation burden (Wei et al., 2016) as well as their intrinsically long-tailed distributions (Guo and Wang, 2021). Transfer learning has emerged as a solution to this data demand. It has been observed that the features that many deep neural networks learn from large-scale data appear to be general, so they are applicable to similar or even dissimilar tasks (Yosinski et al., 2014; Neyshabur et al., 2020). The main concept of transfer learning is therefore to repurpose features learned on large-scale data and transfer them to a target dataset by copying the first n layers of a pre-trained network into the first n layers of the target network (Yosinski et al., 2014). Recent research demonstrates that initializing with transferred features can improve generalization performance on a new task (Yosinski et al., 2014). On top of pre-trained CNNs, current research on multi-label classification can be categorized into three directions: (1) object localization assisted MLC, (2) label correlation optimization based MLC, and (3) loss optimization based MLC.

Object localization-assisted MLC utilizes object localization information to help improve multi-label classification performance (Ge et al., 2018). However, this approach requires not only category information but also localization information (Li et al., 2022; Zhang et al., 2021) or segmentation information (Sampath et al., 2023), further increasing the annotation burden. In addition, it usually involves two steps, localization and classification, and therefore suffers from a high computational cost. Label correlation learning-based MLC aims to model label correlations to gain additional performance in MLC, using Graph Convolutional Networks (GCNs) (Chen et al., 2019) or Recurrent Neural Networks (RNNs) (Wang et al., 2016). However, GCNs normally need manually defined adjacency matrices (Zhu and Wu, 2021), while the effectiveness of RNNs in multi-label tasks remains to be proven (Zhu and Wu, 2021). Loss optimization-based MLC designs a loss function adapted to the multi-label task that tackles the data imbalance problem in multi-label datasets (Ridnik et al., 2021). This approach is a relatively simple solution that requires neither a carefully designed architecture nor additional external information (Ridnik et al., 2021). Given that our task involves only a few classes, we adopted loss optimization-based MLC.

In spite of the superior performance achieved by CNNs, they are still considered a black box due to their lack of decomposability (Selvaraju et al., 2020). Consequently, they cannot explain why they succeed or why they fail in specific cases. This poses a trust problem for CNN-based intelligent systems, especially in industrial domains where reliability with low error tolerance is required (Mohamed et al., 2022). Therefore, in order to adopt CNN-based models in industry, it is necessary to enhance their transparency and to explain the decisions behind their predictions. Several studies (Selvaraju et al., 2020; Simonyan et al., 2013; Zeiler and Fergus, 2014; Zhou et al., 2016) have proposed to visualize CNN predictions by highlighting the pixels that contribute the most to the prediction, thus fostering a better understanding of the model's decisions. These approaches can be broadly categorized into perturbation-based methods and gradient-based methods (Ivanovs et al., 2021). Perturbation-based methods make small changes to the input data and measure how these changes affect the model's output. This technique was applied to CNNs by Zeiler and Fergus (2014), who proposed Deconvnet to project feature activations back into the input space by reconstructing the input from its output. SHAP (SHapley Additive exPlanations) (Castro et al., 2009) is also a perturbation-based method, in which an additive feature importance score is computed for each prediction by considering all possible feature combinations. However, these methods tend to be computationally intensive as the number of features grows (Ancona et al., 2018) due to the need to predict on perturbed data. Compared to perturbation-based methods, gradient-based methods are computationally efficient because they only require backpropagation to calculate gradients (Ivanovs et al., 2021). Simonyan et al. (2013) visualize CNNs by extracting saliency maps using gradients of the output with respect to the input. However, this method is not class-discriminative (Selvaraju et al., 2020). Zhou et al. (2016) address this limitation by using global average pooling to produce Class Activation Mapping (CAM). CAM improves the accuracy of the heat maps; however, it requires the neural network to have a specific architecture, typically a series of convolutional and pooling layers followed by a global average pooling layer and a final fully connected classification layer (Zhou et al., 2016). Grad-CAM (Selvaraju et al., 2020) is an extension of CAM that removes this architectural restriction by using gradient information directly to calculate importance scores. Therefore, Grad-CAM is employed to visualize the decisions of the CNNs in this study.

The main contributions of the present work can be summarized as follows.

  1.

    Novel characterization of the surface texture of polysilicon in the FZ process. The information gained can be used for FZ process monitoring and lays the foundation for future root cause analysis.

  2.

    Multi-label oxide classification to address the oxide problem, a challenge unique to the FZ process. To address the problems of limited dataset size and data imbalance, transfer learning and asymmetric loss were leveraged in this study. In addition, we investigate in depth the effectiveness of asymmetric loss compared with the conventional binary cross-entropy loss by controlling the levels of the individual components of the asymmetric loss.

  3.

    Grad-CAM (Selvaraju et al., 2020) is applied to explain the decisions made by the models, thus enhancing trust for the subsequent integration in industry. In addition, the results suggest its applicability to localizing the oxide regions.

The remainder of the paper is organized as follows. Section 2 introduces the FZ process and the characterization of the surface texture of polysilicon in the FZ process. Section 3 illustrates the proposed methodology. Sections 4 and 5 present the experimental setup and the experimental results and discussion, respectively. The paper closes with a summary and conclusion in Sect. 6.

Oxide in float-zone silicon crystal growth process

Float-zone silicon crystal growth process

The FZ process is a crucible-free crystal growth method that converts polysilicon into monocrystalline silicon with high resistivity. As shown in Fig. 2, inside the FZ chamber, the polysilicon and the seed crystal are held by the polysilicon holder and the crystal holder, respectively. In the FZ process, the inductor melts the bottom of the polysilicon, and crystal growth starts from the seed crystal at this molten zone. Optical access to the FZ process is provided by a quartz window, through which the FZ vision system acquires FZ images and carries out image analysis, for instance, geometrical measurements for controlling the FZ process.

Fig. 2

The schematic of the FZ process. Inside the FZ chamber, the polysilicon and seed crystal are held by the polysilicon holder and the crystal holder, respectively. During the FZ process, the polysilicon and seed crystal are moved downwards while being rotated in opposite directions to stabilize the melt flow. Meanwhile, a vision system monitors the FZ process through a quartz window

The entire FZ process mainly consists of six phases: 1) melt drop phase, 2) feed tip phase, 3) neck phase, 4) cone phase, 5) cylinder phase and 6) closing phase (Werner, 2014). Oxide is most likely to be observed from the beginning of the cone phase.

Oxide characterization

During the crystal growth process, the polysilicon is supposed to be heated homogeneously, and its surface as seen by the vision system is expected to appear homogeneous as well. However, during the process, anomalies might occur on the polysilicon surface, which is reflected in the FZ images captured by the vision system as a relatively high-contrast region on the polysilicon surface.

In order to understand the inhomogeneity of the polysilicon surface, samples of cooled polysilicon with high-contrast regions were analyzed with Energy-Dispersive X-ray (EDX) spectroscopy for elemental analysis. It turned out that the oxygen content in the brighter region of the FZ image was much higher than in the reference sample and in the darker region. Hence, it was deduced that the particularly bright region in the FZ image is an oxide layer. The dark region near the melt edge shown in Fig. 1(b), with its lower oxygen content, is most likely the substrate polysilicon. This can be explained by the dissolution reaction \(\hbox {SiO}_2 + \hbox {Si} \rightarrow 2\hbox {SiO}\), which proceeds at the relatively high temperature near the inductive heater; SiO readily evaporates at temperatures above \(1100\,^{\circ }\hbox {C}\) (Ammon, 2004). The oxide is likely generated from water vapor within the FZ chamber, originating either from remaining residuals or from air entering through a chamber leak.

Furthermore, the oxides can exhibit different characteristics. Based on experimental observation of the FZ process, the oxide can be mainly categorized into three classes: spot, shadow, and ghost curtain, as seen in Fig. 3.

A spot is a point-like melted oxide on the polysilicon surface that grows into more dots as it moves closer to the inductive heater. Shadow is the most frequently observed type in the FZ process. A shadow appears as a regionally melted oxide with a relatively large area and a smoother edge than a spot, and it usually shows relatively low contrast, implying that the oxide in a shadow could be thinner than that in a spot. In the case of a ghost curtain, the oxide is separated but not entirely detached from the substrate polysilicon. Normally, the ghost curtain appears darker than the other oxide types.

Among all oxide types, ghost curtain is the most serious case, since the peeling layer of the ghost curtain is likely to drop off, potentially causing machine breakdown or product contamination. Furthermore, since the ghost curtain is an extra layer that can distort the apparent shape of the polysilicon, it may introduce noise into geometrical measurements, such as the polysilicon diameter, which is an important input for modeling the behavior of the FZ process (Muiznieks et al., 2015). This can further degrade the control performance of the FZ process and thus the final product. Spot and shadow anomalies, in contrast, sometimes eventually disappear and the surface becomes homogeneous again as they move downward and are heated. Although the possible influence of these two classes on crystal growth is unknown, the presence of a high oxygen content may disturb the formation of a single crystal in the FZ process (Richter et al., 2014).

Fig. 3

Oxide types: a spot, point-like melted oxide b shadow, regionally melted oxide c ghost curtain, oxide that is separated but not entirely detached from the substrate polysilicon

Additionally, the frequent presence of the oxides indicates that the atmosphere inside the FZ chamber is contaminated by the introduction of oxygen. This implies that the performance of the FZ machine is in fact declining and that machine maintenance is needed. Therefore, it is necessary to optimize both the FZ process and the FZ machine by investigating oxidation in the following steps: identification of the oxide, root cause analysis of the oxide, and automatic responses to the oxide. This study is the first step of this oxidation investigation, aiming to provide automatic identification and classification of oxides, thus enabling the subsequent investigation. Since the oxides typically occur in the cone phase, the FZ images captured at the beginning of the cone phase were selected.

Oxide classification

Problem setup and dataset description

The purpose of this study is to predict a set of oxide types given an FZ image \(X \in {\mathbb {R}}^{m\times n}\). The prediction for \(X\) is a set of k binary labels \(\{y_1,y_2,\ldots ,y_k\}\), \(y_i \in \{0,1\}\). Class i is relevant to X when \(y_i=1\). The normal case is denoted as a vector filled with zeros, meaning that none of the oxide classes is activated. Therefore, a classifier f is required to map an unlabeled image to a set of labels: \(\hat{{{\textbf {y}}}} = f(X)\).
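As an illustration of this label encoding, a minimal sketch is given below, assuming the three oxide classes defined above (the class order and example annotations are ours, chosen for illustration):

```python
import numpy as np

# Assumed class order for illustration; any fixed order works.
CLASSES = ["spot", "shadow", "ghost_curtain"]

def encode_labels(present_types):
    """Map a set of annotated oxide types to a binary target vector y."""
    return np.array([1 if c in present_types else 0 for c in CLASSES], dtype=np.float32)

# Hypothetical annotations:
print(encode_labels({"spot", "shadow"}))  # [1. 1. 0.] -> two oxide types co-occur
print(encode_labels(set()))               # [0. 0. 0.] -> oxide-free (normal) image
```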

In this paper, a total of 2143 images were obtained from 6 different FZ machines during a two-year period from 2019 to 2021. These machines produce various types of products, for instance, products with different sizes or distinct crystal structures. Nevertheless, there is little difference in polysilicon appearance under normal or oxidation conditions among the different FZ machines as observed by the FZ vision system. Therefore, machine variability is not considered in this study. Each image represents one production run and was gathered at the beginning of the cone phase, where the crystal diameter is around 21 mm. This time point was chosen because the oxides become clearly visible from this point onwards. Early identification of oxides allows corrective actions to be taken, whereas later detection risks missing spots and shadows, which may eventually disappear as they move downwards. Since the ultimate purpose of this study is to find out the reasons for emerging oxides, it is desirable to collect as much oxide data as possible before the oxides disappear. Finally, all images were annotated with oxide types according to their characteristics.

Fig. 4

Co-occurrence matrix of the oxide dataset. Data imbalance is observed in the dataset in two aspects: within a class and between classes

Figure 4 shows the co-occurrence matrix of oxide types. Here, we denote the instances belonging to a specific class c as positive samples of class c and the instances that do not exhibit that class as negative samples of class c. In addition, in the context of a classification problem, 'easy' and 'hard' are commonly used to describe how difficult it is for a model to correctly predict a sample. Hard samples typically include images that are of low quality or belong to a rare class, adding a challenge to the decision-making process. In contrast, easy samples are instances with distinct features or from a dominant class that the model can predict confidently. Note that in Fig. 4 the oxide-free process does not dominate the entire dataset, which motivates this study. Figure 4 also clearly shows data imbalance in two respects: within a class and between classes. Data imbalance within a class means the imbalance between positive and negative samples of a specific class. Data imbalance between classes means the imbalance between a dominant class and a rare class. Data imbalance poses difficulties for model training; for instance, a rare class will be 'neglected', since the total loss is dominated by easy samples, in particular easy negative samples. However, as mentioned before, the rare ghost curtain class cannot be neglected, as it significantly affects both the FZ process and the FZ machine. Although the probability of the occurrence of a ghost curtain is quite low compared with other oxides, the resulting economic loss cannot be ignored. Therefore, the data imbalance problem must be addressed in this study.

To address the data imbalance problem, resampling methods are commonly used in single-label classification, by over-sampling the minority class or under-sampling the majority class (Tarekegn et al., 2021). However, resampling may not work in multi-label classification due to label correlation. Another solution is to adopt a specifically designed loss function to mitigate the problem, such as focal loss (Lin et al., 2020) or asymmetric loss (Ridnik et al., 2021). This approach addresses the imbalance by adjusting the loss contributions of hard and easy samples.

Proposed method

Considering the limitations of a small dataset and data imbalance in our case, we propose a multi-label oxide classification approach that takes advantage of transfer learning and the promising performance of asymmetric loss, as shown in Fig. 5 and discussed in detail in the following subsections.

Fig. 5

Overall architecture of the multi-label oxide classification. The features learned on a large-scale dataset are transferred to our network by transfer learning, while asymmetric loss is employed to address the data imbalance problem. To build trust in the proposed model for its application in industry, Grad-CAM is used to enhance the transparency of the model's decision process

Transfer learning

Neural networks are a type of ML technique modeled after the human brain and its structure (Shrestha and Mahmood, 2019). The simplest neural network is composed of an input layer, a hidden layer with many interconnected neurons, and an output layer (Chen et al., 2023). Although a single hidden layer was proven capable of approximating any continuous function (Hornik, 1991), such shallow networks lack the representation ability of the deep neural networks (DNNs) we are familiar with today; moreover, for a long time there was no efficient way to train a DNN (Shrestha and Mahmood, 2019). It is the advent of back-propagation learning (Rumelhart et al., 1986) that enabled the shift from shallow learning to Deep Learning (DL) with multiple hidden layers (Shrestha and Mahmood, 2019). This has enabled DL to learn hierarchical data representations, eliminating the need for manual feature engineering (LeCun et al., 2015) and leading to the prevalence of DL. The rapid development of DL has led to the emergence of various DL architectures, such as the Multi-Layer Perceptron (MLP), CNNs, and RNNs (LeCun et al., 2015). However, with the growing complexity of neural networks, a large number of model parameters must be optimized, which makes the performance of deep networks highly dependent on the scale and quality of the training dataset. Consequently, one may need to gather a substantial amount of data for each specific task and train the model from scratch, which is inefficient and expensive in terms of resources. Transfer learning has emerged as a solution to this data demand. It has been observed that the features that many deep neural networks learn on large-scale datasets appear to be general, and thus applicable to similar or even dissimilar tasks (Neyshabur et al., 2020; Yosinski et al., 2014). The main concept of transfer learning is therefore to repurpose the features learned on large-scale datasets, as well as low-level statistics, and transfer them to a target dataset by copying the first n layers of a pre-trained network into the first n layers of the target network (Neyshabur et al., 2020; Yosinski et al., 2014). One can choose to freeze these layers by keeping their weights unchanged or to fine-tune them by updating their weights with backpropagation during training. On the one hand, transfer learning can speed up training for a specific task with a smaller amount of data; on the other hand, it helps avoid overfitting, as the initial layers have already learned to capture general features. In common practice, transfer learning begins by selecting a base model pre-trained on a large and diverse dataset, and then modifies this model to give good predictions on the target dataset.

Currently, many state-of-the-art CNN models achieve superior performance in computer vision applications. In this case, considering the potential application of real-time detection, we want a model with high performance and low computational cost. Therefore, according to a comparison of popular CNN models (Canziani et al., 2016), ResNet50 and InceptionV3 were chosen as base models. ResNet50 belongs to the family of residual neural networks (ResNet) (He et al., 2016), whose shortcut connections allow training of deeper neural networks with faster convergence. InceptionV3 is an extension of GoogLeNet and is characterized by smaller convolution filters and the addition of an auxiliary classifier (Szegedy et al., 2016).

In this study, we took advantage of the parameters learned from the natural images of the ImageNet database. Although the images in the oxide dataset are visually dissimilar to the source domain, it is shown in (Neyshabur et al., 2020) that we can still benefit from pre-trained weights through fine-tuning, due to the reuse of features and low-level statistics. Therefore, fine-tuning is employed for all models that are trained from pre-trained weights. Originally, each base model trained on ImageNet has 1000 output units in its output layer to classify images into 1000 classes. In order to adapt the model to our task, the original output layer of the base model is replaced by the sequential layers shown in Table 1. To prevent the model from relying heavily on the source domain, a dropout layer is used for regularization, thus helping to avoid over-fitting. In addition, in this study we employed an adaptive learning rate strategy during fine-tuning to achieve better convergence and prevent negative transfer. The one-cycle learning rate policy (Smith and Topin, 2019) is adopted in the experiments; this schedule rapidly increases the learning rate to a maximum for quick adaptation to the target domain and then gradually decreases it to stabilize learning. Furthermore, to mitigate overfitting, we employed image augmentation to increase the amount and diversity of the training data by applying random changes across different batches.
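A minimal PyTorch sketch of this adaptation is shown below; the replacement head (layer sizes, dropout rate) is an illustrative placeholder, not the exact sequential layers of Table 1:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # spot, shadow, ghost curtain

# Load ResNet50 with ImageNet pre-trained weights (feature reuse from the source domain).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the original 1000-way output layer with a small sequential head;
# the hidden size and dropout rate below are illustrative placeholders.
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),            # regularization against over-reliance on the source domain
    nn.Linear(256, NUM_CLASSES),  # one logit per oxide class (sigmoid is applied in the loss)
)

# The entire network is then fine-tuned (no frozen layers), as done for all pre-trained models.
```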

Table 1 Sequential layers used to replace the output layer of the pre-trained models. Backbone features denote the intermediate representation extracted by the base model before the output layer
Fig. 6

The principle of Grad-CAM, taking the ghost curtain class as an example. The gradients of the ghost curtain class with respect to the feature maps of the last convolutional layer are used to generate a heatmap that highlights the important regions of the image

Loss function

Binary cross entropy (BCE) loss is frequently used in multi-label classification. Each entry \(p_i\) of the network's output vector, obtained by applying a sigmoid to the corresponding logit, represents the probability of class i. The total loss is then obtained by aggregating the BCE loss over all labels:

$$\begin{aligned} L_{bce} = - \sum \limits _{i = 1}^{M} y_{i} \log \left( p_{i} \right) - \sum \limits _{i = 1}^{M} \left( 1 - y_{i} \right) \log \left( 1 - p_{i} \right) \end{aligned}$$
(1)

where \(y_i\) is the ground truth of class i and M denotes the number of classes.

To prevent easy negative samples from overwhelming the loss contribution, focal loss (FL) (Lin et al., 2020) was designed to decrease the loss contribution from easy samples and was first applied in object detection. In particular, focal loss introduces a focusing parameter \(\gamma \) to decay the error contributed by easy samples.

$$\begin{aligned} L_{fl} = - \sum \limits _{i = 1}^{M} y_{i} \left( 1 - p_{i} \right) ^{\gamma } \log \left( p_{i} \right) - \sum \limits _{i = 1}^{M} \left( 1 - y_{i} \right) p_{i}^{\gamma } \log \left( 1 - p_{i} \right) \end{aligned}$$
(2)

where \(\gamma \) denotes the focusing parameter, which decreases the loss from easy samples and lets the network focus on the misclassified hard samples.

However, since the same focusing parameter \(\gamma \) is applied to both positive and negative samples, the loss contribution from positive samples is decreased as well. In an imbalanced setting, where positive samples from rare classes represent only a small portion of the dataset, this may eliminate the gradients from rare positive samples (Ridnik et al., 2021). Therefore, asymmetric loss (ASL) (Ridnik et al., 2021) was designed to overcome this problem by decoupling the focusing parameters of positive and negative samples.

$$\begin{aligned} L_{asl} = - \sum \limits _{i = 1}^{M} y_{i} \left( 1 - p_{i} \right) ^{\gamma _{+}} \log \left( p_{i} \right) - \sum \limits _{i = 1}^{M} \left( 1 - y_{i} \right) p_{i-m}^{\gamma _{-}} \log \left( 1 - p_{i-m} \right) \end{aligned}$$
(3)
$$\begin{aligned} p_{i-m} = \max \left( p_{i} - m, 0 \right) \end{aligned}$$
(4)

where \(\gamma _+\) and \(\gamma _-\) denote the focusing parameters for positive and negative samples, respectively, and m is a probability margin used to hard-threshold very easy negative samples. In this work, we therefore chose asymmetric loss as the loss function for updating the model.
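A minimal PyTorch sketch of the asymmetric loss in Eqs. (3)-(4) is given below; with \(\gamma _+=\gamma _-\) and \(m=0\) it reduces to the focal loss in Eq. (2), and with \(\gamma _+=\gamma _-=0\) it reduces to the BCE loss in Eq. (1). Variable names are ours, and numerical clamping is added for stability:

```python
import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=2.0, margin=0.05, eps=1e-8):
    """Asymmetric loss (Eqs. 3-4): decoupled focusing for positive and negative samples.

    logits:  raw network outputs, shape (batch, num_classes)
    targets: binary ground-truth labels, same shape
    """
    p = torch.sigmoid(logits)                  # per-class probabilities p_i
    p_m = torch.clamp(p - margin, min=0.0)     # shifted probability, Eq. (4)

    # Positive term: (1 - p_i)^gamma_+ * log(p_i)
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    # Negative term: p_{i-m}^gamma_- * log(1 - p_{i-m})
    loss_neg = (1 - targets) * p_m.pow(gamma_neg) * torch.log((1 - p_m).clamp(min=eps))

    return -(loss_pos + loss_neg).sum(dim=1).mean()
```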

Visualization by Grad-CAM

Although CNNs can provide promising performance, their lack of transparency makes it difficult to interpret their results. This poses a trust problem for CNN-based intelligent systems, especially in industrial domains where reliability with low error tolerance is required (Mohamed et al., 2022). Therefore, in order to adopt CNN-based models in industry, it is necessary to enhance their transparency and explain the decisions behind their predictions. Class Activation Mapping (CAM) (Zhou et al., 2016) was introduced to reveal the inference process by identifying discriminative regions via global average pooling. However, because it requires the feature maps to directly precede the softmax layer, CAM is only applicable to CNNs without any fully connected layers (Selvaraju et al., 2020). To overcome this limitation, Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2020) was designed as a generalization of CAM that generates visual explanations for any CNN-based model by using gradient information. In this paper, Grad-CAM is utilized to provide an explanation of the inference.

Fig. 7

An example image before and after preprocessing. The original image is simplified here since it contains confidential information

Fig. 6 shows the principle of Grad-CAM, taking the ghost curtain class as an example. The gradient of the score for the ghost curtain class with respect to the \(i^{th}\) feature map \(A^i\), \(\frac{\partial y_{ghost}}{\partial A^{i}}\), is first calculated. By global average pooling of these gradients, we obtain the weight of each feature map:

$$\begin{aligned} \alpha _{i}^{ghost} = \frac{1}{Z}~{\sum \limits _{w = 1}^{W}{\sum \limits _{h = 1}^{H}{~\frac{\partial y_{ghost}}{\partial A_{wh}^{i}}}}} \end{aligned}$$
(5)

where Z is the number of pixels in the feature map, and \(A_{wh}^i\) is the activation at location (w, h) in feature map \(A^i\).

The coarse heatmap is generated by a weighted sum of the feature maps followed by a ReLU, as given in Eq. (6). The final Grad-CAM heatmap is obtained by resizing it to the original image size. This heatmap serves as an explanation of the model's decision-making process: regions with higher values, shown in red in the heatmap, indicate the features that contribute most to the model prediction. By visualizing these regions, we can see where the model looks when making a prediction, which increases the reliability of the model by allowing predictions and activated regions to be verified against each other.

$$\begin{aligned} L_{Grad - CAM}^{ghost} = ReLU\left( ~{\sum \limits _{i = 1}^{N}\alpha _{i}^{ghost}}A^{i} \right) \end{aligned}$$
(6)
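A minimal sketch of this computation for one image and one class is shown below, using PyTorch hooks; layer and variable names are ours, and in practice an existing library such as pytorch-grad-cam could be used instead:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx, target_layer):
    """Compute a Grad-CAM heatmap (Eqs. 5-6) for one class of a CNN classifier."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(image.unsqueeze(0))      # forward pass, shape (1, num_classes)
    model.zero_grad()
    logits[0, class_idx].backward()         # gradient of the class score y_c
    h1.remove(); h2.remove()

    A = activations[0].squeeze(0)           # feature maps A^i, shape (C, H, W)
    dY_dA = gradients[0].squeeze(0)         # gradients dy_c / dA^i
    alpha = dY_dA.mean(dim=(1, 2))          # Eq. (5): global average pooling of the gradients
    cam = F.relu((alpha[:, None, None] * A).sum(dim=0))   # Eq. (6): weighted sum + ReLU

    # Upsample to the input resolution and normalize to [0, 1] for visualization.
    cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze()
    return cam / (cam.max() + 1e-8)
```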

Experiments

Data gathering and preprocessing

The FZ vision system is composed of a BlackFly monochrome camera with a neutral density filter. The camera records at approximately 20 frames per second, capturing gray-scale images with dimensions of 1600x1200 pixels. However, in this study we only used the images captured at the beginning of the cone phase, where the crystal diameter is approximately 21 mm. The images were processed by the FZ automation system for controlling the FZ process and therefore contain embedded information in the upper left corner, such as a timestamp and the machine serial number. On the one hand, this information is confidential, so it was masked with a black rectangle. On the other hand, the information embedded in the upper left corner could potentially be used as a shortcut for differentiating oxides, for instance, if the oxide occurs more frequently in recent production runs or on a specific machine. However, we want the model to focus only on the oxide region itself rather than memorizing such details; the black rectangle prevents the model from learning from this information. The dataset was preprocessed by removing noise, cropping, resizing, and normalizing the images. The images were cropped to a fixed-size region of interest and resized to a shape of 320x800 pixels, as shown in Fig. 7. We did not resize to a square shape because doing so would distort spatial information or even lose the features of small oxides.

Thereafter, we split the dataset in a random but stratified fashion so that 70%, 10% and 20% of the dataset ended up in the training, validation and test splits, respectively. All images were first scaled to [0, 1] by dividing by 255, and then normalized to the range [\(-1\), 1] using a mean of 0.5 and a standard deviation of 0.5. Although a mean of 0.5 and a standard deviation of 0.5 are not the actual dataset statistics, they serve as estimates that center the data distribution around 0. During training, the training split was augmented by randomly changing the brightness and contrast across different batches. This increases the amount and diversity of the training data and thus mitigates model overfitting.
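An illustrative transform pipeline for this normalization and augmentation is sketched below; the augmentation ranges are placeholders, not the values used in the experiments, and the conversion of the single gray channel to the 3-channel input expected by ImageNet backbones is omitted:

```python
from torchvision import transforms

# Sketch of the training-time pipeline: scale to [0, 1], resize the cropped ROI,
# apply random brightness/contrast changes, then normalize to [-1, 1].
train_transform = transforms.Compose([
    transforms.ToTensor(),                                 # gray-scale image -> [0, 1]
    transforms.Resize((320, 800), antialias=True),         # fixed ROI shape, no square distortion
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # placeholder augmentation ranges
    transforms.Normalize(mean=[0.5], std=[0.5]),           # [0, 1] -> [-1, 1]
])

eval_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((320, 800), antialias=True),
    transforms.Normalize(mean=[0.5], std=[0.5]),           # no augmentation at evaluation time
])
```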

Baselines and implementation details

Taking into account the requirement for lightweight models for real-time deployment, we investigated ResNet50 and InceptionV3 to evaluate the proposed method. The models' weights were initialized from the weights learned on the ImageNet database, and the output layer of each model was replaced by the sequential layers shown in Table 1. For comparison, experiments were conducted with the following baselines: models trained from random initialization with BCE loss (denoted BCE-S), pre-trained models trained from transferred weights with BCE loss (denoted BCE), pre-trained models with FL and ASL, as well as pre-trained models using Label Powerset (PS) and Binary Relevance (BR). Note that for all pre-trained models, the entire network was fine-tuned. For a fair comparison, FL and ASL shared the same focusing parameter \(\gamma = 2\) for negative samples: in ASL we set \(\gamma _+=0\) following the recommendation in Ridnik et al. (2021), while \(\gamma _+=\gamma _-=2\) in focal loss. A probability margin m of 0.05 was used in ASL.

Considering that models trained from scratch may need more iterations to converge, we trained the pre-trained models for 50 epochs and the models trained from random initialization for 100 epochs. We used the Adam optimizer with \((\beta _1,\beta _2) = (0.9, 0.999)\) and the one-cycle policy with a maximum learning rate of 1e-4. The batch size was set to 32. During training, each model's performance is monitored by evaluating the loss on the validation split. After training is completed, the model checkpoint with the best performance on the validation split is selected for comparison.
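A condensed sketch of this training setup is shown below, assuming the model, asymmetric_loss function and data loaders sketched earlier; checkpointing and logging details are simplified:

```python
import torch

EPOCHS, MAX_LR = 50, 1e-4   # 100 epochs for the randomly initialized baselines

optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), lr=MAX_LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=MAX_LR, epochs=EPOCHS, steps_per_epoch=len(train_loader))

best_val_loss = float("inf")
for epoch in range(EPOCHS):
    model.train()
    for images, targets in train_loader:          # batch size 32
        optimizer.zero_grad()
        loss = asymmetric_loss(model(images), targets,
                               gamma_pos=0.0, gamma_neg=2.0, margin=0.05)
        loss.backward()
        optimizer.step()
        scheduler.step()                           # one-cycle schedule advances per batch

    model.eval()
    with torch.no_grad():
        val_loss = sum(asymmetric_loss(model(x), y).item()
                       for x, y in val_loader) / len(val_loader)
    if val_loss < best_val_loss:                   # keep the best validation checkpoint
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```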

Fig. 8

Performance of the ResNet50 models. The x-axis shows the five different metrics and the y-axis represents the corresponding score in percentage. Different colors in the legend correspond to different models

Evaluation metrics

Compared to single-label classification, which normally utilizes confusion-matrix-based metrics, MLC needs more complex metrics since classes may occur simultaneously. Evaluation metrics in MLC can be mainly divided into two groups: instance-wise metrics and label-wise metrics (Zhang and Zhou, 2014). In this study, we chose subset accuracy, hamming score, and micro average F1 score, which are instance-wise metrics, and macro average F1 score, which is a label-wise metric. Since these metrics are computed with a fixed threshold, they can be sensitive to the threshold selection. Therefore, we also chose a threshold-independent metric, mean average precision. A minimal sketch of how these metrics can be computed is given after the list below.

  • Subset accuracy (SA) is the ratio of correctly classified samples, i.e. the predictions exactly match the ground-truth labels (Zhang and Zhou, 2014).

    $$\begin{aligned} SA = \frac{1}{N}{\sum \limits _{i = 1}^{N}{I\left( Y_{i} = {{\hat{Y}}}_{i} \right) }} \end{aligned}$$
    (7)
  • Hamming score (HS) is computed from the hamming loss (HL), a widely used performance measure in multi-label classification that measures the symmetric difference between the predicted labels and the ground truth using the XOR operation from Boolean logic (Tarekegn et al., 2021).

    $$\begin{aligned} HL= & {} \frac{1}{N}{\sum \limits _{i = 1}^{N}\frac{\left| {Y_{i}\mathrm {\Delta }{{\hat{Y}}}_{i}} \right| }{M}} \end{aligned}$$
    (8)
    $$\begin{aligned} HS= & {} 1 - HL \end{aligned}$$
    (9)

    where \(\Delta \) denotes XOR operation. The higher the value of HS, the better the performance.

  • Micro average F1 score \((F_{mi})\) is the balanced measure of micro average precision and micro average recall (Kubany et al., 2020). \(F_{mi}\) treats each sample equally. However, this also means that \(F_{mi}\) can be easily affected by dominant classes.

    $$\begin{aligned} F_{mi}&= 2\frac{P_{mi}R_{mi}}{P_{mi} + R_{mi}} \end{aligned}$$
    (10)
    $$\begin{aligned} P_{mi}&= \frac{\sum \limits _{i = 1}^{M}{TP}_{i}}{\sum \limits _{i = 1}^{M}\left( {TP}_{i} + {FP}_{i} \right) } \end{aligned}$$
    (11)
    $$\begin{aligned} R_{mi}&= \frac{\sum \limits _{i = 1}^{M}{TP}_{i}}{\sum \limits _{i = 1}^{M}\left( {TP}_{i} + {FN}_{i} \right) } \end{aligned}$$
    (12)
  • Macro average F1 score \((F_{ma})\) is the harmonic mean of macro average precision and macro average recall (Kubany et al., 2020). Unlike \(F_{mi}\), \(F_{ma}\) assigns equal weight to each class.

    $$\begin{aligned} F_{ma}&= 2\frac{P_{ma}R_{ma}}{P_{ma} + R_{ma}} \end{aligned}$$
    (13)
    $$\begin{aligned} P_{ma}&= \frac{1}{M}{\sum \limits _{i = 1}^{M}P_{i}},~where~P_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}} \end{aligned}$$
    (14)
    $$\begin{aligned} R_{ma}&= \frac{1}{M}{\sum \limits _{i = 1}^{M}R_{i}},~where~R_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FN}_{i}} \end{aligned}$$
    (15)
  • Mean average precision (mAP) represents the average of average precision (AP) over all classes.

    $$\begin{aligned} mAP= & {} \frac{1}{M}{\sum \limits _{i = 1}^{M}{AP}_{i}},~where~\nonumber \\ {AP}_{i}= & {} {\int _{0}^{1}{p_{i}\left( r_{i} \right) dr_{i}}} \end{aligned}$$
    (16)

    where \(p_i\) and \(r_i\) are the precision and recall of class i under different thresholds, respectively. As seen in Eq. (16), \(AP_i\) also represents the area under the precision-recall curve. Since mAP considers all possible thresholds, it is often regarded as the most important metric for evaluating model performance over all classes. A higher mAP means better model performance.
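As referenced above, the following sketch illustrates how these metrics can be computed from per-class probabilities using scikit-learn; thresholding at 0.5 is an assumption matching the global threshold used in the comparison below:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score, average_precision_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute SA, HS, micro/macro F1 and mAP for multi-label predictions.

    y_true: binary ground-truth matrix, shape (n_samples, n_classes)
    y_prob: predicted per-class probabilities, same shape
    """
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "SA":   accuracy_score(y_true, y_pred),             # exact-match ratio, Eq. (7)
        "HS":   1.0 - hamming_loss(y_true, y_pred),         # Eqs. (8)-(9)
        "F_mi": f1_score(y_true, y_pred, average="micro"),  # Eqs. (10)-(12)
        "F_ma": f1_score(y_true, y_pred, average="macro"),  # Eqs. (13)-(15)
        "mAP":  average_precision_score(y_true, y_prob, average="macro"),  # Eq. (16)
    }
```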

Since neural network optimization is nondeterministic and thus yields uncertainty in model performance, a fixed random seed is needed to obtain repeatable results. To test the robustness of the models against randomness, which is particularly important with a small dataset, each experiment was replicated with three different random seeds (0, 42 and 2023).

Fig. 9

Performance of the InceptionV3 models. The x-axis shows the five different metrics and the y-axis represents the corresponding score in percentage. Different colors in the legend correspond to different models

Results

Comparison results

Table 3 compares the performance of the two model architectures with different methods on the test set. The results are also drawn as bar charts in Figs. 8 and 9. Except for mAP, all results were obtained with a global threshold of 0.5 for each class. Note that for the PS baseline, the prediction for each class is inferred from the predictions of class combinations. For instance, if the prediction of "spot and shadow" exceeds the threshold of 0.5, then both the spot class and the shadow class are activated and denoted as "1", while the remaining classes are denoted as "0". Therefore, mAP is not available for the PS baseline, since there is no single per-class threshold to sweep.
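A small sketch of this decoding step is given below; the powerset combinations listed are hypothetical examples, since in practice they are defined by the label combinations present in the training data:

```python
import numpy as np

CLASSES = ["spot", "shadow", "ghost_curtain"]
# Hypothetical powerset classes: each entry maps one combined class to its member labels.
POWERSET = [set(), {"spot"}, {"shadow"}, {"ghost_curtain"}, {"spot", "shadow"}]

def decode_powerset(ps_probs, threshold=0.5):
    """Convert Label Powerset predictions to a binary per-class label vector."""
    y = np.zeros(len(CLASSES), dtype=int)
    for prob, members in zip(ps_probs, POWERSET):
        if prob >= threshold:                    # activated class combination
            for c in members:
                y[CLASSES.index(c)] = 1          # activate every member class
    return y

# e.g. decode_powerset([0.1, 0.2, 0.1, 0.05, 0.7]) -> array([1, 1, 0])
```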

As seen in Figs. 8 and 9, the models with pre-trained weights outperformed the models trained from random initialization. The highest mAP of the models trained from random initialization is 79.39%, while the highest mAP of the pre-trained models is 94.12%, as seen in Table 3. In addition, a relatively large variance in terms of \(F_{mi}\) and \(F_{ma}\) can be observed in the models trained from random initialization, indicating their sensitivity to randomness. These observations indicate that random weight initialization may make it more difficult to find optimal values.

Fig. 10

The ResNet50 model's performance obtained with different values of the probability margin m in terms of SA, HS, \(F1_{mi}\), \(F1_{ma}\), and mAP. The shaded region represents the standard deviation

Among all pre-trained models, the models with ASL outperformed the other models in terms of the five averaged metrics. In particular, the highest mAPs obtained by ASL verify its robustness against different thresholds. In addition, the highest F1 scores obtained by ASL, compared to FL, indicate the effectiveness of asymmetric focusing in emphasizing feature learning from positive samples. BR achieved the second highest mAP, while its remaining metrics did not perform well compared with the other models, indicating that a specific threshold function might be needed, which could be related to its lack of label consistency.

Effects of various components in the asymmetric loss

The asymmetric focusing parameter \(\gamma \) and the probability margin m are the two significant components of the asymmetric loss. In this section, the effects of these two components of ASL on top of the ResNet50 architecture were investigated in depth. Considering that our dataset is relatively small, it is not possible to draw conclusions on the selection of optimal hyperparameters from our experiments. Therefore, we mainly focus on the comparison between the conventional BCE loss and ASL under different hyperparameter levels and investigate under which conditions ASL is effective in our case. Following the recommendation in Ridnik et al. (2021), we fixed \(\gamma _+=0\). Fig. 10 shows the performance obtained with different values of the probability margin m for six levels of the asymmetric focusing parameter \(\gamma _-\), from 1 to 6.

Effect of probability margin

The probability margin m is a tunable hyperparameter in ASL that attenuates easy negative samples and rejects possibly mislabeled ones.

As seen in Fig. 10, with respect to the threshold-independent metric mAP, the averaged performance at most asymmetry levels was above the reference mAPs of the BCE loss, albeit with overlapping uncertainty. However, a downward trend in each of the remaining threshold-dependent metrics was observed with increasing probability margin, which was particularly apparent at large asymmetry levels. Specifically, for \(\gamma _-\) ranging from 3 to 6, all threshold-dependent metrics peaked at probability margins below 0.2, while no significant changes were observed for \(\gamma _-\) of 1 and 2. In general, the threshold-dependent performance obtained with \(\gamma _-\) below 3 was consistently above the reference values of the BCE loss for probability margins below 0.2.

Fig. 11

Probability vs. loss gradient of negative samples for \(\gamma _-=2\)

This can be explained by the loss gradient of the negative samples. The loss gradient of negative samples with respect to the input logit z is defined in Eq. (18) (Ridnik et al., 2021). Figure 11 compares the loss gradient of negative samples for different values of m (with \(\gamma _-=2\)). The peak of the loss gradient decreases with increasing probability margin, so that an increasingly larger share of the loss gradient comes from positive samples, while learning from negative samples is gradually underemphasized. Furthermore, the loss imbalance caused by a large probability margin is more apparent when using a larger \(\gamma _-\), since the loss contribution of the negative samples is further down-weighted. Therefore, to obtain better performance than the BCE loss, it is preferable to choose a smaller probability margin.

$$\begin{aligned} \frac{dL_{-}}{dz}&= \frac{\partial L_{-}}{\partial p}\frac{\partial p}{\partial z} \end{aligned}$$
(17)
$$\begin{aligned}&= \left( p_{m} \right) ^{\gamma _{-}}\left[{\frac{1}{1 - p_{m}} - \frac{\gamma _{-}{\log \left( {1 - p_{m}} \right) }}{p_{m}}} \right]p(1 - p) \end{aligned}$$
(18)

where \(p=1/(1+e^{-z})\) is the sigmoid of the logit z and \(p_m\) is the shifted probability defined in Eq. (4).
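The following sketch (variable names ours) evaluates Eq. (18) numerically, which is how a comparison like Fig. 11 can be reproduced for different margins:

```python
import numpy as np

def neg_loss_gradient(p, gamma_neg=2.0, margin=0.05, eps=1e-12):
    """Loss gradient of negative samples w.r.t. the logit z, Eq. (18)."""
    p_m = np.maximum(p - margin, 0.0)
    bracket = 1.0 / (1.0 - p_m + eps) - gamma_neg * np.log(1.0 - p_m + eps) / (p_m + eps)
    return (p_m ** gamma_neg) * bracket * p * (1.0 - p)

p = np.linspace(0.01, 0.99, 99)
for m in (0.0, 0.05, 0.2):
    g = neg_loss_gradient(p, gamma_neg=2.0, margin=m)
    print(f"m={m}: peak gradient {g.max():.3f} at p={p[g.argmax()]:.2f}")
```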

Effect of focusing parameter for negative samples

Table 2 Comparison of Grad-CAM heat maps generated from ResNet50-ASL and InceptionV3-ASL. Predicted classes indicated in green are correctly predicted, while predictions in red are classes that should have been found but were missed by the algorithm

The focusing parameters \(\gamma _-\) and \(\gamma _+\) are of vital importance in the asymmetric loss. If \(\gamma _-\) is set too low, easy negative samples are not down-weighted enough (Ridnik et al., 2021), which still results in the accumulation of loss gradients from negative samples. If \(\gamma _-\) is set too high, the contribution from easy negative samples is degraded so strongly that the loss may focus excessively on hard samples; in that case, the model performance is also easily affected by any noise mixed into the hard samples.

As seen in Fig. 10, the smaller \(\gamma _-\), the better the performance in terms of the threshold-dependent metrics at a given probability margin, and the steadier the performance against different probability margin levels. Conversely, with increasing \(\gamma _-\), the degradation of the threshold-dependent performance with increasing probability margin becomes more obvious. Moreover, note that for the largest \(\gamma _-\) of 6, all metrics tend to show larger uncertainty. This might be associated with the fact that the model becomes more sensitive to the loss contribution from positive samples. Nevertheless, the averaged mAP obtained with asymmetric focusing was always above the reference mAP of the BCE loss. This indicates that the threshold for larger \(\gamma _-\) might need to be optimized.

Visual analysis by Grad-CAM

Given the well-trained models, Grad-CAM (Selvaraju et al., 2020) was employed to verify the correlation between images and labels, and thus to build trust for the application in industry. Table 2 presents the Grad-CAM heat maps generated from ResNet50-ASL and InceptionV3-ASL under the random seed of 42. The ground truths and the model predictions are shown at the bottom of the images.

As can be seen in Table 2, the models using ASL were able to identify the oxide types. In addition, it is worth mentioning that the models successfully highlighted the features of the activated classes in the heat maps. This implies that the heat maps generated from Grad-CAM can be used to localize the oxides. Interestingly, compared with the ResNet50 model, the InceptionV3 model seemed able to capture more comprehensive and precise features. This might be related to the strong feature extraction ability provided by the different filter scales in InceptionV3. We also show a misclassified image at the end of the table, where the shadow class was missed. However, in the shown heat maps, the regions activated by the shadow class were still close to the shadow regions. Therefore, the missed detection might be due to the confidence for the shadow class falling below the threshold of 0.5. This can be improved by optimizing the threshold or exploring the label correlation.

Conclusions

In this work, oxide classification in Float-Zone silicon crystal growth production was studied. We characterized the oxide by classifying oxide types into three varieties: spot, shadow and ghost curtain. We then took advantage of FZ images captured from the vision system integrated in an industrial FZ machine to establish an oxide dataset. To address the data imbalance problem and limited dataset size in our case, a method based on transfer learning and asymmetric loss was presented. The results showed that the model trained with pre-trained weights and asymmetric loss achieved the highest averaged subset accuracy of 88.73% and the highest averaged mAP of 94.12% over the other baselines. This study also investigated the effectiveness of asymmetric loss compared to the conventional BCE loss by studying the effect of its various components. Finally, to build trust for the subsequent integration of the model in industry, Grad-CAM was employed to verify the models by visualizing the correlation between inputs and labels.

As mentioned above, this study is the first step in the oxidation investigation; it lays the foundation for the subsequent root cause analysis and opens the possibility of automatic responses for mitigating oxides. Moreover, since at present only one frame was extracted for each production run, a planned future development is to extend the presented single-frame oxide classification method to real-time oxide classification for early detection of oxide, for example by analyzing in-process video recorded by the integrated vision system, thus enabling evaluation of oxide formation, growth and extension. More images will then allow a holistic understanding of the process and tracking of oxide growth.