Introduction

In semiconductor manufacturing, the wafer is the fundamental building block of integrated circuits (ICs). A single wafer can carry several hundred ICs after hundreds of sophisticated processes (Alam & Kehtarnavaz, 2022). Any abnormality in these production processes may introduce defects into the wafer map. Given the complexity of these processes, it is practically impossible to produce wafers without any defects (Jin et al., 2019; Liu & Chien, 2013).

After completing the wafer fabrication processes, every wafer undergoes a testing procedure consisting of a series of electrical tests that determine whether each individual chip (or die) meets its product specifications. Specifically, a probe test bench measures the electrical characteristics of the dies (Cheng et al., 2021). According to the resulting quality level, the dies are marked in different colors on the wafer map. The captured defect patterns are typically divided into two types: global random defects and local systematic defects. In global patterns, defects are distributed randomly across the wafer without any spatial arrangement, even under normal production conditions. In contrast, local patterns exhibit spatial correlations in specific regions of a wafer, producing shapes such as center circles, edge rings, local zones, and scratches (Wu et al., 2014).

Integrated circuit manufacturing requires high investment, precise technology, and a complex manufacturing process. Thus, analyzing the wafer map is essential to improve the yield, quality, and reliability of the IC manufacturing process. Even so, manually annotating wafer maps with their defect types is time-consuming and expensive, especially on large production lines (Shankar & Zhong, 2005). Moreover, engineers judge the defect types of a wafer map based on their professional knowledge; working long hours exposes them to visual fatigue and raises the risk of erroneous classification. Hence, automatic inspection of wafer map defects is a necessary step that can reduce both time and cost.

With the advancement of machine learning and deep learning algorithms, building an effective automatic fault detection model has become a hot topic in the research community (Kim & Behdinan, 2023; Theodosiou et al., 2023). These wafer fault detection models can be categorized into segmentation models (Cheng et al., 2021; Chu et al., 2022; Jin et al., 2019; Kim & Kang, 2021; Lee et al., 2010; Nag et al., 2022; Yan et al., 2023) and classification models (Baly & Hajj, 2012; Kyeong & Kim, 2018; Saqlain et al., 2019; Yu et al., 2019; Jin et al., 2020; Chen et al., 2021, 2022; Kang & Kang, 2021; Kim et al., 2021; Wang et al., 2021, 2022; Yu et al., 2021a, b, c; Zheng et al., 2021; Shin et al., 2022; Xuen et al., 2022; Yoon & Kang, 2022; Yu et al., 2022; Zhang et al., 2022; Alqudah et al., 2023). Classical supervised recognizers have achieved good results in wafer map defect recognition (Alqudah et al., 2023; Baly & Hajj, 2012; Cheng et al., 2021; Saqlain et al., 2019). Nevertheless, their performance relies on the effectiveness of the feature extraction step, and the spatial resolution and noise of the wafer maps significantly affect such techniques. Accordingly, deep feature learning and extraction has recently been widely applied in the field of wafer defect recognition (Kyeong & Kim, 2018; Yu et al., 2019, 2022; Jin et al., 2020; Chen et al., 2021, 2022; Kang & Kang, 2021; Kim et al., 2021; Wang et al., 2021, 2022; Yu et al., 2021a, b, c; Zheng et al., 2021; Shin et al., 2022; Xuen et al., 2022; Yoon & Kang, 2022; Zhang et al., 2022; Xu et al., 2023). However, employing a traditional 2D convolutional neural network (CNN) for the direct classification of defects can easily lead to instability in the classification results (Xu et al., 2023), especially at very small image resolutions, as in the employed WM-811K wafer map dataset; this instability arises from the lack of spatial information for learning features and making proper predictions. In addition, 2D CNNs generally produce high-dimensional features that may contain much redundant information, which raises challenges at the final classification stage, such as increased computational complexity, reduced memory efficiency, and a higher chance of overfitting (Yu et al., 2019; Jin et al., 2020; Chen et al., 2021, 2022; Kang & Kang, 2021; Wang et al., 2021, 2022; Yu et al., 2021a, b; Zheng et al., 2021; Xuen et al., 2022; Yoon & Kang, 2022; Yu et al., 2022; Zhang et al., 2022). Many studies have tried different feature engineering steps with deep models to overcome these challenges (Jin et al., 2020; Yu et al., 2021b; Zheng et al., 2021). Table 1 summarizes the most recent studies that employ deep models, indicating the main procedure, strengths, and issues of each one.

Table 1 Summary of the state-of-the-art techniques in wafer map fault detection that follow hybrid inspection methods

Accordingly, the main target of this work is to build a deep recognition system that provides precise salient features, despite the very low resolution of wafer maps, with the highest recognition performance. In addition, this recognition system should avoid the redundant information that leads to instability and poor interpretability. Thus, we convert the 2D wafer map fault detection problem into a 1D one, obtaining the most salient features with the least dimensionality and thereby reducing feature redundancy and dimensionality. Nonetheless, new challenges arise, such as designing a suitable embedding that keeps the 2D spatial information in the 1D representation, proper 1D feature ranking, and a suitable 1DCNN classifier. Accordingly, the main contributions of the proposed wafer map defect classification model can be summarized as follows.

  1.

    We exploit encoder-decoder networks, i.e., autoencoders (AEs), in two ways. First, an AE is employed as a new convolutional synthetization model to overcome the high class imbalance in the employed wafer map dataset. This model proves effective, reconstructing the original wafer maps with a total loss of 0.0011. Second, an AE is introduced in a new structure as an embedding representation step with dimensionality-reduction behavior, named the sparsity-boosted autoencoder (SBAE). The resultant encoded sparse maps from the SBAE guarantee more discriminative features with a 50% reduction in size compared to the original wafer maps. Despite this reduction, an inspection accuracy of 99.48% is obtained when working with an initial wafer map resolution of 27 × 25.

  2.

    An enhanced red deer optimization (ERD) algorithm with a new tinkering strategy is proposed. ERD is applied to 1D squeezed sinograms of the previous sparse maps and yields a final average feature pool of ~ 15 bases, i.e., ~ 1.5% of the initial wafer map size at a resolution of 33 × 29. The performance of ERD is compared, in an ablation manner, with other metaheuristic algorithms, namely Genetic (GA), Equilibrium (EO), Grey Wolf (GWO), Sine Cosine (SCA), and Particle Swarm (PSO) algorithms. The proposed ERD achieves the smallest feature pool size with approximately the same accuracy as its alternatives, because the proposed tinkering strategy drives the ERD algorithm toward the global optimum with the fewest discriminative features, avoiding possible redundant information.

  3.

    Intensive experiments with a new predictive 1DCNN model are performed on different resolutions of wafer maps for 8- and 9-fault-type prediction. An average accuracy of 95.2% is achieved on an unseen 62% testing part of the dataset, while an average accuracy of 98.1% is achieved on an unseen 20% testing part in a train–validation–test evaluation. Despite the aggressive dimensionality reduction, the proposed inspection model shows efficient generalization. In addition, the proposed 1DCNN network strikes a good balance between the number of parameters and the targeted accuracy in fault detection compared with other common 1DCNNs, such as 1D-VGG16, 1D-ResNet50, 1D-LeNet-5, and 1D-Inception.

The rest of the paper is organized as follows. Details about the targeted wafer map dataset are given in "Wafer map dataset" section. "Methodology" section describes the details of the proposed methodology of wafer map fault detection. "Experimental results and discussion" section presents the performed experiments and their results. Finally, the conclusion is offered in "Conclusion" section.

Wafer map dataset

The WM-811K (Wu et al., 2014) is the employed wafer map dataset; it is publicly available on the Kaggle website (WM-811K, 2014). It contains 811,457 instances collected from 46,293 lots during the semiconductor fabrication process (Wu et al., 2014). Only a subset of 21.3% (172,950 maps) is labeled by professionals with one of the following nine categories: Center, Donut, Edge-Loc, Edge-Ring, Loc, Random, Scratch, Near-Full, and None, while the rest remains unlabeled; see Fig. 1. As indicated in Fig. 1, the dataset is mostly unlabeled, most of the labeled part is fault-free ("None"), and the faulty part is highly imbalanced. Table 2 lists the distribution and the main cause of each defect. This dataset provides a single-type defect per wafer map (Baly & Hajj, 2012; Saqlain et al., 2019; Yu et al., 2019, 2022; Jin et al., 2020; Chen et al., 2021; Kang & Kang, 2021; Wang et al., 2021, 2022; Yu et al., 2021a, b; Zheng et al., 2021; Chen et al., 2022; Xuen et al., 2022; Yoon & Kang, 2022; Zhang et al., 2022; Alqudah et al., 2023), but multiple studies target mixed-type defect patterns in their inspection models, such as Kyeong and Kim (2018), Kim et al. (2021), Shin et al. (2022), Yu et al. (2022), and Xu et al. (2023).

Fig. 1
figure 1

Labeling and categories distribution of WM-811K dataset

Table 2 The observed defect patterns in WM-811K with counts and causes

Methodology

The main steps of the proposed wafer map fault detection model are shown in the graphical abstract of Fig. 2, which combines a flow chart with the design purpose of each block. These steps are detailed in the following subsections, and Algorithm 1 gives pseudocode for the whole proposed wafer fault detection model.

Fig. 2
figure 2

Graphical abstract of the proposed fault type prediction in wafer maps

Wafer data synthetization model

As presented in Table 2 and Fig. 1, WM-811K is a highly imbalanced dataset. Each wafer map consists of only three pixel values: 0 for the background, 1 for normal dies, and 2 for defective ones. Consequently, two main preprocessing steps are performed. The first is one-hot encoding, which converts the grey wafer maps \(X(m,n)\) into colored ones \(XX(m,n,c)\), where \(c\) is the number of channels, by setting \(xx\left(m,n,{c}_{i}\right)=1\) for \(i=x\left(m,n\right)\) and 0 otherwise, with \(i\in \left\{0,1,2\right\}\); here \(x\left(m,n\right)\) and \(xx(m,n,c)\) denote the grey and colored pixel values, respectively. RGB wafer maps help extract multi-scale features in the subsequent procedures. The second preprocessing step is a synthetization (augmentation or balancing) model; in this work, an autoencoder-based synthesizing model is used for data augmentation.
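To make the encoding step concrete, the following minimal NumPy sketch (the function name is ours, for illustration) converts a grey wafer map with pixel values {0, 1, 2} into its three-channel one-hot form:

```python
import numpy as np

def one_hot_wafer(x):
    """One-hot encode a grey wafer map X(m, n) with values {0, 1, 2}
    (background, normal die, defective die) into XX(m, n, 3), where
    xx(m, n, c) = 1 iff x(m, n) == c."""
    xx = np.zeros(x.shape + (3,), dtype=np.float32)
    for c in range(3):
        xx[..., c] = (x == c)
    return xx

# Example on a tiny 3x3 map
x = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1]])
xx = one_hot_wafer(x)
assert xx.shape == (3, 3, 3) and xx[0, 2, 2] == 1.0
```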

An autoencoder (AE; Li et al., 2023) is a type of artificial neural network that learns efficient mappings or codings of unlabeled data in an unsupervised manner. The AE reconstructs the input data at its output and compares the reconstruction with the original input. After numerous iterations, the cost function approaches its optimum, meaning the reconstructed data approximates the original input to the maximum extent. The introduced convolutional autoencoder (CAE) is superior to the traditional AE because its convolutional layers preserve local image structure by exploiting the spatial relationships between pixels.

The introduced CAE consists of two main parts, an encoder and a decoder; see Fig. 3. The encoder converts the input map to a low-dimensional bottleneck feature map, while the decoder performs deconvolution operations to expand the latent feature map and reconstruct the original wafer map. The encoder in the proposed architecture consists of one 2D convolutional layer with 64 filters of kernel size \(\left(3\times 3\right)\), followed by a MaxPooling layer. The feature maps extracted by the convolutional layer of the encoder are represented as

$$H=\mathcal{A}\left(XX*W+B\right),$$
(1)

where \(\mathcal{A}\) is the activation function, employed here as ReLU, \(XX\) denotes the one-hot-encoded colored wafer maps, and \(W\) and \(B\) are the weights (convolutional kernel) and the bias, respectively.

Fig. 3
figure 3

The CAE-based synthetization model for wafer maps augmentation

The feature maps extracted by the convolutional layer are fed into a MaxPooling layer to produce the targeted low-dimensional bottleneck feature map \({H}_{c}\), as

$${H}_{c}\left({x}^{\prime},{y}^{\prime}\right)=\max_{i,j=0,\dots ,r-1}H\left({x}^{\prime}+i,{y}^{\prime}+j\right),$$
(2)

where \(r\) is the MaxPooling operator size and \({x}^{\prime}\) and \({y}^{\prime}\) are the pixel coordinates. At this point, random Gaussian noise (\(\mu =0, \sigma =0.1\)) is added to the bottleneck embedded map \({H}_{c}\) to give the synthetization model more robustness. The decoder network then attempts to reconstruct the input maps \(XX\) through two transposed convolution (deconvolutional) layers and an UpSampling layer; see Fig. 3. The output of the decoder is the restored feature map of \({H}_{c}\), as

$$\mathcal{X}=\mathcal{A}\left({H}_{c}*/*W+B\right),$$
(3)

where \(*/*\) is the deconvolution (Conv2DTranspose) operator. The first Conv2DTranspose (TConv2D) layer in the decoder network employs 64 filters of kernel size \(\left(3\times 3\right)\) with ReLU activation \(\mathcal{A}\); the second employs three filters of kernel size \(\left(3\times 3\right)\) with Sigmoid activation. The proposed CAE is trained over a number of epochs to minimize the reconstruction error between the original wafer maps and the synthesized ones, in terms of the mean squared error

$$MSE=\frac{1}{N}\sum_{i=1}^{N}{\left({XX}_{i}-{\mathcal{X}}_{i}\right)}^{2},$$
(4)

where \(\mathcal{X}=\widehat{XX}\) denotes the newly synthesized wafer maps and \(N\) is the number of input wafer maps. Figure 4 shows a sample wafer map and its synthesized counterpart; they look very similar in the zoomed-in versions of the maps. The same figure also shows the reconstruction error over the training epochs, with a final loss of 0.0011 at epoch 30.
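A minimal Keras sketch of the described CAE is given below. The layer types and sizes follow the text (one Conv2D + MaxPooling encoder, Gaussian noise at the bottleneck, two Conv2DTranspose layers with an UpSampling layer in the decoder), while the `padding` choices and the even 26 × 26 input are assumptions we make so the reconstruction matches the input shape exactly:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cae(m, n, c=3, noise_std=0.1):
    inp = layers.Input(shape=(m, n, c))
    # Encoder: Conv2D (Eq. 1) followed by MaxPooling (Eq. 2)
    h = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(inp)
    hc = layers.MaxPooling2D((2, 2), padding="same")(h)
    # Gaussian noise at the bottleneck (mu = 0, sigma = 0.1); active during training only
    hc = layers.GaussianNoise(noise_std)(hc)
    # Decoder: two Conv2DTranspose layers with UpSampling (Eq. 3)
    d = layers.Conv2DTranspose(64, (3, 3), activation="relu", padding="same")(hc)
    d = layers.UpSampling2D((2, 2))(d)
    out = layers.Conv2DTranspose(c, (3, 3), activation="sigmoid", padding="same")(d)
    cae = Model(inp, out)
    cae.compile(optimizer="adam", loss="mse")  # MSE reconstruction loss (Eq. 4)
    return cae

cae = build_cae(26, 26)  # e.g., the "All" set resolution; even dims reconstruct exactly
```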

Fig. 4
figure 4

The performance of the proposed CAE-based synthetization model―Sample wafer map with its synthesized one and the training reconstruction error over epochs

Sparse feature learning and encoding

At this stage, since low-dimensional 2D wafer maps are targeted, a new sparse autoencoder model is introduced. Sparsity is the property of being sparse, i.e., having many zero entries (Sun et al., 2022); in the context of machine learning, it often refers to the number of zero weights in a neural network. An autoencoder is called sparse when its hidden layer activations are encouraged to be sparse (Ng, 2011): a sparsity constraint is added to the loss function of the traditional convolutional autoencoder. This constraint can be based on the L1 norm of the hidden layer activations or on the KL divergence between the distribution of activations in the hidden layer and a sparse target distribution. For the main difference between the traditional autoencoder and the sparse one, see Fig. 5.

Fig. 5
figure 5

Visual comparison between the traditional autoencoder in a and the sparsity-induced autoencoder in b. The dark green nodes are firing whereas the red nodes are constrained, i.e., we are in effect reducing the number of firing neurons

Algorithm 1.
figure a

The proposed wafer map fault detection

In the proposed sparsity-boosted autoencoder (SBAE), a sparsity-reinforced layer is added as the last layer of the encoding phase. The network is therefore encouraged to learn an encoding that activates only a small number of nodes. The cost function of the proposed SBAE utilizes three terms: a reconstruction term combined with two regularizers, a weight decay term and a sparsity-boosting term. The reconstruction term is the same as in the previous CAE. The weight decay term helps decrease the magnitude of the weights and prevents overfitting. The sparsity-boosting term induces a sparsity penalty in the training criterion. For the configuration of the proposed SBAE, see Table 3.

Table 3 Detailed configuration of the proposed SBAE

Assume the \(N\) synthesized wafer map samples \(\left({\mathcal{X}}^{1}, {\mathcal{X}}^{2},\dots , {\mathcal{X}}^{N}\right)\), where \({x}_{i}\) represents the ith input of sample \({\mathcal{X}}^{i}\). The cost function considering only the reconstruction term together with the weight decay term can be expressed as

$$J\left(W,B\right)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}{\left\Vert {\mathcal{A}}_{W,B}\left({\mathcal{X}}^{i}\right)-{\mathcal{X}}^{i}\right\Vert }^{2}+\frac{\lambda }{2}\sum_{l=1}^{{n}_{l}-1}\sum_{i=1}^{{o}_{l}}\sum_{j=1}^{{o}_{l+1}}{\left({W}_{ji}^{\left(l\right)}\right)}^{2},$$
(5)

where \({\mathcal{A}}_{W,B}\left({\mathcal{X}}^{i}\right)=\mathcal{A}\left(W*{\mathcal{X}}^{i}+B\right)\) is the activation or mapping of the input \({\mathcal{X}}^{i}\) at layer \(l\), \({n}_{l}\) denotes the number of layers in the targeted network, and \({o}_{l}\) is the number of nodes or units in layer \(l\). \(\lambda\) adjusts the weight of the decay term; a large \(\lambda\) can cause underfitting, while small values may cause overfitting. In the proposed SBAE configuration, the value of \(\lambda\) is adjusted empirically via multiple experiments.

For the sparsity-boosting term, \(\mathcal{K}\left(\mathcal{P}\Vert \widehat{\mathcal{P}}\right)\) is added to the cost function in Eq. 5, which is reformulated as

$${J}_{sparse}\left(W,B\right)=J\left(W,B\right)+\gamma \sum_{j}\mathcal{K}\left(\mathcal{P}\Vert {\widehat{\mathcal{P}}}_{j}\right),$$
(6)
$$\mathcal{K}\left(\mathcal{P}\Vert {\widehat{\mathcal{P}}}_{j}\right)=\mathcal{P}{\text{log}}\frac{\mathcal{P}}{{\widehat{\mathcal{P}}}_{j}}+\left(1-\mathcal{P}\right){\text{log}}\frac{1-\mathcal{P}}{1-{\widehat{\mathcal{P}}}_{j}},$$
(7)

where \(\mathcal{K}\left(\mathcal{P}\Vert \widehat{\mathcal{P}}\right)\) is the Kullback–Leibler divergence, which reduces the deviation between \(\widehat{\mathcal{P}}\) and \(\mathcal{P}\). \(\mathcal{P}\) denotes the targeted sparsity parameter, typically a small value close to zero. \(\widehat{\mathcal{P}}\) is the average output of all hidden neurons, \({\widehat{\mathcal{P}}}_{j}=\frac{1}{N}\sum_{i}{\mathcal{A}}_{W,B}\left({\mathcal{X}}^{i}\right)\). \(\gamma\) is the sparse penalty coefficient.
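As an illustration, a TensorFlow sketch of the sparsity-boosting term of Eqs. 6–7 is shown below; the values of \(\mathcal{P}\) (`rho`) and \(\gamma\) (`gamma`) are placeholders, since the paper tunes them empirically:

```python
import tensorflow as tf

def kl_sparsity_penalty(activations, rho=0.05, gamma=1e-3):
    """Sparsity-boosting term of Eqs. 6-7: KL divergence between the target
    sparsity rho and the batch-averaged hidden activations rho_hat, scaled
    by the sparse penalty coefficient gamma."""
    rho_hat = tf.reduce_mean(activations, axis=0)           # average activation per unit
    rho_hat = tf.clip_by_value(rho_hat, 1e-7, 1.0 - 1e-7)   # numerical safety
    kl = rho * tf.math.log(rho / rho_hat) + \
         (1.0 - rho) * tf.math.log((1.0 - rho) / (1.0 - rho_hat))
    return gamma * tf.reduce_sum(kl)  # added to the loss of Eq. 5
```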

Using the introduced SBAE, a wafer map of size \((m\times n\times c)\) is converted into a sparse encoded wafer map \({\mathcal{X}}^{s}\) of size \(\frac{m}{2}\times \frac{n}{2}\times c\), i.e., each spatial dimension of the encoded map is halved. The encoded map \({\mathcal{X}}^{s}\) is obtained from the bottleneck layer of the SBAE. To visualize the clustering performance of the resultant encoded maps, t-SNE (Van der Maaten & Hinton, 2008) is used; see Fig. 6. It visualizes high-dimensional data by mapping the clustered features into a low-dimensional space. As indicated, the encoded feature maps show better clustering than the original maps.

Fig. 6
figure 6

t-SNE comparison before (a) and after (b) the sparse encoding by SBAE

A new sinogramic red deer feature ranking

The main target of the feature engineering steps is to reduce feature dimensionality while keeping the best performance. At this stage, we intend to convert the sparse encoded wafer maps \({\mathcal{X}}^{s}\) into 1D signals without losing the spatial information of the 2D maps. Therefore, the sparse encoded feature maps \({\mathcal{X}}^{s}\) are converted to sinograms via the Radon transform (Leavers, 1992). Each sinogram is then flattened into a 1D signal \({y}^{s}\) of size \(\left(1\times \frac{mnc}{4}\right)\), so for \(N\) samples we have a 1D feature pool \({Y}^{s}\) of size \(\left(N\times \frac{mnc}{4}\right)\). Then, the proposed enhanced red deer (ERD) algorithm is applied to the resultant sinograms to select the optimal reduced features.
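A sketch of this conversion using scikit-image is given below; the set of projection angles is an assumption, as the paper does not list the angles used, and the resulting vector length depends on that choice (the paper reports a 1D signal of size mnc/4):

```python
import numpy as np
from skimage.transform import radon

def sparse_map_to_1d(xs):
    """Convert one sparse encoded map X^s of shape (m/2, n/2, c) into a 1D
    feature vector: one Radon sinogram per channel, then flattening."""
    theta = np.linspace(0.0, 180.0, max(xs.shape[:2]), endpoint=False)  # assumed angles
    sinos = [radon(xs[..., ch], theta=theta, circle=False)
             for ch in range(xs.shape[-1])]
    return np.concatenate([s.ravel() for s in sinos])
```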

The conventional red deer algorithm

The red deer (RD) algorithm is a recent nature-inspired optimization technique (Fathollahi-Fard et al., 2020) that belongs to the family of population-based metaheuristics. Its main advantage is that it maintains the exploitation and exploration phases equally, which helps assign the salient features with low complexity.

A red deer is either male or female (hind). A group of hinds is called a harem, and each harem is assigned a male commander. A competition is set among male RDs, via roaring and fighting, to win the harem with the most hinds. According to the strength of the roaring phase, male RDs are categorized into commanders and stags; only the strongest male, after a fierce fight with the other males, becomes the commander of the harem. Here, the \(\frac{mnc}{4}\) sparse encoded 1D features, each a feature vector of size \(\left(N\times 1\right)\), are considered a group of RDs. Figure 7 shows a flow chart of the RD algorithm. The main objective of the optimization problem is to determine a near-global or optimal solution evaluated with respect to the variables associated with the problem.

Fig. 7
figure 7

Flow chart of the main Red Deer algorithm for sinogramic feature selection

Stage 1: Initialize the population

At this stage, the sparse sinogramic features are used to initialize the red deer population. Among this population, the best features are chosen as the male red deer features, \({G}_{male}\), according to their fitness values, while the rest form the hind group, \({G}_{hind}\). The proposed objective (fitness) function combines the classification accuracy of a KNN classifier with the proportion of selected red deers (features) through a weighted sum as

$$f=\omega \cdot acc+\left(1-\omega \right)\frac{\vartheta }{\Gamma },$$
(8)

where \(acc\) denotes the classification accuracy of the currently selected red deers (features), \(\vartheta\) is the number of currently selected red deers, and \(\Gamma\) is the total number of features in the targeted feature pool. \(\omega \in \left[0,1\right]\) is a weighting coefficient.
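A minimal sketch of this fitness evaluation is shown below. Note one assumption: Eq. 8 as written adds the feature proportion \(\vartheta /\Gamma\), so for a maximized fitness we credit \(1-\vartheta /\Gamma\) instead, which rewards smaller feature subsets; the KNN settings and \(\omega\) are also assumed values:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, Y, labels, omega=0.9):
    """Weighted-sum fitness in the spirit of Eq. 8: KNN accuracy on the
    selected columns of the 1D feature pool Y plus a term that favors
    fewer selected features (sign convention assumed, see text)."""
    selected = np.flatnonzero(mask)          # mask is a binary selection vector
    if selected.size == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                          Y[:, selected], labels, cv=3).mean()
    return omega * acc + (1.0 - omega) * (1.0 - selected.size / Y.shape[1])
```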

Stage 2: Roaring phase

The male agents are currently the superior solutions. Roaring is a local search for neighboring better features. The update rule is as follows.

$${RD}_{male}^{new}=\left\{\begin{array}{ll}{RD}_{male}^{old}+{\alpha }_{1}\times \left(\left(u-\ell \right)\times {\alpha }_{2}+\ell \right),& \text{if } {\alpha }_{3}\ge 0.5,\\ {RD}_{male}^{old}-{\alpha }_{1}\times \left(\left(u-\ell \right)\times {\alpha }_{2}+\ell \right),& \text{if } {\alpha }_{3}<0.5,\end{array}\right.$$
(9)

where \({RD}_{male}^{new}\) and \({RD}_{male}^{old}\) are the current and previous positions of the male red deer solution, \(u\) and \(\ell\) are the upper and lower limits of the local search for neighboring solutions, and \({\alpha }_{1}\), \({\alpha }_{2}\), and \({\alpha }_{3}\) are random coefficients drawn from a uniform distribution on \([0,1]\).
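For illustration, a direct NumPy transcription of Eq. 9 (the function and argument names are ours):

```python
import numpy as np

def roar(male, lower, upper, rng=np.random.default_rng()):
    """Roaring update of Eq. 9: a local move of a male solution within
    [lower, upper], with alpha1, alpha2, alpha3 ~ U(0, 1)."""
    a1, a2, a3 = rng.random(3)
    step = a1 * ((upper - lower) * a2 + lower)
    return male + step if a3 >= 0.5 else male - step
```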

Then, the male red deer solutions are categorized into commander and stag solutions. The number of male commanders is taken as \({N}_{com}=\mathrm{round}\left(\delta \cdot {N}_{male}\right)\), where \(\delta\) is a random value in the range from 0 to 1, and the number of stags is \({N}_{stag}={N}_{male}-{N}_{com}\).

Stage 3: The fighting phase between stags and commanders

Here, every commander is allowed to fight with randomly chosen stags. In the solution space, the group solution of commanders \({G}_{com}\) approaches that of stags \({G}_{stag}\). Accordingly, two new groups of solutions are generated as

$${G}_{{new}_{1}}=\frac{{G}_{com}+{G}_{stag}}{2}+{\beta }_{1}\times \left(\left(u-\ell \right)\times {\beta }_{2}+\ell \right),$$
(10)
$${G}_{{new}_{2}}=\frac{{G}_{com}+{G}_{stag}}{2}-{\beta }_{1}\times \left(\left(u-\ell \right)\times {\beta }_{2}+\ell \right),$$
(11)

where \({G}_{{new}_{1}}\) and \({G}_{{new}_{2}}\) denote the new solutions generated by the fighting process, \(u\) and \(\ell\) are the limits of the search space, and \({\beta }_{1}\) and \({\beta }_{2}\) are random coefficients drawn from a uniform distribution on \([0,1]\). Among the four solutions \({G}_{com}\), \({G}_{stag}\), \({G}_{{new}_{1}}\), and \({G}_{{new}_{2}}\), the one with the best cost function value \(f\) is selected as the final commander.

Stage 4: Forming harems

Here, each newly assigned commander is responsible for forming a harem. A harem consists of a male commander and a group of female deer (hinds). The hinds are distributed into separate harems in a random manner, based on the power of the commander in roaring and fighting, i.e., its fitness value. Hence, the number of hinds in harem \(i\) is calculated as

$${N}_{harem}^{i}=\mathrm{round}\left({f}_{i}\cdot {N}_{hind}\right),$$
(12)

where \({f}_{i}\) denotes the normalized power (fitness value) of commander \(i\) and \({N}_{hind}\) is the total number of hinds.

Stage 5: Mating phase

After forming the harems, there are three mating possibilities. First, the commander of harem \(i\) mates with a proportion \(\rho\) of its harem's hinds. Second, the commander mates with a proportion \(\vartheta\) of the hinds of other harems; commanders attack other harems to expand their command area. Third, each stag mates with the closest hind, regardless of harem restrictions. Through the mating phase, new offspring RDs (solutions), \({G}_{{\text{OS}}}\), are generated as

$${G}_{{\text{OS}}}=\frac{{G}_{com}+{G}_{hind}}{2}+\theta \times \left(u-\ell \right),$$
(13)

where \({G}_{com}\) and \({G}_{hind}\) are the groups of solutions representing commanders and hinds, and \(\theta\) is a random number between 0 and 1. In the third mating possibility, \({G}_{com}\) is replaced by the stag solutions \({G}_{stag}\).

Algorithm 2.
figure b

Detailed steps about the proposed enhanced red deer algorithm (ERD)

Finally, the next generation is assembled from all the best commanders (a certain percentage of the best solutions) and the best hinds from all hinds and offspring, selected via a roulette wheel. These stages are repeated until the maximum number of iterations (Maxiter) is reached and the optimal features are defined.

The proposed enhanced red deer optimizer (ERD)

The main idea of the proposed ERD is a tinkering strategy that enhances the traditional mating phase of the conventional red deer optimizer in Eq. 13. The tinkering strategy moves members from the worst harems, i.e., those with the worst fitness values, toward the "best-so-far" harems, which have the best fitness values. This strategy is applied based on the degree of diversity among harems.

The tinkering strategy

For the tinkering strategy, we need to define the worst harems \(U\) and the best harems \(Q\) according to their fitness values. The worst of the worst hinds of the \(U\) group tinker the best harems \(Q\), i.e., they join their original hinds. However, this tinkering strategy may lead to entrapment in local minima. Accordingly, we need to control the targeted number of worst harems, as follows.

(14)

where the first symbol denotes the total number of constructed harems, and \({F}_{harem}^{worst}\) and \({F}_{harem}^{best}\) denote the overall worst and best fitness values, respectively. Ï is the current iteration number out of the maximum assigned number of iterations, \(maxiter\), and ṋ is a fixed number of harems to be updated within each iteration.

After determining the worst harems, the best harems are assigned. The hinds from the worst harems are then distributed equally, in number, among the best harems, while the best of the best harems is assigned the worst hinds in fitness, simulating the effect of mutation in metaheuristic algorithms. The harems are now reformulated according to the tinkering strategy, and in the mating phase the new offspring RDs, \({G}_{{\text{OS}}}\), are generated as

$${G}_{{\text{OS}}}=\left\{\begin{array}{ll}\frac{{G}_{com}+{G}_{hind}}{2}+\theta \times \left(u-\ell \right),& \text{if } v>\zeta ,\\ \frac{{G}_{com}+{G}_{hind}+{G}_{tink}}{3}+\theta \times \left(u-\ell \right),& \text{otherwise},\end{array}\right.$$
(15)

where \({G}_{tink}\) denotes the \(\xi \%\) of new tinkering hinds, and \(v\) is the diversity in fitness between the tinkered harem and the tinkering harem, defined as

(16)

where \({f}_{{harem}_{i}}\) denotes the set of fitness values of the hinds in harem \(i\). Algorithm 2 summarizes the main steps of the proposed ERD.
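A direct transcription of the tinkered mating rule of Eq. 15 reads as follows; the averaging of \(G_{tink}\) into the offspring follows the equation, while the default value of `zeta` is an assumption:

```python
import numpy as np

def tinkered_offspring(g_com, g_hind, g_tink, lower, upper, v, zeta=0.5,
                       rng=np.random.default_rng()):
    """Tinkered mating of Eq. 15: when the fitness diversity v between the
    tinkered and tinkering harems is at most zeta, the tinkering hinds
    G_tink contribute to the offspring."""
    theta = rng.random()
    if v > zeta:
        return (g_com + g_hind) / 2.0 + theta * (upper - lower)
    return (g_com + g_hind + g_tink) / 3.0 + theta * (upper - lower)
```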

The proposed 1DCNN classification model

For the final prediction stage, which differentiates between the different fault types in the wafer map, we propose a new 1DCNN-based network that receives the resultant feature pool from the proposed ERD algorithm. A 1DCNN is a specific type of CNN designed to operate on one-dimensional signals; it employs one-dimensional convolution and sub-sampling layers for feature mapping and extraction. Following the majority of CNNs, a basic 1DCNN is formed from an input layer, a CNN layer group (a convolution layer and a pooling layer), a fully connected layer, and an output layer. The vector produced by each convolutional, activation, or pooling layer can be considered a one-dimensional feature vector for the targeted 1D task. For the configuration of the proposed 1DCNN network, see Fig. 8.
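To illustrate the general layout (not the exact proposed configuration, which is specified in Fig. 8), a generic Keras 1DCNN over a selected feature vector might look as follows; the filter counts and dense width are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_1dcnn(n_features, n_classes):
    """Basic 1DCNN: input -> Conv1D/pooling group -> fully connected -> softmax."""
    inp = layers.Input(shape=(n_features, 1))
    x = layers.Conv1D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(64, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_1dcnn(n_features=15, n_classes=9)  # e.g., the 33x29 set's 15 features
```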

Fig. 8
figure 8

The proposed 1DCNN for the final classification stage of wafer map fault types

Experimental results and discussion

In this section, different experiments are performed to test the performance of the proposed wafer map fault detection model. Intensive visual and quantitative comparisons are introduced, together with an ablation study indicating the impact of the different steps of the introduced model.

Experimental setup. All experiments were performed in the Google Colab Pro environment using different Python libraries, such as Keras, NumPy, TensorFlow, and Sklearn. For the feature reduction stages, see Table 3 for the configuration of the SBAE and Algorithm 2 for the parameters of the proposed tinkered red deer optimization (ERD). For the final prediction stage, we used the Adam optimizer with its default parameters to find the optimum weights, with the learning rate set to 0.0001. The 1DCNN is trained with batches of 32 due to memory limitations. For model assessment (see Fig. 9), different metrics are used: accuracy, precision, recall, and F1-score.
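For reference, these metrics can be computed with scikit-learn as follows; macro averaging is an assumed choice for the multi-class setting:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluation_report(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score (see Fig. 9),
    macro-averaged over the fault classes."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```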

Fig. 9
figure 9

The employed metrics in the evaluation process of fault detection

Comparison to the state of the art

In Table 4, a comparison is made among different wafer map fault detection methods (Jin et al., 2020; Chen et al., 2021; Yu et al., 2021a), DenseNet-GCF (Yu and Chen et al., 2022; Wang et al., 2021, 2022; Zheng et al., 2021), WDP-BNN (Alqudah et al., 2023; Zhang et al., 2022), and the proposed method. All these competitors are based on deep neural networks with supportive modules to enhance detection performance. Most of them employ cross-validation (Jin et al., 2020; Yu et al., 2021b; Chen et al., 2022) or just a train–test evaluation (Chen et al., 2021; Yu et al., 2021a; Xuen et al., 2022; Wang et al., 2022; Zhang et al., 2022). Following cross-validation, Chen et al. (2022) achieved a high accuracy of 98.34% for nine fault classes, but with a very complicated model of two 2DCNNs employing more than 53,000 samples. On the other hand, Jin et al. (2020) reported 98.43% accuracy, but for 8 fault types, employing a hybrid model of a 2DCNN with error-correcting output codes and support vector machines for classification on 20,000 samples.

Table 4 Quantitative comparison between the state-of-the-art methods and the proposed detection method

The other type of evaluation, the train–valid split, is fragile because it can easily lead to overfitting and does not guarantee that the model will perform well on unseen data from the same distribution. Wang et al. (2022), with a 95:5 split for 9 fault types on 13,435 samples, reported 98.35% accuracy but with a complicated model of a 2DCNN with residual blocks; residual blocks can also make networks more prone to overfitting. Chen et al. (2022), with an 80:20 split for 8 fault types on 33,256 samples, obtained 96% accuracy with a very complicated model of two 2DCNNs with error-correcting output codes and support vector machines for classification.

In Table 4, two methods, Zheng et al. (2021) and Alqudah et al. (2023), target the same evaluation and feature reduction concepts as ours. Zheng et al. (2021) follow a train–validation–test evaluation (60:20:20) with their own 2DCNN classifier and report an accuracy of 93.8% on a sample size of 4000 for 9 fault types; this accuracy is adequate given the limited sample size and the lack of preprocessing steps. The proposed model achieves 98.77% and 99.24% accuracy for 9 and 8 fault types, respectively, under 60:20:20 evaluation with around 16,000–18,000 samples.

Alqudah et al. (2023) followed the concept of feature reduction, but their features are extracted from the 2D wafer maps, contrary to the proposed method, which extracts them from 1D sinograms. Alqudah et al. (2023) utilized 59 feature bases in total (density-, Radon-, and geometry-based features) and reported an accuracy of 82.7%, despite using around 29,000 samples, because they employed a simple SVM classifier. In the proposed fault detection method, three different train–validation–test evaluations are introduced: 60:20:20, 36:15:49, and 23:15:62. We obtained a minimum accuracy of 98.5% when the detection model is tested with 62% unseen data for 9 fault types within around 18,000 samples and 15 features. On the other side, we gained 98.61% accuracy for the 62% unseen part for 8 fault types within around 16,000 samples and 19 features. In addition, with the proposed method, the accuracy changed only from 98.77 to 98.50% and from 99.24 to 98.61% when the unseen testing part changed from 20 to 62% for 9 and 8 fault types, respectively. This very small change proves that the proposed model generalizes well by exploiting the most suitable discriminative features through the assigned feature engineering steps (SBAE and ERD). For detailed classification reports, confusion matrices, and train–validation performance of the proposed detection method for 9 fault types, check Figs. 10, 11, and 12. From Figs. 10 and 11, we can see that the "none" fault type has the lowest metrics; it is the most confusing class, as it shares similar structures (features) with other faults (check the fault images in Table 2). In addition, the "none" class is totally unbalanced with respect to the other faults, so many studies (Alqudah et al., 2023; Chen et al., 2021; Jin et al., 2020) ignore this fault to avoid additional preprocessing steps. Figure 12 shows the train–validation performance, whose near-identical curves refute the chance of overfitting.

Fig. 10
figure 10

The confusion matrices of the proposed wafer map fault detection model for 60:20:20 evaluation (ours1) in a, 36:15:49 evaluation (ours2) in b, and 23:15:62 evaluation (ours3) in c

Fig. 11
figure 11

The corresponding classification reports of the confusion matrices in Fig. 10 of the proposed wafer map fault detection model for 60:20:20 evaluation (ours1) in a, 36:15:49 evaluation (ours2) in b, and 23:15:62 evaluation (ours3) in c

Fig. 12
figure 12

Train–validation performance over epochs of the proposed wafer map fault detection in terms of accuracy and losses for 60:20:20 evaluation (ours1) in a, 36:15:49 evaluation (ours2) in b, and 23:15:62 evaluation (ours3) in c

The impact of wafer map resolution in performance

The WM-811K dataset originally contains wafer maps of different sizes, 632 in total, varying in resolution from 6 × 21 to 300 × 202. Figure 13 shows the resolution distribution in terms of the number of wafer maps. Only 5 resolutions have more than 10,000 samples each; the largest among them is \(39\times 37\). As the wafer maps are not rich in color, i.e., each map contains only three colors, initial preprocessing steps to adjust the size can easily distort the fault type. Hence, different experiments are performed on the 7 resolutions that have more than 8000 samples. The sample counts for each resolution are given in Table 5 before and after (W/O and W/) the CAE-based synthetization model used for balancing. The "All" wafer map set groups the seven other sets and maps them to a resolution of 26 × 26. As shown in Table 5, the "none" fault type has the largest number of maps in each resolution. In addition, the resolution \(33\times 29\) is the most challenging, as it has the fewest samples in each fault type.

Fig. 13
figure 13

The resolution distribution of the maps in WM-811K

Table 5 The number of samples under each employed wafer map resolution, from the labeled maps, before and after the CAE-based synthetization model

Having different resolutions affects the structure of the fault types and, accordingly, the discriminative features, which in turn affects the recognition rate of each class and hence the final metrics. Table 6 shows the performance of different resolutions under different split ratios. As demonstrated, two resolutions have only 8 fault types (classes), \(25\times 27\) and \(27\times 25\), but they do not contain the same fault types: the resolution \(25\times 27\) misses the "Donut" fault, while \(27\times 25\) misses the "Edge-Ring" fault (see Table 4). The other resolutions have 9 fault types. In the 8-class category, the resolution \(27\times 25\) outperforms \(25\times 27\): with only 19 features, \(27\times 25\) provides 99.24% accuracy with 20% testing and 98.61% with 62% testing, versus 98.36% and 94.82% for \(25\times 27\) with 22 features. The accuracy thus changes only slightly from 99.24 to 98.61% (very good generalization) for \(27\times 25\) but sharply from 98.36 to 94.82% (good generalization) for \(25\times 27\) when the unseen part grows from 20 to 62%. As indicated by the classification reports in Fig. 14, the recognition metrics of the "Loc" and "none" classes degrade slightly at resolution \(27\times 25\). On the other side, at resolution \(25\times 27\), the recognition metrics of three classes degrade sharply, i.e., "none", "Loc", and "Edge-Loc", while recognition of "Center" degrades slightly. The main cause of degradation in recognition metrics, which affects the grouped performance, is that the extracted discriminative features fail to recognize these classes as effectively as the others because they share similar structures. See Fig. 15 for sample wafers of "none", "Loc", "Edge-Loc", and "Center".

Table 6 Quantitative comparison between the performance of different wafer map resolutions at different split ratios
Fig. 14
figure 14

Visual comparison between different resolutions of wafer maps in terms of the classification report, \(27\times 25\) (8 classes) in the first row and \(25\times 27\) (8 classes) in the second row. The first column at 60:20:20 while the second at 23:15:62 (train–validation–test) evaluation

Fig. 15
figure 15

Samples of wafer maps from fault types: “none”, “Loc”, “Edge-Loc”, and “Center”

For the resolution sets with 9 fault types, as presented in Table 6, the resolution \(33\times 29\) comes first, with the minimum feature size (15) and the best performance in terms of the recognition metrics regardless of the split ratio; it also generalizes well when the unseen part grows from 20 to 62%. The resolution \(39\times 37\) is the worst, with the largest feature size (27), the lowest metrics, and the weakest generalization. Figure 16 shows the classification reports of the resolutions \(33\times 29\) and \(39\times 37\) at 20% and 62% unseen parts. As demonstrated, the recognition metrics of "none", "Loc", and "Edge-Loc" are high for the \(33\times 29\) set, which means its fault shapes differ discriminatively from each other, contrary to the \(39\times 37\) set. This means the recognition rate depends not on higher-resolution maps but on the fault type structure: the \(39\times 37\) set has more complicated, less discriminative fault types that are not as efficiently recognizable as those of the \(33\times 29\) set. For more results on other resolution sets, see Table 10 in "Appendix".

Fig. 16
figure 16

Visual comparison between different resolutions of wafer maps in terms of the classification report, \(33\times 29\) (9 classes, 15 features) in the first row and \(39\times 37\) (9 classes, 27 features) in the second row. The first column at 60:20:20 while the second at 23:15:62 (train–validation–test) evaluation

The impact of feature engineering steps

In the proposed fault detection method, two main feature engineering techniques are used after the data balancing step. The first is sparse feature learning and encoding by the proposed SBAE, where the resultant sparse encoded feature maps are extracted from the SBAE bottleneck. Inducing sparse regularization in a traditional convolutional autoencoder enhances the reconstruction process, so an efficient sparse embedding can be obtained at the SBAE bottleneck. Figure 17 compares the performance of the SBAE with a traditional convolutional autoencoder (CAE) of the same configuration but without the sparsity regularization. As shown, without the induced sparsity, the wafer maps reconstructed by the CAE are blurry and lack fine details, contrary to the SBAE. The second feature engineering step applies the proposed tinkered red deer feature ranking (ERD) to the 1D sinograms of the sparse encoded features from the SBAE bottleneck. Figure 18 compares the convergence of fitness over iterations between the conventional red deer optimization and the proposed ERD; the better convergence and higher fitness belong to the proposed ERD.

Fig. 17
figure 17

Comparison between the performance of SBAE and CAE

Fig. 18
figure 18

The convergence of fitness over iterations for the traditional red deer in a and the tinkered version in b

The impact of the aforementioned feature engineering steps on performance can be checked across four conditions: the first uses no feature engineering step, the second and third adopt only one step (SBAE or ERD), and the fourth applies both. These conditions are compared quantitatively in Table 7. As shown, the worst performance belongs to the first condition (W/O ERD, W/O SBAE), despite having the largest 1D feature pool: the redundant information in this large pool limits the performance of the proposed 1DCNN classifier. The wafer maps most affected by the absence of the feature engineering steps are the "\(33\times 29\)" set, while the least affected are the "\(30\times 34\)" set, which means the fault type structures of the \(33\times 29\) set carry more redundant information than those of the "\(30\times 34\)" set. The second and third conditions of applying one feature engineering step, i.e., (W/ ERD, W/O SBAE) and (W/O ERD, W/ SBAE), respectively, perform better than the first condition with a smaller 1D feature pool.

Table 7 Comparison between the different cases of applying the adopted feature engineering steps, i.e., ERD and SBAE

As pointed out, in (W/ ERD, W/O SBAE), ranking the original features without the sparse learning of SBAE increases the number of features selected by the proposed ERD algorithm as it seeks the globally optimal feature ranking. The "\(30\times 34\)" set has the largest feature pool but high accuracy, while the "\(33\times 29\)" set has the smallest feature pool but low accuracy. On the other hand, relying on the sparsity-boosted features from the SBAE bottleneck without ERD, the third condition (W/O ERD, W/ SBAE), yields a fixed-size 1D feature pool that performs adequately on some sets, like \(27\times 25\) and \(33\times 29\), but fails on others, like \(26\times 26\) and \(30\times 34\). Adequate performance means the extracted features guarantee suitable discrimination between classes, while failing means the features are not discriminative enough. Finally, the fourth condition (W/ ERD, W/ SBAE) achieves the smallest 1D feature pool sizes with the best performance: the sparsity induced in the SBAE yields sparser fine details that simplify ERD's task of finding the global optimum with the fewest features. For more quantitative results of the four conditions at other resolutions, see Table 11 in "Appendix". Figure 19 presents the classification reports of the "\(26\times 26\)" map set under the four conditions, and Fig. 20 shows the corresponding train–validation performance over epochs; a small gap between training and validation means good generalization, while a large gap means poor generalization.

Fig. 19
figure 19

Visual classification reports comparison of different conditions of applying the feature engineering steps, i.e., SBAE and ERD, to wafer maps of size \(26\times 26\); a W/O ERD and W/O SBAE (676 features), b W/ ERD and W/O SBAE (73 features), c W/O ERD and W/ SBAE (169 features), and d W/ ERD and W/ SBAE (22 features)

Fig. 20
figure 20

Train–validation performance of different conditions of applying the feature engineering steps, i.e., SBAE and ERD, to wafer maps of size \(26\times 26\); a W/O ERD and W/O SBAE (676 features), b W/ ERD and W/O SBAE (73 features), c W/O ERD and W/ SBAE (169 features), and d W/ ERD and W/ SBAE (22 features)

The impact of the proposed ERD compared to other metaheuristic algorithms

Here, the impact of the enhanced tinkered red deer algorithm (ERD) is discussed. Table 8 compares different metaheuristic algorithms with the proposed ERD. A metaheuristic algorithm (Abdel-Basset et al., 2018) is a high-level procedure or heuristic designed to find, generate, tune, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimization problem. In this comparison, Genetic (GA; Sohail, 2023), Equilibrium (EO; Altantawy & Kishk, 2023; Houssein et al., 2022), Grey Wolf (GWO; Faris et al., 2018), Sine Cosine (SCA; Zhou et al., 2022), and Particle Swarm (PSO; Shami et al., 2022) algorithms are utilized. As demonstrated in Table 8, all the assigned metaheuristic algorithms provide superior performance in fault type prediction, but the proposed ERD yields by far the smallest feature pool size at approximately the same accuracy. For the 8-fault-type detection problem, the average drop in accuracy from the best performer is 0.63%, while for the 9-fault-type problem it is 1.22%. In the "All" set, the accuracy degrades slightly compared to the other sets because resizing the wafer maps to a fixed common size distorts the shape of the fault patterns. Table 8 presents the results for the resolution sets \(26\times 26\), \(27\times 25\), and All; for more results on other resolution sets, see Table 12 in "Appendix". For a fair comparison in Tables 8 and 12, we used the same population size and sought the parameters that maintain the best possible fitness score for all employed metaheuristic algorithms.

Table 8 Comparison of different common metaheuristic algorithms for the proposed fault detection in wafer maps

The main reason the proposed ERD selects fewer features is its general tendency to emphasize exploitation more than other metaheuristic algorithms, such as GA, PSO, EO, GWO, and SCA. This emphasis on exploitation arises from the mating behavior of red deer, where the dominant stag (leader) mates with the most fertile hinds. This mechanism symbolizes the selection of features with higher fitness, driving the algorithm toward refining a smaller set of relevant features. In contrast, algorithms like GA and PSO tend to explore the search space more extensively, potentially selecting a larger number of features, because their mechanisms encourage the exploration of diverse solutions and the recombination of features, which can introduce less relevant features into the selected set. While ERD's focus on exploitation can be beneficial in reducing the risk of overfitting, it may also limit the algorithm's ability to capture complex relationships between features, which could affect its performance on datasets where such relationships are crucial for accurate prediction or classification.

The impact of the classification stage

For the prediction stage, different common 1D deep networks (Kiranyaz et al., 2021) are tested against the proposed 1DCNN: 1D-VGG16, 1D-ResNet50, 1D-LeNet-5, and 1D-Inception. Table 9 compares these networks. As demonstrated, 1D-VGG16 provides the best average accuracy of 98.27%, but with three times as many parameters as the proposed 1DCNN, which comes second with an average accuracy of 98.08%. The most parameters and the worst performance belong to 1D-ResNet50 (16 M, 96.68%), while the fewest parameters, 16 K, belong to 1D-Inception, with an average accuracy of 97.23%. The introduced 1D-LeNet-5 shows an average accuracy of 97.96% with 240 K parameters. For more quantitative results, see Table 13 in "Appendix". Figure 21 gives a visual comparison between the aforementioned classifiers in terms of the classification report; the main difference between classifiers affecting the total performance lies in the recognition metrics of "none", "Loc", and "Edge-Loc", the most difficult fault types to recognize. In Fig. 22, the training and validation losses are shown over epochs. As shown, the proposed 1DCNN and 1D-LeNet-5 demonstrate the lowest losses and the smallest gap between validation and training, indicating good generalization.

Table 9 Comparison of different common 1D CNNs for the proposed fault detection in wafer maps
Fig. 21
figure 21

Visual classification reports of different 1D deep classifiers for the prediction of fault type in wafer maps with resolution of \(26\times 26\). a 1D-VGG-16, b 1D-ResNet50, c 1D-LeNet-5, d 1D-Inception, and e the proposed 1DCNN

Fig. 22
figure 22

Training and validation losses over epochs of different 1D deep classifiers for the prediction of fault type in wafer maps with resolution of \(26\times 26\). a 1D-VGG-16, b 1D-ResNet50, c 1D-LeNet-5, d 1D-Inception, and e the proposed 1DCNN

Conclusion

In this paper, a hybrid deep model for fault type prediction in wafer maps is proposed. The proposed model targets three objectives. The first is overcoming the highly imbalanced dataset. The second is obtaining more discriminative, reduced features in 1D form instead of the 2D form of the original wafer maps. The third is an effective classifier that achieves an adequate balance between accuracy and complexity. For the first objective, a new unsupervised CAE-based synthetization model is proposed, which succeeds in reconstructing the inserted wafer maps with a very low loss of 0.0011. For the second objective of more discriminative features, a new sparsity-boosted autoencoder (SBAE) is first proposed to obtain sparse encoded maps with a 50% reduction in spatial size compared to the original maps; then, an enhanced tinkered red deer optimization (ERD) is applied to 1D sinograms of the obtained sparse maps to produce an average final 1D feature pool of ~ 25 feature bases (~ 1.5% of the original maps). In an ablation study, the adopted feature engineering steps prove effective at finding the fewest possible feature bases with high accuracy, especially when compared to other metaheuristic algorithms, such as Genetic (GA), Equilibrium (EO), Grey Wolf (GWO), Sine Cosine (SCA), and Particle Swarm (PSO) algorithms. For the third objective, a new 1DCNN model is proposed for the 9- and 8-fault-type prediction, achieving an average accuracy of 98.1% with 180 K trainable parameters. The proposed 1DCNN is compared with other common 1DCNNs, such as 1D-VGG16 (98.26%, 590 K), 1D-ResNet50 (96.7%, 16 M), 1D-LeNet-5 (98%, 240 K), and 1D-Inception (97.23%, 16 K). Despite its achievements, the proposed wafer map inspection model, being a hybrid deep model, still increases the computational cost of the inspection procedure. Accordingly, as future work, we intend to reduce the complexity of the feature engineering steps and to extend the proposed detection model to other classification problems in wafer maps and to other datasets.