1 Introduction

Nanoscience, a burgeoning research field focused on the exploration of new materials and the characterization of their microscopic properties, has witnessed remarkable growth [1, 2]. This expansion has led to the development of nano-sized particle imaging systems tailored for materials science applications [3]. Among the various tools utilized by scientists in this domain, the scanning electron microscope (SEM) holds a prominent position, enabling the visualization of surface topography and composition for samples of interest [4, 5]. Conventionally, these acquired SEM images are manually annotated by laboratory scientists and subsequently archived in dedicated data repositories [6]. Recognizing the need to facilitate the sharing of the ever-increasing volume of SEM images, the Nanoscience Foundries & Fine Analysis (NFFA)–EUROPE project, a distributed research infrastructure spanning Europe, has established the Information and Data Repository Platform (IDRP) [7]. The IDRP serves as a centralized entry point, ensuring harmonized data policies and facilitating access to SEM data for the scientific community [7]. For effective data sharing, the data must be findable, accessible, interoperable, and reusable [8]. Consequently, it becomes evident that automated classification of SEM images, integrated into the data warehouse as a complementary curation function, is essential for the IDRP or any other significant SEM dataset [6, 9].

The proliferation of SEM images capturing various types and sizes of composite materials has presented challenges in their classification, prompting the application of machine learning techniques [6, 10, 11]. In [11], deep learning techniques were employed for nanoparticle detection, while [10] utilized machine learning methods for mineral classification. A U-Net method was adopted to classify scanning electron microscopy-energy-dispersive X-Ray spectroscopy (SEM–EDS) images, achieving an F1-micro score of 88.32% across 12 classes. Ge et al. [12] presented a deep learning model in a review paper that leveraged the computer vision capabilities of convolutional neural networks to extract morphology, distribution, and intensity information from microscopic images. Modarres et al. [9] explored transfer learning in a deep learning-based SEM image classification model, utilizing four pre-trained convolutional neural networks (Inception-slim, Inception-v3, Inception-v4, and ResNet) from the ImageNet1k dataset [13]. These networks were employed to extract features from an SEM dataset comprising 18,577 images distributed across ten classes [6]. The Inception-v3 model achieved approximately 90% accuracy. Transfer learning, in this context, involves fixing the pre-trained initial layers of the convolutional neural networks and training only the final few layers on the target dataset to learn specific features. Consequently, feature extraction becomes less computationally intensive compared to training an entirely new network. Additionally, computer vision techniques have gained widespread adoption for automated image classification tasks in recent years [14]. In [15], a multilayer perceptron with backpropagation training algorithm was employed to automatically segment and classify high-resolution micrographs of cast iron images for non-destructive testing, yielding results comparable to manual human visual classification. 
Similarly, in [16], a “bag of features” approach was utilized to construct microstructural signatures for classifying 105 micrographs of metallic materials based on similarity matching with local image patterns, achieving an accuracy of 83%. The success of computer vision techniques in automated microstructural analysis has stimulated efforts to explore their applicability in SEM image classification. Osenberg et al. [17] introduced a feature engineering approach for the classification of SEM images. They employed threshold-based models to extract features and subsequently utilized random forest (RF) classifiers to classify the selected features. Their method achieved an impressive classification accuracy of 94%. However, the authors did not provide a comprehensive presentation of their results. In related work, Han et al. [18] proposed a novel synthetic image generation model. They further introduced an attention-based convolutional neural network (CNN) architecture by incorporating two pooling functions and multiplicative operations. Moreover, they addressed the issue of vanishing gradients by employing residual blocks. Their model achieved a higher accuracy of 95% and was compared against well-known CNN architectures such as MobileNet, VGG16, and ResNet50. Nonetheless, the authors did not explore the utilization of high-performing CNN models, such as DenseNets or EfficientNets. Dahy et al. [19] proposed a feature selection model utilizing a metaheuristic optimization technique. They applied their feature selector to the deep features extracted from SEM images. Unfortunately, their model lacked novelty, as they solely focused on evaluating the performance of their proposed feature selector. Moreover, it is worth noting that metaheuristic optimization-based feature selectors tend to exhibit high time complexity. 
Scott-Fordsmand and Amorim [20] extensively discussed the impact of leveraging machine learning models for the automatic classification of nanomaterials. The authors emphasized the significance of this field, as it directly influences various aspects of human life. However, the paper did not present any proposed models or classification results.

1.1 Motivation and the proposed model

SEM images play a crucial role in the field of material sciences. To reduce classification costs, machine learning models have gained popularity for automating the classification of SEM images [6, 10, 11]. In this study, we propose a feature engineering model that combines patch division techniques inspired by computer vision approaches [21] with transfer learning using a pre-trained convolutional neural network.

Traditional fixed-size patch division models, such as the vision transformer [21] and the multilayer perceptron-mixer (MLP-Mixer) [22], have demonstrated impressive classification performance. However, their utility has been limited by the high dimensionality of the extracted feature vectors. To address this limitation, we introduce a novel nested patch division method that divides input images into non-fixed size patches. This approach reduces the number of patches required to cover the entire image and enhances computational efficiency compared to standard fixed-size patch division methods.

Our inspiration for feature extraction stems from the successes achieved by pre-trained convolutional neural networks and patch-based models. However, standard fixed-size patch-based models often impose significant computational burdens. To mitigate this challenge, we employ a nested patch division model that necessitates fewer patches to encompass the entire input image. Specifically, we utilize DenseNet201 as our feature generator, which is a 201-layer convolutional neural network pre-trained on ImageNet1K [13]. Accordingly, our proposed model is named NFSDense201.

In our approach, we employ iterative neighborhood component analysis (INCA) as the feature selector, followed by a support vector machine (SVM) as the classifier. This combination allows us to effectively extract discriminative features from the merged feature vector obtained by concatenating the features extracted from the nested patches. The SVM then performs the final classification task.

By integrating these components, our NFSDense201 model offers a comprehensive solution for SEM image classification. The proposed nested patch division, feature extraction using DenseNet201, and the combination of INCA and SVM collectively contribute to the model's effectiveness in classifying SEM images.

1.2 Novelties and Contributions

The contributions of this work are outlined below, highlighting the key aspects of our approach:

  • Local Image Feature Extraction: In computer vision, fixed-size patches are commonly employed for extracting local image features. However, this often leads to high-dimensional feature vectors. To address this, we introduce a novel nested patch division method, enabling comprehensive coverage of the input image with fewer non-fixed size patches. This approach effectively reduces the dimensionality of the extracted feature vectors while preserving the necessary information.

  • Transfer Learning with DenseNet201: Training image classification models from scratch using unseen images can be computationally demanding. In our study, we leverage the pre-trained DenseNet201 architecture, which was originally trained on the ImageNet1k dataset [23]. By utilizing the already learned features in the initial layers of DenseNet201, we focus on training the last layers specifically for SEM image feature extraction. This strategy significantly reduces the training time required for the model.

  • Efficient Feature Selection with INCA: NFSDense201 generates redundant features due to the parallel extraction of features from multiple overlapping nested non-fixed size patches. To address this issue, we employ iterative neighborhood component analysis (INCA) to efficiently filter out redundant features. This process results in a highly condensed selected feature vector that contains the most discriminative features [24]. By iterating the feature selection function during the learning process, we determine the optimal length of the selected feature vector specific to our study dataset.

These contributions collectively make NFSDense201 a highly efficient SEM image classification model. We validate our model through extensive training and testing on a large-scale SEM image dataset. Leveraging a standard shallow support vector machine (SVM) classifier, our model achieves excellent classification performance. These results provide strong justification for our design decisions, specifically incorporating the learning of non-fixed size nested patches using a pre-trained deep network into our SEM image classification model.

2 Dataset

The open-access SEM image dataset [6], comprising a total of 21,272 images, was utilized in our study. These images were previously annotated by domain experts and categorized into ten distinct classes: “biological,” “fibers,” “film-coated surfaces,” “MEMS devices and electrodes,” “nanowires,” “particles,” “pattern surfaces,” “porous sponge,” “powder,” and “tips,” encompassing 972, 162, 326, 4590, 3820, 3925, 4755, 181, 917, and 1624 JPEG images, respectively. To ensure consistency, we uniformly resized all images to dimensions of 1024 × 768 pixels.

3 NFSDense201 model for SEM image classification

Our classification model comprised the following sequence of operations: image resizing, nested patch division, hybrid deep feature extraction, feature vector concatenation, feature selection, and classification (Fig. 1). The steps are detailed as follows:

  • Step 1 Resize each SEM image to 224 × 224 pixels.

  • Step 2 Apply nested patch division to the resized images.

  • Step 3 Extract features by using two layers of the pre-trained DenseNet201.

  • Step 4 Merge the generated feature vectors.

  • Step 5 Choose the best features by applying INCA.

  • Step 6 Classify the selected features using SVM with a 75:25 split ratio.

Fig. 1 Schema of the NFSDense201 SEM image classification model. Here, p: patches, f: feature vectors

Explanations of the individual steps are detailed in the following subsections.

Figure 1 illustrates the architecture of the NFSDense201 SEM image classification model. The input image was divided into four nested patches, each serving as a distinct input to the pre-trained DenseNet201. Through the utilization of the network's global average pooling and fully connected layers, two deep feature vectors were extracted from each patch, resulting in a total of eight (= 4 × 2) feature vectors per input image. These feature vectors were subsequently concatenated to form a merged feature vector.

To enhance discriminative power, the iterative neighborhood component analysis (INCA) technique was employed. INCA iteratively evaluated the loss value and selected the most significant features during the learning process. This iterative feature selection process enabled the generation of a final feature vector of optimal length that was specific to the dataset under consideration.

The resulting final feature vector was then fed into a standard support vector machine (SVM) classifier for classification. The SVM utilized the extracted features to assign the input image to one of the predefined classes. This sequential process of feature extraction, iterative feature selection, and classification formed the core of the NFSDense201 model's functionality.

3.1 Feature extraction

To preprocess the SEM images from the dataset, a resizing operation was performed, resulting in images of size 224 × 224, which aligns with the dimensions used in the vision transformer [21]. This image size was also chosen as the input dimension for the DenseNet201 model. A nested patch division approach was employed to create four patches with incremental dimensions: 56 × 56, 112 × 112, 168 × 168, and 224 × 224 (the last patch being identical to the input image) (refer to Fig. 2). Each patch was then fed into the DenseNet201 model for inductive-based feature extraction.

Fig. 2 Nested (non-fixed size) patch division of a sample input SEM image that had been resized to 224 × 224 (top panel). Defining the initializing unit for patch division as 56, four (= 224/56) non-fixed size patches that were centered on the input image were created with the following incremental dimensions: 56 × 56, 112 × 112, 168 × 168, and 224 × 224 (bottom panel). The fourth patch had the same dimensions as the input image, which allowed for global feature extraction, in addition to local feature extraction from the first three smaller patches
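The nested patch division described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the function names `nested_patch_boxes` and `crop` are hypothetical, and the sketch assumes square images whose side is a multiple of the initializing unit and patches centered on the image, as stated in the caption of Fig. 2.

```python
def nested_patch_boxes(image_size=224, unit=56):
    """Return (top, left, bottom, right) boxes of centered, nested square patches.

    Hypothetical helper: assumes a square image whose side is a multiple of `unit`.
    """
    if image_size % unit != 0:
        raise ValueError("image_size must be a multiple of unit")
    boxes = []
    for k in range(1, image_size // unit + 1):
        side = k * unit                    # 56, 112, 168, 224 for the defaults
        start = (image_size - side) // 2   # offset that centers the patch
        boxes.append((start, start, start + side, start + side))
    return boxes


def crop(image, box):
    """Crop a 2-D image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Dummy 224 x 224 grayscale image used only to exercise the functions.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = [crop(image, box) for box in nested_patch_boxes()]
for patch in patches:
    print(len(patch), "x", len(patch[0]))
```

With the default arguments, four patches are produced, and the last box covers the whole image, which matches the role of the fourth patch as the source of global features.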

The DenseNet201 architecture, which had been pretrained on the ImageNet1K database, containing approximately one million images across 1000 classes, was utilized to extract local features from each patch. Specifically, the last fully connected layer (fc1000) and the global average pooling layer (avg_pool) of the DenseNet201 network were employed to generate two deep feature vectors of lengths 1000 and 1920, respectively. The global features were extracted from the last patch, which was identical to the input image, while the local features were extracted from the first three smaller patches.

The resulting feature vectors from the four patches per SEM image were concatenated to form a merged feature vector with a length of 11,680 (= [1000 + 1920] × 4). This merged feature vector captured the combined information from the different patches. The steps involved in this process are summarized as follows:

  1. Perform image resizing to obtain images of size 224 × 224.

  2. Utilize a nested patch division algorithm (Algorithm 1) to create non-fixed size patches. The initial patch size was fixed at 56 × 56, resulting in the generation of four patches per input image (as shown in Fig. 2). Algorithm 1 depicts the nested patch division.

    [Algorithm 1: nested patch division]

  3. Extract deep features using the fc1000 and global avg_pool layers of the pre-trained DenseNet201. From each patch, two deep feature vectors of lengths 1000 and 1920 were generated using the fc1000 and global avg_pool layers, respectively. These were concatenated into a feature vector of length 2920:

    $$fv_{i}=\gamma \left(\zeta \left(ptch_{i}\right),\varrho \left(ptch_{i}\right)\right), \quad i\in \{1,2,\dots ,4\}$$
    (1)

    where \(fv_{i}\) represents the ith feature vector; \(ptch_{i}\), the ith patch; \(\zeta (.)\), the fc1000 layer; \(\varrho (.)\), the global avg_pool layer; and \(\gamma (.,.)\), the merging function.

  4. Merge the four feature vectors generated from the four patches to obtain one merged feature vector per input image.

    $$feat\left(j+2920\times \left(i-1\right)\right)=fv_{i}\left(j\right), \quad j\in \{1,2,\dots ,2920\}$$
    (2)

    where \(feat\) represents the merged feature vector of length 11,680 (= 2920 × 4).

By following these steps, the input SEM images were appropriately processed, and the necessary local and global features were extracted using the DenseNet201 model.
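The index arithmetic of Eqs. (1) and (2) can be checked with a short Python sketch. The two extractor functions below are hypothetical stand-ins for the fc1000 and avg_pool outputs of DenseNet201: they return labeled placeholders rather than real activations, so that the position of every element in the merged vector can be traced. Only the vector lengths (1000, 1920, 2920, and 11,680) are taken from the text.

```python
FC_LEN, POOL_LEN, N_PATCHES = 1000, 1920, 4
FV_LEN = FC_LEN + POOL_LEN  # 2920 features per patch


def fc1000_features(patch_id):
    """Hypothetical stand-in for zeta(.): a 1000-element placeholder vector."""
    return [("fc1000", patch_id, j) for j in range(FC_LEN)]


def avg_pool_features(patch_id):
    """Hypothetical stand-in for varrho(.): a 1920-element placeholder vector."""
    return [("avg_pool", patch_id, j) for j in range(POOL_LEN)]


def merge(first, second):
    """Merging function gamma(., .): plain concatenation."""
    return first + second


# Eq. (1): one 2920-element vector per patch, i = 1..4.
fv = [merge(fc1000_features(i), avg_pool_features(i)) for i in range(1, N_PATCHES + 1)]

# Eq. (2): feat(j + 2920 * (i - 1)) = fv_i(j), written with 1-based i and j
# and converted to Python's 0-based indexing at the point of access.
feat = [None] * (FV_LEN * N_PATCHES)
for i in range(1, N_PATCHES + 1):
    for j in range(1, FV_LEN + 1):
        feat[(j + FV_LEN * (i - 1)) - 1] = fv[i - 1][j - 1]

print(len(feat))
```

Under these assumptions, the merged vector places the 2920 features of patch 1 first, followed by those of patches 2, 3, and 4, which is the same result as concatenating the four per-patch vectors in order.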

3.2 Feature selection

The usage of overlapping nested patch-based feature extraction inherently leads to the generation of redundant features in the central region of the input image, as depicted in Fig. 2. To address this issue, our proposed model incorporates INCA, a straightforward yet highly effective feature selection mechanism. INCA operates by iteratively selecting the most discriminative features based on computed loss values [24], thereby filtering out redundant and non-informative features. In our experiments, we set the parameters of INCA as follows: the iteration range was defined from 500 to 1000, and the loss function calculator employed was SVM. Remarkably, when applied to our extensive study dataset consisting of 21,272 images, INCA successfully generated a final feature vector of optimal length 698.
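The iterative selection loop can be illustrated with a small Python sketch. It assumes that features are first ranked by a per-feature weight (in INCA these weights would come from neighborhood component analysis) and that each candidate subset of the top k features is scored by a loss function (an SVM misclassification rate in this study). Both the weights and the loss function below are toy stand-ins, and the name `inca_select` is hypothetical.

```python
def inca_select(weights, loss_fn, start, stop):
    """Pick the top-k feature subset with the lowest loss for k in [start, stop].

    weights: one relevance weight per feature (assumed precomputed, e.g. by NCA).
    loss_fn: callable mapping a list of feature indices to a loss value.
    """
    # Rank feature indices from most to least relevant.
    ranked = sorted(range(len(weights)), key=lambda idx: weights[idx], reverse=True)
    best_subset, best_loss = None, float("inf")
    for k in range(start, stop + 1):
        subset = ranked[:k]
        loss = loss_fn(subset)
        if loss < best_loss:  # keep the first subset that reaches the minimum
            best_subset, best_loss = subset, loss
    return best_subset, best_loss


# Toy example: 10 features, and a made-up loss that is minimal at 6 features.
toy_weights = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
toy_loss = lambda subset: abs(len(subset) - 6) / 10
selected, loss = inca_select(toy_weights, toy_loss, start=3, stop=8)
print(len(selected), loss)
```

In the setting described above, `start` and `stop` would be 500 and 1000, `weights` would hold 11,680 values, and `loss_fn` would train and evaluate an SVM on the candidate subset, which is far more expensive than the toy loss used here.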

3.3 Classification

For classification purposes, we employed cubic SVM, a widely recognized and efficient shallow classifier [25, 26]. To evaluate the performance of our model, we adopted a 75:25 training-to-test split hold-out validation strategy. The SVM parameters were configured as follows: the kernel utilized was a third-degree polynomial function, the coding scheme employed was one-vs-all, and the box-constraint parameter was set to 1.
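To make the hold-out protocol concrete, the following Python sketch splits sample indices 75:25. It is an assumption-laden illustration: the text does not state whether the split was stratified by class or which random seed was used, so the sketch uses a plain shuffled split with an arbitrary seed, and the name `holdout_split` is hypothetical. The classifier itself is not shown.

```python
import random


def holdout_split(n_samples, train_fraction=0.75, seed=0):
    """Shuffle sample indices and split them into training and test sets.

    Hypothetical helper: a plain (non-stratified) split with a fixed seed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = holdout_split(21272)
print(len(train_idx), len(test_idx))
```

For strongly imbalanced classes such as those in this dataset, a stratified split that preserves class proportions in both partitions may be preferable; that variant is not shown here.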

4 Experiment

In this section, we present our experimental results. We define two cases to obtain generalizable results.

4.1 Setup

The SEM image dataset utilized in this study is publicly available and was downloaded from the relevant source. Our model was implemented in the MATLAB (2021b) environment on a modestly configured personal computer equipped with 16 gigabytes of main memory, a 1 terabyte hard disk, a central processing unit operating at 3.60 gigahertz, and the Windows 11 operating system. To facilitate our implementation, we acquired the pre-trained DenseNet201 network. The proposed model was coded using m files and functions. Furthermore, we employed the MATLAB classification learner toolbox to generate the SVM code. The proposed NFSDense201 is a parametric deep feature engineering model, and the specific parameters employed are provided as follows.

4.1.1 Deep feature extraction

We extracted features from the fully connected (fc1000) and global average pooling (avg_pool) layers of the pre-trained DenseNet201. The DenseNet201 architecture was used with default settings. For feature extraction, we resized the images to dimensions of 224 × 224. Additionally, we employed four nested patches with sizes of 56 × 56, 112 × 112, 168 × 168, and 224 × 224.

4.1.2 Feature selection

INCA was utilized during the feature selection phase, and the parameters of this selector are presented as follows. We defined the iteration range from 500 to 1000, and the loss function calculator employed was SVM. Notably, the number of iterations of the NCA corresponds to half of the total number of observations. Using a greedy search over this range, the feature vector with the lowest misclassification rate was selected.

4.1.3 Classification

To classify the selected features, we employed a third-degree polynomial (cubic) SVM. The settings of the classifier were as follows: the kernel was a third-degree polynomial function, the coding scheme was one-vs-all, the box-constraint parameter was set to 1, and validation used a 75:25 hold-out split.

4.2 Classification tasks

To evaluate the classification performance in a general setting, we devised two distinct cases, each encompassing a different number of classes. The specific details of these cases are outlined as follows:

  • Case 1: In this case, we utilized a subset of the dataset consisting of 5,080 SEM images. These images were drawn from four categories: “fibers,” “nanowires,” “porous sponge,” and “powder” (used for training and testing).

  • Case 2: For this case, we employed the entire dataset, comprising a total of 21,272 images spanning all ten categories.
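The sizes of the two cases follow directly from the per-class image counts reported in Section 2. The short Python check below reproduces that arithmetic; the dictionary and variable names are illustrative only.

```python
# Per-class image counts as listed in the dataset description (Section 2).
CLASS_COUNTS = {
    "biological": 972,
    "fibers": 162,
    "film-coated surfaces": 326,
    "MEMS devices and electrodes": 4590,
    "nanowires": 3820,
    "particles": 3925,
    "pattern surfaces": 4755,
    "porous sponge": 181,
    "powder": 917,
    "tips": 1624,
}

# Case 1 uses four of the ten categories; Case 2 uses all of them.
CASE1_CLASSES = ("fibers", "nanowires", "porous sponge", "powder")

case1_total = sum(CLASS_COUNTS[name] for name in CASE1_CLASSES)
case2_total = sum(CLASS_COUNTS.values())
print(case1_total, case2_total)
```

The counts also make the class imbalance explicit: in Case 1, "nanowires" alone accounts for 3820 of the images, whereas "fibers" and "porous sponge" contribute fewer than 200 each.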

It is worth noting that the optimal number of features selected by INCA in Case 2, namely 698, was applied to both cases. This consistent feature selection approach allowed us to use the same set of 698 features to evaluate the classification performance in both scenarios.

4.3 Model performance evaluation

The model's performance was evaluated using standard performance metrics [27, 28], namely accuracy, recall, and precision. We calculated both class-wise and overall performance metrics. The mathematical explanations of these performance evaluation parameters are illustrated as follows:

$$\mathrm{accuracy}=\frac{\mathrm{tp}+\mathrm{tn}}{\mathrm{tp}+\mathrm{tn}+\mathrm{fp}+\mathrm{fn}}$$
(3)
$$\mathrm{recall}=\frac{\mathrm{tp}}{\mathrm{tp}+\mathrm{fn}}$$
(4)
$$\mathrm{precision}=\frac{\mathrm{tp}}{\mathrm{tp}+\mathrm{fp}}$$
(5)
$$f1=\frac{2\mathrm{tp}}{2\mathrm{tp}+\mathrm{fn}+\mathrm{fp}}$$
(6)

where \(\mathrm{tp}\), \(\mathrm{tn}\), \(\mathrm{fn}\), and \(\mathrm{fp}\) represent the numbers of true positives, true negatives, false negatives, and false positives, respectively.
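Equations (3) to (6) translate directly into code. The Python sketch below computes the four metrics for a single class from its confusion counts; the function name and the example counts are hypothetical and are not taken from the reported results. It also checks that Eq. (6) equals the harmonic mean of precision and recall.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. (3)
        "recall": tp / (tp + fn),                      # Eq. (4)
        "precision": tp / (tp + fp),                   # Eq. (5)
        "f1": 2 * tp / (2 * tp + fn + fp),             # Eq. (6)
    }


# Hypothetical counts for one class, treated one-vs-rest.
m = classification_metrics(tp=90, tn=880, fp=10, fn=20)

# Eq. (6) is algebraically the harmonic mean of precision and recall.
harmonic_mean = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
print(m, harmonic_mean)
```

For a multi-class task, the counts for each class would be obtained by treating that class as positive and all other classes as negative.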

4.4 Results

NFSDense201 performed strongly in both Case 1 and Case 2, as shown by the overall results in Table 1 and the class-wise results in Fig. 3. In particular, NFSDense201 achieved an accuracy of 99.53% for the four-class classification task and 97.09% for the ten-class classification task.

Table 1 Overall classification performances of NFSDense201
Fig. 3 Class-wise recall, precision, and F1 scores of NFSDense201 for Case 1 (a) and Case 2 (b)

Regarding Case 1, the category-wise outcomes varied. The “porous sponge” category exhibited a flawless performance, achieving a perfect classification rate of 100%. On the other hand, the “fibers” category, which constituted the smallest group within the dataset, had a slightly lower classification rate of 97.56%.

In the context of Case 2, the overall class-wise F1 scores displayed a range of performance levels. The “particles” category showcased the highest F1 score of 98.03%, signifying a notable classification accuracy. Conversely, the “film-coated surfaces” category demonstrated a comparatively lower F1 score of 90.68%, indicating some room for improvement.

Analyzing the overall class-wise recall for Case 2, it is evident that the “powder” category performed exceptionally well, achieving a recall rate of 98.96%. Conversely, the “fibers” category exhibited a lower recall rate of 89.02%. It is worth noting that despite the relatively lower recall rate for the “fibers” category, the precision was remarkably high at 99.79% for Case 2. This implies that while approximately 11% of the “fibers” images may have been misclassified (as indicated by the recall rate of 89.02%), if an SEM image was classified as “fibers,” it was highly likely to be correct, considering the significantly high precision rate.

5 Discussion

NFSDense201 introduces a novel approach to feature generation by leveraging the last fc1000 and global avg_pool layers of the pre-trained DenseNet201. This model aims to extract deep features from non-fixed size patches of incremental dimensions obtained through a unique nested patch division technique applied to resized input SEM images. By incorporating an identical patch that matches the input image, the model efficiently generates comprehensive global and local features. To address feature redundancy resulting from overlapping nested patches centered around the image, we employ INCA feature selection. Through this process, we identify an optimal feature vector length of 698 for the study dataset. For classification, we utilize a standard shallow cubic SVM with a 75:25 split. Our NFSDense201 model achieves remarkable accuracy rates of 99.53% for Case 1 and 97.09% for Case 2.

In conducting a nonsystematic review of existing models for SEM image classification (see Table 2), we find that the NFSDense201 model outperforms its counterparts. Notably, our model utilizes the largest SEM image dataset to date. We specifically compare our results to those of Kavuran et al. [29], who employed transfer learning, feature reduction with metaheuristic optimization, and an SVM classifier for a four-class Case 1 classification task using the same dataset. Interestingly, their optimized model achieved a similar classification accuracy of 99.30% [29], despite our model not utilizing any optimization method. In a separate study, Li et al. [30] proposed deep and machine learning models for classifying minerals in microscopic images into 13 categories. While their end-to-end deep learning model achieved the highest F1 score of 92% [30], it incurred substantial time complexity. In contrast, our NFSDense201 model, based on transfer learning, achieves excellent classification results with minimal time costs. Furthermore, Leracitano et al. [31] obtained an accuracy of 92.50% using a multilayer perceptron-based image classification model, albeit with a relatively small dataset. Tsutsui et al. [32] employed a gray-level co-occurrence matrix to extract textural features from SEM images and achieved 85% accuracy using two shallow classifiers on a limited dataset. Similarly, Tian et al. [33] reported an accuracy of 88% with their pre-trained VGG16-based model on a small SEM image dataset. Lastly, Yin et al. [34] proposed an attention-convolutional neural network model, attaining an accuracy of 98.56% on a sizable four-class dataset.

Table 2 State-of-the-art SEM image classification methods

Within the existing literature, we found no substantial model that specifically addresses Case 2, as far as our knowledge extends. As a consequence, we are unable to provide any comparative findings pertaining to the aforementioned dataset containing 10 distinct classes. To address this gap, we designed a comparative scenario to assess the performance of our proposed model against other prevalent CNN architectures, including (i) MobileNetV2, (ii) DarkNet53, (iii) Xception, (iv) EfficientNetb0, (v) ResNet50, and (vi) InceptionV3. Employing our NFS-based deep feature extraction architecture, we applied it to these pre-trained CNNs and obtained results using Case 2, involving a substantial image dataset.

To illustrate the outcomes, both the calculated results and the performance of our proposed NFSDense201 are depicted in Fig. 4.

Fig. 4 Comparative analysis of pre-trained CNNs for the NFS architecture using Case 2

Figure 4 demonstrates that the optimal deep feature extraction model for our proposed deep feature engineering architecture is DenseNet201, achieving an impressive accuracy of 97.09%. Following closely, InceptionV3 attains a respectable accuracy of 95.33%. Conversely, among the pre-trained CNNs, MobileNetV2 exhibits the lowest performance for this particular problem, with an accuracy of 92.33%.

Key characteristics of the NFSDense201 model are outlined as follows:

  • We introduce a novel nested patch division method, which enables effective feature extraction from SEM images.

  • This method is combined with downstream deep feature extraction using a pre-trained deep network, resulting in a novel deep feature engineering model.

  • The model is trained and evaluated on an extensive dataset comprising 21,272 SEM images.

  • Remarkably, NFSDense201 achieves classification accuracy rates of 97.09% and 99.53% for ten- and four-class SEM image classification tasks, respectively, using a standard shallow SVM classifier without any optimization. These results compare favorably with existing literature.

  • NFSDense201 demonstrates computational efficiency, making it highly practical for implementation.

  • The presented architecture can be readily adapted to address various classification problems.

Nevertheless, certain limitations should be acknowledged. In this study, we employ a cubic SVM classifier without hyperparameter tuning, potentially hindering the model's performance. Exploring optimization methods may yield improved classification results. Additionally, alternative advanced classifiers could have been examined. However, the primary focus of this work is to showcase the discriminative capabilities of the features generated by the main upstream model components, namely the novel nested patch division and the pre-trained DenseNet201. Given this objective, a robust yet shallow classifier like SVM adequately serves our present research goals.

6 Conclusions

In this study, we have successfully demonstrated the feasibility and practicality of the NFSDense201 model for accurate SEM image classification. By integrating the innovative nested patch division technique and efficient deep feature extraction using the pre-trained DenseNet201, we have achieved outstanding classification results. Leveraging the INCA feature selector in combination with a cubic SVM classifier, our model achieved remarkable accuracy rates of 97.09% and 99.53% for ten- and four-class classification tasks, respectively. Notably, these results were obtained using the largest publicly available SEM image dataset.

Our proposed model exhibits several desirable qualities. Firstly, it is computationally lightweight, enabling efficient processing of images. Secondly, its implementation is straightforward, ensuring ease of use for researchers and practitioners. Moreover, we believe that the NFSDense201 model holds promise beyond SEM image classification and can be readily applied to other computer vision tasks with minimal modifications.

As future research directions, we suggest exploring alternative methods and networks to enhance the nested patch division and/or replace the pre-trained DenseNet201 model. By integrating these new components, we can develop next-generation feature engineering models tailored to diverse image classification applications. This avenue of investigation has the potential to further improve the performance and versatility of our approach.