Background and literature review

The identification and classification of pollen grains are essential in various fields, including agriculture, ecology, paleoclimatology, environmental science, paleoecology, archeology, medicine, and forensics [1,2,3,4]. The study of pollen grains, known as palynology, relies heavily on analyzing morphological characteristics such as general shape, polarity, symmetry, apertures, size, and ornamentation. However, because pollen grains from different taxa are often morphologically similar, it can be challenging to use these features to quickly and accurately identify pollen species, genera, or even families (as illustrated in Fig. 1), and traditional identification methods have been associated with high error rates [5,6,7]. In addition, manually identifying pollen grains under a microscope is time-consuming and labor-intensive.

Fig. 1

Images of pollen grains illustrating the similarities in morphological features across divergent taxa. A Salvia dorrii; B Monardella villosa; C Phlox longifolia; D Phlox diffusa; E Taraxacum officinale; F Taraxacum californicum; G Astragalus pulsiferae; H Astragalus purshii; I Erysimum capitatum; J Sisymbrium altissimum; K Pinus monophylla; L Abies concolor

Automating the identification process using DL algorithms can provide several benefits, including reducing the time and effort required for identification, improving the accuracy and consistency of the results, and enabling large-scale analysis of pollen grain samples. These methods can lead to new insights and discoveries in numerous fields. In recent years, the demand for high accuracy and computational efficiency has increased in industry and academia due to the availability of advanced technology in computer vision and image processing. Deep learning has been widely utilized to maximize efficiency and accuracy, reduce labor, and minimize artifacts [8,9,10,11,12]. Among the DL techniques, CNNs have gained popularity over the past decade for image classification, object detection, and task recognition, owing to their powerful neural network architecture that automatically extracts mid- to high-level features from image datasets [13,14,15].

CNN models have proven effective for pollen taxonomic classification, especially when combined with transfer learning, which uses pre-trained CNN models to solve new problems [3,4,5,7]. However, CNN models trained from scratch require massive training datasets, which makes pollen grain classification challenging because few pollen images are available. Transfer learning is a technique in which a CNN model pre-trained on a large dataset, such as ImageNet with its millions of images, is repurposed to learn a new task by leveraging the knowledge already gained from the previous task [19]. It is an effective way to learn features from small training datasets and automatically classify images, making it a powerful tool for training deep networks without overfitting. One limitation, however, is that transfer learning relies heavily on the large dataset used for pre-training to avoid overfitting [16,17,18]. In the context of pollen classification, these pre-trained models can be used to make predictions or combined to train a new model. Transfer learning offers several advantages, including avoiding the cost of training a model from scratch, which is typically time-consuming and requires exploring many parameter combinations. Moreover, using pre-trained models can lead to higher accuracy and a lower risk of overfitting, making it a valuable approach for pollen classification tasks [17, 20, 21].

Previous studies on automated pollen grain identification have succeeded to some extent [4, 5, 7]. However, one of the main challenges in identifying pollen species is the limited availability of pollen datasets for training neural network models. The small number of datasets makes it difficult to define relevant features and variations in pollen morphology for identification purposes, especially given the similarities among pollen species [19, 22,23,24,25,26,27,28,29]. Moreover, most previous studies on pollen identification algorithms have focused on Europe, Asia, and equatorial regions, leaving a gap in the literature for North America [23,24,25,26,27,28,29]. Our study fills this gap by focusing on pollen classification in North America. Studying pollen grains in this region can expand our understanding of pollen morphology and provide more accurate identification tools for researchers worldwide.

The aim of this study was to enhance the accuracy of pollen grain image classification by utilizing transfer learning techniques to address the challenge of limited training data. To accomplish this objective, we employed a total of eleven transfer learning architectures, namely AlexNet, VGG-16, MobileNet-V2, ResNet (18, 34, 50, 101), ResNeSt (50, 101), SE-ResNeXt, and ViT, in addition to developing a CNN model from scratch. Our research objective was to achieve highly accurate and efficient classification of pollen grain images in North America, which has not been previously accomplished. Furthermore, this study sought to answer several critical scientific questions, such as the effectiveness of the proposed scratch CNN model in identifying pollen grains compared to the transfer learning models, and how the performance of the 11 transfer learning models compared to each other in identifying different types of pollen grains. We also aimed to investigate the limitations of the proposed models in identifying pollen grains and provide possible avenues for addressing these limitations in future research.

Materials and methods

Data collection and data preprocessing

To train and test the models, we collected a dataset of pollen grain images. Pollen grains were collected from plants located at the University of Nevada, Reno Museum of Natural History (UNRMNH). To ensure the accuracy of pollen identifications, we prepared over 400 reference slides containing pollen from previously identified native flowers at the UNRMNH. A total of 10,000 images representing 40 pollen species were taken from these reference slides and used for training the models. Each class contains between 95 and 500 images (Figs. 2, 3), each 224 × 224 pixels in size and stored in *.jpg format.

Fig. 2

Histogram of taxa images used in this study

Fig. 3

The basic structure of CNN

We used a ZEISS Axiolab 5 light microscope and an Axiocam 208 color microscope camera to identify and photograph pollen grains. The images were captured using 40× objective lenses and 10× ocular lenses. Z-stack images were used to capture all details of the pollen grains, showing the vertical details of pollen grains at various focus levels. To prepare the images for training the model, we cropped each image using Adobe Photoshop (CS6, 13.0.1.3). Then, we removed images with high noise levels due to debris, air bubbles, or aggregated pollen.

Before training the models, the dataset was preprocessed; this step included normalizing the pixel values to a specific range and resizing the images to the appropriate input size for the models. We also applied data augmentation techniques such as rotating, flipping, or adding noise to the images to improve model robustness and prevent overfitting. The dataset was split into training (70%), validation (15%), and test (15%) sets to train, tune, and evaluate the models.
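The 70/15/15 split can be sketched as follows. This is a minimal illustration, not the procedure used in this study: it assumes the split is stratified (applied within each class so that rare taxa appear in all three subsets), and the function name, seed, and example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) lists of sample indices, split per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical labels: one small class (100 images) and one large class (500 images)
labels = ["taxon_a"] * 100 + ["taxon_b"] * 500
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each class keeps the class proportions of an imbalanced dataset roughly the same in all three subsets.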

CNN modeling background

CNNs are a type of DL model used for image classification tasks. These models comprise multiple layers, including input, hidden, and output layers (Fig. 3). The input layer takes the image dataset as input; the images are preprocessed, resized to a suitable size, and passed to the convolutional layers. In a convolutional layer, filters (kernels) slide over the input and compute element-wise multiplications that are summed to extract low- and high-level features. A non-linear activation function (ReLU) is then applied to the extracted features, and the pooling layer reduces the size of the feature maps while retaining important information. The features are then processed in the fully connected layers, and a final non-linear function assigns each taxon a probability between 0 and 1. This step adds considerable power to traditional taxonomic approaches, while the automated classification step provides a quick computerized approach for identifying pollen. The output layer provides the final classification result for the given input image.
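The building blocks described above can be illustrated on a toy input. The sketch below is written in plain Python for readability and is not the implementation used in this study; the 5 × 5 "image", the 2 × 2 kernel, and the example scores are made up, and softmax is assumed as the function that produces per-class probabilities.

```python
import math


def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy image with a vertical edge and a kernel that responds to such edges
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
features = max_pool(relu(conv2d(image, kernel)))
probabilities = softmax([1.0, 2.0, 3.0])
print(features, probabilities)
```

The pooled feature map keeps the strong edge response while shrinking the 4 × 4 convolution output to 2 × 2, which is the size reduction the pooling layer is meant to provide.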

Research methodology

Create a model from scratch

We developed a 6-layer CNN model from scratch. We chose an input image size of 224 × 224 and applied data augmentation techniques such as rotation, rescaling, shear, zoom, and horizontal flip to the training images. The model consisted of three convolutional layers and two fully connected layers. The Rectified Linear Unit (ReLU) activation function was used within each convolutional layer. To avoid overfitting, dropout with a rate of 0.2 was applied. The softmax function was applied to estimate the probability for each taxon. The model was trained for 14 epochs with a batch size of 32 using the Adam optimizer with a learning rate of 0.0001.
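To make the role of the learning rate concrete, the following sketch applies the Adam update rule to a single scalar parameter. It is an illustration only: the learning rate of 0.0001 comes from the text, whereas the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as are the toy loss and the function name.

```python
import math


def adam_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new value.

    `state` holds the running moment estimates ("m", "v") and the step counter ("t").
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
first = adam_step(w, 2 * (w - 3), state)
print(first)
```

On the first step the update has a magnitude of almost exactly the learning rate, regardless of how large the gradient is. As a rough guide, if 70% of the 10,000 images (7,000) are used for training with a batch size of 32, one epoch corresponds to 219 updates and 14 epochs to about 3,066 updates, ignoring any change in sample count from augmentation.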

Transfer learning

Transfer learning was used to improve the classification accuracy of pollen grain images. The approach uses a pre-trained CNN model as the starting point for a new task. The model architecture is adjusted by freezing some layers of the pre-trained model and modifying the output neurons to fit the new task. The frozen convolutional layers act as a fixed feature extractor that extracts relevant features from the input pollen images, while the weights and biases of the unfrozen layers are trained on the new image dataset, allowing the pre-trained model to adapt to the new task. For retraining these transferred networks, the number of classes in the last layer was set to 40, the number of Great Basin pollen species in our dataset.
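Resizing the output layer changes how many parameters the classification head contains. The arithmetic below is a sketch under two assumptions: the final layer is a single fully connected layer, and its input width matches the published architectures (4096 for AlexNet and VGG-16, 512 for ResNet-18, 2048 for ResNet-50). The helper name is hypothetical, and which layers were actually frozen in each experiment is not specified here.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Width of the feature vector feeding the final layer (published architectures)
feature_width = {"AlexNet": 4096, "VGG-16": 4096, "ResNet-18": 512, "ResNet-50": 2048}

IMAGENET_CLASSES = 1000  # classes in the original pre-training task
POLLEN_CLASSES = 40      # classes in the pollen task

for name, width in feature_width.items():
    old_head = dense_params(width, IMAGENET_CLASSES)
    new_head = dense_params(width, POLLEN_CLASSES)
    print(f"{name}: {old_head:,} -> {new_head:,} parameters in the output layer")
```

If every other layer were frozen, these new-head counts would be the only parameters updated during retraining, which is one reason transfer learning can work with a few thousand images.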

Proposed transfer learning methods

  1. AlexNet: AlexNet is an early large-scale CNN model that was originally created to classify millions of images into 1000 categories of the ImageNet dataset [30]. The model takes RGB (Red Green Blue) input images of size 224 × 224 and consists of eight layers, including five convolutional layers and three fully connected (FC) layers. The AlexNet model has around 61 million parameters (Table 1). The output layer in the AlexNet model predicts the probability of images belonging to each pollen species category. This approach uses the ReLU activation function, dropout, and data augmentation strategies to avoid overfitting.

  2. VGG-16: This architecture was introduced by the Visual Geometry Group (VGG) at the University of Oxford. VGG-16 consists of 13 convolutional layers, five max-pooling layers, and three fully connected layers [13]. VGG-16 has over 138 million parameters and uses the ReLU activation function and dropout regularization to reduce generalization error and prevent overfitting. The hidden layers use the ReLU activation function, and the final layer uses the softmax activation function. Images with a fixed size of 224 × 224 are used as inputs, and the convolution stride is set to 1 (Table 1). The main difference compared to previous models is the deeper architecture, which stacks two or three convolutional layers before each pooling layer. In our model, we used the Adam optimizer with a learning rate of 0.0001, and training was performed with a batch size of 32 for 14 epochs.

  3. MobileNet-V2: MobileNet-V2 is a family of neural network architectures for efficient on-device image classification and related tasks. The “V2” indicates that it is the second version of the MobileNet architecture, with enhancements that improve accuracy and reduce computational complexity, making the model more efficient for mobile and edge devices where computational resources are limited. This architecture was introduced by a team of Google engineers [31]. MobileNet-V2 is a lightweight CNN model with 5.3 million parameters, making it remarkably efficient compared to the other architectures in this study. It contains 53 layers, including an initial fully convolutional layer with 32 filters followed by 19 residual bottleneck layers, and takes input images of size 224 × 224 (Table 1). The architecture features linear bottlenecks between the layers and shortcut connections between the bottlenecks, enabling faster training and better accuracy. It relies on 3 × 3 depth-wise separable convolutions, which require 8 to 9 times less computation than traditional convolutions with negligible loss in model performance, resulting in models that are smaller, lower-latency, and lower-power. Global hyperparameters allow the model builder to choose the model size that best trades off accuracy against resource use.

  4. ResNet (Residual Network): The ResNet architecture was developed by Microsoft researchers [32] and comes in several depths, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, and ResNet-1202. In this study, we use ResNet-18, 34, 50, and 101. The ResNet architecture introduces an identity shortcut connection that skips one or more layers, which helps address the vanishing gradient problem commonly encountered in DL models. This is especially important because stacking many layers often causes the gradients to vanish as they propagate back through the network. Instead of a stack of large fully connected layers, ResNet uses global average pooling before the output layer. Batch normalization is applied after the convolutional layers to aid convergence and enable the use of higher learning rates, leading to faster training. The input images for this architecture are 224 × 224 pixels, as shown in Table 1.

  5. ResNeSt: The name is short for “ResNet with Split-attention Networks”. It is an architecture developed by Facebook researchers [33] that includes between 27 and 48 million parameters. ResNeSt is a variant of ResNet that combines channel-wise attention with multi-path representation in a Split-Attention block, allowing attention across feature-map groups. Two main variants of ResNeSt are ResNeSt-50 and ResNeSt-101, which are pre-trained on the ImageNet dataset. This architecture uses an average pooling layer with a kernel size of 3 × 3 and input images of size 224 × 224 pixels (Table 1). To prevent overfitting, dropout regularization with a probability of 0.2 is employed.

  6. SE-ResNeXt: SE-ResNeXt is an extension of the ResNeXt architecture (a variant of the original ResNet model that incorporates “next generation” enhancements for better performance) that adds a squeeze-and-excitation (SE) block. It was introduced by Hu et al. [34] and contains over 28 million parameters. SE-ResNeXt uses the same basic building block as ResNeXt, a split-transform-merge strategy that enables parallel feature extraction. An SE block is placed at the end of each non-identity branch of the residual block, where it performs channel-wise feature recalibration by explicitly modeling interdependencies between channels. Stacking SE blocks together produces a model that performs well on several complex image datasets. The input image size for this model is fixed at 224 × 224 (Table 1).

  7. Vision Transformer (ViT): ViT is an image classification model that applies the Transformer architecture, which was initially developed for Natural Language Processing (NLP), to patches of images [35]. The Transformer is a deep neural network based on the attention mechanism that achieved state-of-the-art results in NLP tasks. This success has inspired computer vision researchers to use the Transformer approach for image classification tasks [36]. Unlike CNNs, which take the pixels of an image as input, ViT divides the image into fixed-size patches (usually 16 × 16), embeds each patch together with its positional embedding, and feeds the result to the transformer encoder. ViT employs self-attention to let the model integrate information from across the whole image.
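Two of the quantities mentioned in the list above can be checked with simple arithmetic: the saving from depth-wise separable convolutions in MobileNet-V2 and the number of patches a ViT sees for a 224 × 224 input. The sketch below counts multiplications only; the layer dimensions (128 input channels, 256 output channels, 28 × 28 output map) are hypothetical and chosen for illustration.

```python
def standard_conv_mults(kernel, c_in, c_out, out_h, out_w):
    """Multiplications in a standard convolution layer."""
    return kernel * kernel * c_in * c_out * out_h * out_w


def separable_conv_mults(kernel, c_in, c_out, out_h, out_w):
    """Multiplications in a depth-wise separable convolution layer."""
    depthwise = kernel * kernel * c_in * out_h * out_w  # one filter per input channel
    pointwise = c_in * c_out * out_h * out_w            # 1x1 convolution mixes channels
    return depthwise + pointwise


def vit_patch_grid(image_size, patch_size, channels=3):
    """Return (number of patches, values per flattened patch)."""
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


# Hypothetical layer: 3x3 kernel, 128 -> 256 channels, 28x28 output map
speedup = standard_conv_mults(3, 128, 256, 28, 28) / separable_conv_mults(3, 128, 256, 28, 28)
patches, patch_length = vit_patch_grid(224, 16)
print(round(speedup, 2), patches, patch_length)
```

For a 3 × 3 kernel the ratio approaches 9 as the number of output channels grows, which is consistent with the "8 to 9 times" figure quoted for MobileNet-V2. A 224 × 224 RGB image cut into 16 × 16 patches yields a 14 × 14 grid of patches.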

Table 1 Properties of our scratch model and eleven pre-trained CNNs

In Table 1, we compare the models in terms of their depth, number of parameters, input image size, complexity, and speed. Complexity refers to the computational complexity of the model, which we determine primarily from the number of layers and the number of parameters the model contains. A ‘low’ complexity model is relatively simple and requires fewer computational resources, typically having fewer layers and parameters. A ‘high’ complexity model is more intricate, with more layers and parameters, and thus requires more computational resources.

Speed refers to the inference speed of the model, that is, the rate at which the model can process input and generate output, measured as the number of input samples processed per unit of time. A ‘high’ speed model processes more input samples in a given time frame than a ‘low’ speed model.

These categorizations are relative and are meant to provide a broad comparison across models; the actual values depend on factors such as batch size, hardware accelerators (GPUs, TPUs), and software optimizations.

Experimental design and optimization techniques

The experiments for the scratch model, AlexNet, and VGG-16, as well as data preprocessing and analysis, were conducted on a Dell Alienware (m17 R4) laptop using the Python programming language (version 3.10.6) and several libraries for running DL models. For the other transfer learning experiments, we used the Microsoft Azure cloud computing platform with the Azure automated ML service on a Standard_NC6 virtual machine with a GPU (NVIDIA Tesla K80), six cores, 56 GB RAM, and 380 GB storage. To optimize our models, we applied several tuning settings, including early stopping (using the Bandit policy with a slack factor of 0.1) and 15 ensemble iterations. Additionally, we used grid search to find the optimal hyperparameters, specifying the grid sampling method for sweeping over the defined parameter space, and limited the sweep to a maximum of 100 configurations.
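The effect of a Bandit early-termination policy with a slack factor of 0.1 can be sketched as below. This is a plain-Python illustration of the rule as documented for Azure Machine Learning, not a call into the Azure SDK: a run is stopped when its primary metric falls below the best run's metric divided by (1 + slack factor). The function name and the example accuracies are hypothetical, and the metric is assumed to be one that is maximized.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is maximised (for example, validation accuracy).
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Hypothetical sweep state: the best run so far reports an accuracy of 0.97
best = 0.97
for accuracy in (0.85, 0.90):
    print(accuracy, bandit_should_terminate(accuracy, best))
```

With a best accuracy of 0.97 the cut-off is about 0.882, so a run at 0.85 would be stopped early while a run at 0.90 would continue.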

Performance metrics

Figure 4 shows a flowchart of the pollen classification steps using CNNs, including input images, preprocessing steps, transfer learning models, and evaluation metrics.

Fig. 4

Flowchart showing the pollen image classification process across several steps, including: (1) Input images; (2) Preprocessing step; (3) Transfer learning models; (4) Evaluation Metrics

This section evaluates the performance of the transfer learning models in classifying pollen grain images. The models are assessed based on accuracy, precision, recall, and F1-score. The evaluation uses a macro-average, which computes each metric per class and assigns equal weight to each pollen species class. The macro-average is preferred because the dataset is relatively imbalanced and all classes are equally significant. To analyze the experimental results, we use the confusion matrix, which tabulates the four outcomes: TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). Together, these metrics provide insight into how well the model performs across all classes.

  1. Accuracy is the ratio of correctly predicted samples to the total number of samples evaluated.

    $$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
    (1)
  2. Recall (sensitivity) is the fraction of actual positive samples that are correctly classified.

    $$ \mathrm{Recall} = \frac{TP}{TP + FN} $$
    (2)
  3. Precision (positive predictive value) is the fraction of samples predicted as positive that are actually positive.

    $$ \mathrm{Precision} = \frac{TP}{TP + FP} $$
    (3)
  4. F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean.

    $$ \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
    (4)
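The metrics above can be computed directly from a multi-class confusion matrix. The sketch below is illustrative rather than the evaluation code used in this study: the 3-class matrix is made up, the function name is hypothetical, and accuracy is computed as the overall fraction of correct predictions, which is an assumption about how Eq. (1) is applied to the multi-class case.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    `confusion[i][j]` counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,  # macro-average: every class weighs the same
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up 3-class confusion matrix (rows: true class, columns: predicted class)
example = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
scores = macro_metrics(example)
print(scores)
```

Because each class contributes equally to the macro-average, a poorly classified rare taxon lowers the score as much as a poorly classified common one.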

Results and discussion

Classifying small datasets in computer vision is challenging and has been a topic of interest for many researchers [37, 38]. In our study, we addressed the limited number of pollen images by comparing the performance of a scratch model and eleven transfer learning models. Our research builds upon a few automated classification methods for pollen grains that were developed using small datasets [19, 20, 22, 23, 25]. We trained a six-layer CNN model from scratch, fine-tuned the hyperparameters, and achieved an accuracy of 91.87%. We also evaluated the performance of eleven transfer learning architectures on the classification of pollen grain images. Despite imbalances in the dataset, the models achieved excellent values for accuracy (ranging from 92.87 to 97.24%), precision (ranging from 93.50 to 97.89%), recall (ranging from 93.10 to 97.13%), and F1-score (ranging from 92.40 to 96.86%). The best-performing models were ResNeSt-101 and SE-ResNeXt, with accuracy values of 97.24% and 97.05%, respectively. AlexNet had the lowest accuracy of the transfer learning models, at the bottom of the range reported above (see Table 2). The study also found that deeper networks in the ResNet family performed relatively better than shallower ones (ResNet-101 > ResNet-50 > ResNet-34 > ResNet-18), indicating the importance of having more layers in the model to improve the learning of low- and high-level features in pollen grain images. Table 2 and Figure 5 provide more information on each model’s precision, recall, and F1 scores.

Table 2 Model performance of different transfer learning architectures in this study
Fig. 5

The values of accuracy and loss for the top five best transfer learning models

The ViT has a shorter training time but may not perform well on small datasets due to its high capacity and complex architecture. ViTs require a significant amount of data to generalize well and may overfit on limited data [39]. In addition, ViTs apply self-attention mechanisms to capture global dependencies in the image but may not capture fine-grained details as effectively as models that use convolutional layers, such as ResNeSt and SE-ResNeXt [40]. The ResNeSt-101 model has a deeper and wider architecture than the ViT model, which may contribute to its better performance in this study. The ResNeSt-101 model has 101 layers, while the ViT model has only 12 layers. Deeper architectures can capture more complex and abstract features, which may be necessary for accurately identifying pollen grains [41, 42]. Additionally, the ResNeSt-101 and SE-ResNeXt models have a wider architecture, meaning that they have more channels in each convolutional layer, which allows them to capture more diverse and informative features from the pollen grain images.

On the other hand, the MobileNet-V2 network is a lightweight model that has the smallest number of parameters, making it more suitable for use in applications with limited storage space. However, its performance on pollen classification is lower than most other architectures. Therefore, we recommend using MobileNet-V2 architecture when high classification performance is not critical, such as when the model is used in a phone application. The slight decrease in classification accuracy can be tolerable in such cases.

ResNeSt and SE-ResNeXt architectures leverage multi-scale features in a nested way, which enables them to capture more complex patterns and features in the data. Specifically, ResNeSt uses a multi-scale group convolutional approach that divides the channels of each convolutional layer into groups and aggregates them hierarchically [33]. This allows ResNeSt to capture fine-grained details in the image and learn more discriminative features, which can lead to higher accuracy and precision. At the same time, the stacked blocks in SE-ResNeXt generate a highly effective model for the pollen grain dataset [34]. The SE-ResNeXt architecture is designed to enhance the representational power of a network by allowing dynamic channel-wise feature recalibration [34].

This study also found that increasing the number of layers is useful, especially in ResNet networks. However, increasing the number of channel groups, as in ResNeSt and SE-ResNeXt, was more effective than increasing the depth. These techniques can enhance accuracy without increasing parameter complexity, and can even reduce the number of parameters. Overall, our analysis highlights the limitations of relatively shallow and simple models such as AlexNet, VGG-16, ResNet-18, and ResNet-34 for the pollen classification task. Our experimental results demonstrate that models with shortcut connections and Squeeze-and-Excitation networks, such as ResNeSt and SE-ResNeXt, outperform the other models on the pollen dataset. Therefore, we recommend these more advanced models for improved accuracy and performance in pollen classification (Fig. 6).

Fig. 6

Confusion matrix for the 40 pollen species from the Great Basin used in the training dataset. Rows are species identities, and columns are the species assignments made by the CNN. The color bar indicates frequency, with dark green being the most frequent. The diagonal elements are the frequencies of correctly classified outcomes, while misclassified outcomes are on the off-diagonals
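A confusion matrix like the one in Fig. 6 can be assembled from pairs of true and predicted labels, and its off-diagonal entries show which taxa are most often mistaken for one another. The sketch below uses made-up labels with generic taxon names; the function names are hypothetical and the counts do not reflect the results of this study.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused(counts, top=3):
    """Return the most frequent off-diagonal (true, predicted) pairs."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:top]


# Made-up example: taxon_a and taxon_b are sometimes mixed up, taxon_c never is
true = ["taxon_a"] * 4 + ["taxon_b"] * 4 + ["taxon_c"] * 2
pred = (["taxon_a"] * 3 + ["taxon_b"]
        + ["taxon_b"] * 2 + ["taxon_a"] * 2
        + ["taxon_c"] * 2)
counts = confusion_counts(true, pred)
print(most_confused(counts))
```

Listing the largest off-diagonal counts is a quick way to find morphologically similar taxa that a model struggles to separate.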

Conclusion

In the context of pollen grain classification, transfer learning allows models pre-trained on large image datasets to be used to classify pollen grains more efficiently. With the approaches outlined here, we have demonstrated that fine-tuning a pre-trained model on a small dataset of pollen grain images achieves accurate classification results while reducing the training time and computational resources required. This study demonstrated that increasing the complexity and depth of neural networks is effective for reliable and efficient classification of pollen grains at low taxonomic levels. However, classifying pollen grain datasets using machine learning and deep neural networks is challenging due to the relatively small and imbalanced sets of images. Among the transfer learning techniques used, ResNeSt-101 and SE-ResNeXt performed exceptionally well, even though these architectures were pre-trained on the ImageNet dataset, which has no image data similar to pollen grains. These techniques worked well because of their ability to capture multi-scale features, their deeper and wider architecture, and their suitability for the task of pollen grain classification.

The findings of this research have significant implications for the study of the Great Basin Desert, as identifying pollen grains can provide insights into the plant species present in the region, their distribution, and their interactions with other organisms. Further research is needed to address the limitations of the proposed models, including a focus on potential bias in the dataset and improved interpretability of the model.