1 Introduction

Coffee bean classification plays an essential role in the coffee industry, as it impacts the quality and flavor profile of the final product [1,2,3,4,5]. Accurate classification of coffee beans allows producers to make informed decisions, ensuring an improved product for the end customer. Farmers currently rely on manual methods for detecting coffee diseases, which demand significant financial investment in training [6,7,8]. Coffee production is sensitive to global price fluctuations, which affect the economies and stability of countries dependent on this precious commodity. The rise of specialty coffee has transformed coffee consumption into a sensory experience, necessitating precise classification and differentiation of coffee types to meet the demands of coffee enthusiasts and connoisseurs. Deep learning (DL) is a subfield of artificial intelligence (AI) that has made significant advances in the computer vision (CV) field, enabling the development of several classification applications. The Adam optimizer is applied to pre-trained models [9,10,11,12,13,14,15,16,17]. Its adaptive learning rate and momentum parameters enable efficient convergence to optimal solutions, navigating complex models and enhancing their performance.

The Adam optimizer, which adapts the learning rate for each parameter during training, plays a crucial role in optimizing pre-trained models and is a popular choice for fine-tuning them for various applications [18,19,20,21,22,23,24,25,26,27,28,29].
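For reference, the standard Adam update rule that provides this per-parameter adaptation maintains exponentially decaying averages of the gradient $g_t$ and its square, with decay rates $\beta_1$ and $\beta_2$, step size $\alpha$, and a small constant $\epsilon$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad \theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}.$$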

The versatility and robustness of this optimizer in training pre-trained architectures make it a valuable tool for deep learning practitioners who apply pre-trained models to various tasks [30,31,32,33,34]. Accurately predicting coffee types is critical to ensuring the quality and consistency of the final product, and it is therefore important to investigate the impact of pre-trained models on the accuracy and efficiency of predicting coffee types, given the growing demand for specialty coffee and the need for precise classification. We use transfer learning and fine-tuning techniques to evaluate pre-trained models and identify their strengths and weaknesses. Manual methods may lead to incorrect findings, potentially resulting in the wrong pesticides being used to treat diseases and causing environmental degradation rather than treating the issue [35, 36]. We explore the impact of pre-trained models on coffee type prediction accuracy and efficiency using deep learning techniques, focusing on transfer learning and fine-tuning [37]. In this study, we investigate the effect of selecting different pre-trained models on the performance of a coffee-type prediction system. We compare various state-of-the-art pre-trained models, such as VGG, ResNet, and MobileNet, and analyze their impact on the overall accuracy, training time, and computational resources required for predicting coffee types [38,39,40,41,42]. We propose the use of deep learning techniques and transfer learning to predict coffee types, compare the performance of various pre-trained models, evaluate the generalizability of coffee classification models, and identify the pre-trained models that yield higher accuracy and faster convergence, benefiting coffee producers and processors by enhancing classification systems and economic stability, as shown in Fig. 1 [43,44,45,46]. The main study contributions are:

  1. Exploring the importance of selecting the right pre-trained deep learning model for accurately classifying specialty coffee types, emphasizing the need for precise classification.

  2. Comparing various pre-trained models such as AlexNet, LeNet, HRNet, GoogleNet, MobileNetV2, ResNet-50, VGG, EfficientNet, Darknet, and DenseNet to understand their strengths and weaknesses for specific coffee classification tasks.

  3. Employing transfer learning and fine-tuning techniques to evaluate the generalizability of models for coffee classification, utilizing pre-existing knowledge from large datasets.

  4. Providing clear performance findings on pre-trained models for coffee classification, offering recommendations for future coffee-related applications and guiding the selection process.

  5. Enhancing the performance of pre-trained models using the Adam optimizer, resulting in more accurate and efficient classification of coffee types.

The remainder of the study is structured as follows. Section 2 presents related works. The methodology is presented in Sect. 3. In Sect. 4, the experimental evaluation is offered. In Sect. 5, the conclusion and discussion are provided.

Fig. 1

The general architecture for the study

2 Related works

Pre-trained models, trained on large datasets, may have limitations such as an inability to handle new data types or certain tasks, and may not be optimal for specific tasks because of the conditions under which they were trained [47,48,49,50]. In this section, we present common related works on pre-trained models for computer vision tasks. Researchers are focusing on pre-trained models, which are trained on large datasets and therefore save time and resources; representative works are summarized in Table 1. Esgario et al. [6] suggested a powerful and useful method that can recognize and gauge the level of stress caused by biotic agents on coffee leaves. The suggested method consists of a multitask system built on convolutional neural networks. They also investigated data augmentation techniques to strengthen and improve the system. In computational trials, the suggested system based on the ResNet50 architecture achieved an accuracy of 95.24%. For classifying flaws in coffee beans, Chang et al. [7] suggest a technique that has been shown to reduce bias. The proposed model achieved a detection accuracy of 95.2%. When the model was restricted to defect detection, the accuracy rate reached 100%. Sorte et al. [8] suggest a computational technique for identifying serious diseases in coffee leaves, developing an automated expert system to help coffee growers identify diseases in their early phases. Because these two diseases lack a well-defined shape, a texture attribute extraction approach for pattern recognition was adopted. Novtahaning et al. [9] describe an ensemble strategy for DL models, selecting three models that excel at classification and joining them into an ensemble architecture whose output is fed to classifiers to decide the final prediction. A data pre-processing and augmentation procedure is also used to improve the quality and expand the size of the data sample [51, 52]. By achieving 97.31% validation accuracy, the suggested ensemble architecture outperformed other cutting-edge neural networks. Velásquez et al. [10] propose an experiment employing a coffee leaf rust development-stage diagnostic model on the Coffea arabica Caturra variety at crop scale, using wireless sensor networks, remote sensing, and DL techniques. The diagnostic model attained an F1-score of 0.775. Gope et al. [11] propose a deep neural network model to classify green coffee beans with multi-label properties. The model, modified from the EfficientNet-B1 model, uses branches corresponding to each defect after the feature extraction layers. This improved overall performance to an F1-score of 0.8229, compared to the single EfficientNet-B1 model. Liang et al. [13] aim to develop an automated coffee bean inspection system using YOLOv7 and a convolutional neural network (CNN). The system classifies coffee beans into broken, insect-infested, and mold categories using transfer learning. The YOLOv7 image recognition model processes the captured beans, determining whether they are good or defective. The DenseNet201 model achieves an accuracy of 98.97% in classifying defective coffee beans. Ke et al. [18] propose deep convolutional neural networks (DCNNs) for classifying coffee bean images, with results showing 98% accuracy. The lightweight model, with around 250,000 parameters, is practical due to its low cost. Chen et al. [19] suggest a model architecture that combines semi-supervised learning and attention mechanisms, combining explainable consistency training and a directional attention algorithm to improve prediction ability and achieve an F1-score of 97.21%. Hsia et al. [20] propose a lightweight deep convolutional neural network (LDCNN) for detecting quality in green coffee beans. The model combines DSC, SE block, and skip block frameworks, and includes rectified Adam, lookahead, and gradient centralization to improve efficiency. The local interpretable model-agnostic explanations (LIME) method is used to explain predictions. Experimental results show an accuracy rate of 98.38% and an F1 score of 98.24%, with lower computing time and fewer parameters.

Table 1 The related work in the ResNet (50) architecture

Esgario et al.'s approach using a multitask system based on convolutional neural networks (CNNs) for stress detection in coffee leaves presents an advantage in its focused application. Nevertheless, a potential drawback lies in its limited scope, as the model may not excel in tasks beyond the detection of stress in biotic agents. Chang et al.'s model for defect detection in coffee beans demonstrates reduced bias and achieves high detection accuracy. However, its drawback lies in its restricted scope, optimized primarily for defect detection and possibly lacking generalization to other tasks [6]. Sorte et al.'s computational approach for disease identification in coffee leaves using expert methods is advantageous. However, it is limited to early-phase diseases, potentially struggling with the identification of late-stage diseases [8]. Novtahaning et al.'s ensemble strategy, combining three models for improved classification, demonstrates enhanced performance. However, its drawback lies in its increased complexity, which may be computationally demanding [9]. Velásquez et al.'s diagnostic model integrating wireless sensor networks, remote sensing, and deep learning for Coffee Leaf Rust presents advantages in technology integration. However, the limited F1-score achieved indicates a potential need for improved diagnostic accuracy [10]. Gope et al.'s model for multi-label classification of green coffee beans using branches presents improved performance. However, its specificity to green beans may limit its optimality for other types or stages of coffee beans [11]. Liang et al.'s system integrating YOLOv7 and a convolutional neural network for coffee bean classification is advantageous in its integration of image recognition models. However, its limitation lies in focusing on specific defect categories (broken, insect-infested, and mold), potentially overlooking other defects. Sim et al.'s approach using hyperspectral imaging for rapid and non-destructive origin classification offers advantages. However, its dependency on this technology may limit its universal applicability [13]. While pre-trained models offer efficiency gains in various coffee-related applications, their limitations highlight the importance of choosing or adapting models based on the specific requirements of the task at hand.

3 Methodology

In this study, we apply several pre-trained architectures to classify images of coffee beans using the Coffee Bean Dataset, which contains images of various types of coffee beans [53,54,55]. The purpose of this comparison is to highlight the strengths and weaknesses of these architectures. This section describes the details of the CNNs implemented for coffee-type classification.

This study concentrates on finding the most appropriate pre-trained CNN model. The entire procedure is divided into four basic steps: data acquisition, data training, data classification, and data evaluation, which are detailed below.

3.1 Data acquisition

The Coffee Bean Dataset is a collection of information about coffee beans from around the world; samples are shown in Fig. 2. It includes data on the origin, type, and flavor profile of each bean, as well as information on the growing conditions and processing techniques used to produce them. The images are automatically collected and saved in PNG format, with 4800 images in total, classified into four degrees of roasting with 1200 images per degree, as illustrated in Table 2. We chose the Coffee Bean Dataset for this study because it provides detailed information about coffee beans from many regions worldwide. We wanted to examine the specifics of various coffee beans, including their origin, type, flavor, growing conditions, and processing, in order to find patterns and insights that clarify which factors shape the characteristics of coffee beans [1]. The organization of the dataset, with 4800 images in PNG format, provides a rich source of visual data. Sorting the beans into four roasting levels, with 1200 images per level, not only allows us to examine how roasting changes the appearance of the beans but also opens the door to a more detailed analysis of the flavors linked to each roasting level. The Coffee Bean Dataset therefore aligns closely with the aims of this study: we explore how the characteristics of coffee beans relate to their appearance, while also examining how different roasting levels affect their flavors. This alignment makes the dataset highly relevant and ensures that the study can draw well-founded conclusions that contribute to the broader understanding of coffee bean traits [9].

Fig. 2

Coffee dataset samples

Table 2 Coffee dataset structure
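To make the data acquisition step concrete, the sketch below shows one way the 4800 PNG images, organized into one folder per roasting degree, could be loaded and split for training. The directory name `coffee_dataset/`, the 80/20 split, and the batch size are illustrative assumptions rather than details taken from the original experiments.

```python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# Standard ImageNet preprocessing so the images match the pre-trained backbones.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes one sub-folder per roasting degree inside coffee_dataset/ (hypothetical path).
dataset = datasets.ImageFolder("coffee_dataset", transform=preprocess)
print(dataset.classes)   # the four roast-level class names
print(len(dataset))      # 4800 images in total

# Illustrative 80/20 train/validation split.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
```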

3.2 Data training

In this phase, the CNNs were first trained on the ImageNet dataset in order to initialize the weights before training on the coffee dataset [56,57,58]. In the next stage, we exploited transfer learning, which aims to transfer knowledge from one or more domains and apply it to another domain with a different target task [59]. Fine-tuning is a transfer learning technique that consists of replacing the pre-trained output layer with a layer containing the number of classes in the coffee dataset. The main purpose of using pre-trained CNN models is that they train faster and more easily than a CNN initialized with random weights, and they achieve lower training errors than networks that are not pre-trained. The performance of the following common pre-trained CNN architectures has been evaluated for the coffee classification problem.
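As an illustration of this fine-tuning step, the minimal sketch below loads an ImageNet-pretrained ResNet-50 from torchvision, freezes the convolutional backbone, and replaces the 1000-class output layer with a four-class head for the roasting degrees. It is a generic example of the procedure, assuming torchvision >= 0.13, not the exact configuration used in our experiments.

```python
import torch.nn as nn
from torchvision import models

NUM_COFFEE_CLASSES = 4  # four roasting degrees in the coffee dataset

# Load ResNet-50 with ImageNet weights (torchvision >= 0.13 weights API assumed).
model = models.resnet50(weights="IMAGENET1K_V1")

# Freeze the pre-trained backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the ImageNet output layer with a coffee-specific classification layer.
model.fc = nn.Linear(model.fc.in_features, NUM_COFFEE_CLASSES)
```

After the new head converges, the backbone layers can optionally be unfrozen with a smaller learning rate to fine-tune the entire network.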

3.2.1 AlexNet architecture

AlexNet consists of five convolutional layers followed by three fully connected layers, with the final output fed into a softmax classifier. The AlexNet model uses the Rectified Linear Unit (ReLU) activation, which is applied to each of the first seven layers; the architecture equations are given in [60]. It excels at learning complex features from raw data, resists overfitting, and has a low computational cost, but its main drawback is the high demand for labeled data. We apply it to coffee classification with the steps shown in Algorithm 1.

3.2.2 CSPDarknet53 architecture

The CSPDarknet53 architecture is a CNN based on Darknet-53, offering high object detection accuracy and efficient use of computational resources.

This architecture excels in object detection and image classification, but it has limitations such as high training data requirements, slow inference speed, and reliance on pre-trained weights from ImageNet, which can make it less suitable for real-time applications and new datasets; the architecture equations are given in [61]. Algorithm 2 illustrates the detailed steps for the coffee dataset with CSPDarknet53.

Algorithm 1

The AlexNet architecture for coffee image dataset

Algorithm 2

The CSPDarknet53 architecture for coffee image dataset

3.2.3 Darknet-53 architecture

Darknet-53, a 53-layer deep neural network developed by Joseph Redmon and Ali Farhadi, is a highly accurate real-time object detection backbone used in various applications. Its training process is time-consuming and computationally expensive. Algorithm 3 illustrates the detailed steps for the coffee dataset with Darknet-53; the architecture equations are given in [62].

Algorithm 3

The Darknet-53 architecture for coffee image dataset

3.2.4 DenseNet architecture

DenseNet is a feed-forward convolutional neural network architecture offering improved parameter efficiency and accuracy, making it well suited to image classification tasks with fewer parameters. Despite limitations such as increased memory usage and slower training times, it remains a powerful image classification tool with numerous advantages over traditional CNNs. Algorithm 4 illustrates the detailed steps for the coffee dataset with DenseNet; the architecture equations are given in [63].

3.2.5 EfficientNet architecture

Google AI's EfficientNet is an advanced CNN architecture with improved accuracy, faster training times, and smaller model sizes. It features AutoML-based search for task-specific hyperparameters, although this search adds complexity. Despite its complexity and high computational requirements, EfficientNet is an effective architecture for tasks such as image classification, object detection, and natural language processing, offering high accuracy and efficiency that make it an attractive choice for various applications. Algorithm 5 illustrates the detailed steps for the coffee dataset with EfficientNet; the architecture equations are given in [64].

Algorithm 4

The DenseNet architecture for coffee image dataset

Algorithm 5

The EfficientNet architecture for coffee image dataset

3.2.6 GoogLeNet architecture

GoogLeNet uses Inception modules that apply multiple filter sizes to the input, improving accuracy. However, it has a high computational cost, training difficulties due to its large number of layers, and a need for large amounts of data. Furthermore, it is not suitable for real-time applications because of its complexity and slow inference time. Algorithm 6 illustrates the detailed steps for the coffee dataset with GoogLeNet; the architecture equations are given in [65].

Algorithm 6

The GoogLeNet architecture for coffee image dataset

3.2.7 HRNet architecture

HRNet is a DL architecture that enhances image recognition accuracy through hierarchical, high-resolution feature representations, capturing high-level semantic information with robustness to noise and good scalability, but it faces challenges such as large training data requirements and difficult hyperparameter optimization. Despite its limitations in handling complex scenes and its reliance on data augmentation techniques, it remains an attractive choice for image recognition tasks because of its high accuracy and scalability. Algorithm 7 illustrates the detailed steps for the coffee dataset with HRNet; the architecture equations are given in [66].

Algorithm 7

The HRNet architecture for coffee image dataset

3.2.8 Residual network

ResNet-50 is a deep architecture that can learn complex features from large datasets with relatively few parameters, but it can be computationally expensive and difficult to interpret. Additionally, ResNet-50 may not be suitable for certain types of tasks, such as image generation or natural language processing, because of its limited capacity for learning abstract features. Algorithm 8 illustrates the detailed steps for the coffee dataset with the residual network; the architecture equations are given in [67].

3.2.9 VGG architecture model

VGG is a widely used architecture for object detection, image segmentation, and facial recognition, renowned for its high accuracy in recognizing complex patterns in images. VGG has limitations, including the need for large amounts of training data and a high computational cost due to its numerous parameters. Algorithm 9 illustrates the detailed steps for the coffee dataset with the VGG network; the architecture equations are given in [68].

Algorithm 8

The Residual architecture for coffee image dataset

Algorithm 9

The VGG architecture for coffee image dataset

3.2.10 MobileNetV2 architecture

MobileNetV2 is a computationally efficient and accurate model suitable for mobile applications and smaller datasets, and its inverted residuals and linear bottlenecks make it well suited to real-world scenarios. MobileNetV1 and MobileNetV2 nevertheless have limitations in scaling, accuracy, computational efficiency, and tuning, and require more tuning for optimal performance on complex tasks or datasets. Algorithm 10 illustrates the detailed steps for the coffee dataset with the MobileNetV2 network; the architecture equations are given in [68].

Algorithm 10

The MobileNetV2 architecture for coffee image dataset
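To keep the comparison across these architectures uniform, the head replacement can be wrapped in a small helper such as the hypothetical `build_model` sketch below, which covers the architectures that ship with torchvision; LeNet, HRNet, and the Darknet variants are not part of torchvision and would need custom or third-party implementations. This is an illustrative sketch under those assumptions, not the exact code used in the experiments.

```python
import torch.nn as nn
from torchvision import models

def build_model(name: str, num_classes: int = 4) -> nn.Module:
    """Load an ImageNet-pretrained backbone and swap its output layer.

    Hypothetical helper; only architectures available in torchvision are covered.
    """
    if name == "alexnet":
        m = models.alexnet(weights="IMAGENET1K_V1")
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, num_classes)
    elif name == "vgg19":
        m = models.vgg19(weights="IMAGENET1K_V1")
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, num_classes)
    elif name == "resnet50":
        m = models.resnet50(weights="IMAGENET1K_V1")
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "googlenet":
        # Auxiliary classifiers are disabled so forward() returns a single tensor.
        m = models.googlenet(weights="IMAGENET1K_V1", aux_logits=False)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "mobilenet_v2":
        m = models.mobilenet_v2(weights="IMAGENET1K_V1")
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, num_classes)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights="IMAGENET1K_V1")
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, num_classes)
    elif name == "densenet121":
        m = models.densenet121(weights="IMAGENET1K_V1")
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    else:
        raise ValueError(f"Unsupported architecture: {name}")
    return m
```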

3.3 CNN settings

The CNN settings consist of a series of specific elements, and it is these elements that vary between the different architectures. To allow a fair comparison between the experiments, the hyper-parameters were standardized across all experiments; they are described in Table 3.
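As a sketch of how such standardized settings can be applied uniformly, the snippet below defines a single hyper-parameter dictionary and a minimal Adam-based training loop that is reused unchanged for every architecture. The specific values (learning rate, number of epochs, batch size) are placeholders, not the values reported in the tables.

```python
import torch
import torch.nn as nn

# Illustrative, standardized hyper-parameters (placeholders, not the reported values).
HPARAMS = {"lr": 1e-4, "epochs": 10, "batch_size": 32}

def train(model, train_loader,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    """Minimal Adam-based training loop applied identically to every architecture."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=HPARAMS["lr"])

    for epoch in range(HPARAMS["epochs"]):
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss = {running_loss / len(train_loader):.4f}")
```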

Table 3 A general overview comparison of the pre-trained architectures used

3.4 Data classification

The number of nodes in the classification output layer is equal to the number of classes. Each output then carries a different probability for the input image, because these kinds of models automatically learn features during the training stage; the model picks the highest probability as its prediction of the class. This phase determines which coffee type is present in the image using the pre-trained models. Figure 3 shows the general structure of our proposed work.
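A minimal sketch of this classification step is shown below: the trained model produces one logit per class, a softmax converts them to probabilities, and the class with the highest probability is returned as the prediction. The class names, file name, and `preprocess` transform are illustrative assumptions.

```python
import torch
from PIL import Image

CLASS_NAMES = ["dark", "green", "light", "medium"]  # illustrative roast-level labels

def classify(model, image_path, preprocess, device="cpu"):
    """Return the predicted roast level and its probability for a single image."""
    model.eval()
    model.to(device)
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)[0]  # one probability per class
    idx = int(torch.argmax(probs))
    return CLASS_NAMES[idx], float(probs[idx])

# Usage (hypothetical file name):
# label, confidence = classify(model, "sample_bean.png", preprocess)
```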

Fig. 3

The main structure for our proposed work

4 Result and discussion

In this section, we evaluate the performance of various neural network architectures for coffee classification. AlexNet, CSPDarknet53, Darknet, DenseNet121, EfficientNet, GoogleNet, HRNet, LeNet, MobileNetV2, ResNet-50, and VGG-19 showed impressive accuracy and efficiency.

The study emphasizes the importance of choosing the right neural network for specific tasks, considering factors like model complexity and computational efficiency. The experiments highlight the diverse capabilities of these architectures.

4.1 The pre-trained architecture

In this section, we compare the pre-trained architectures used in the proposed work for coffee classification.

4.1.1 AlexNet architecture

AlexNet's coffee classification model faces an overfitting risk, particularly on smaller datasets, because its complex architecture and large number of parameters cause rapid convergence on the training data but reduced accuracy in real-world scenarios. Figure 4 shows the learning curves that illustrate these drawbacks.

Fig. 4

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.2 LeNet architecture

LeNet's shallow architecture may hinder its ability to capture complex features in coffee images, causing slower convergence and limited pattern learning, potentially impacting classification accuracy, especially with fine-grained coffee varieties. Figure 5 shows the learning curves that illustrate these drawbacks.

Fig. 5

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.3 HRNet architecture

HRNet's high-resolution architecture increases computational intensity because of its numerous parameters and slow convergence in the learning curve, requiring substantial resources for training and inference. Figure 6 shows the learning curves that illustrate these drawbacks.

Fig. 6

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.4 GoogleNet architecture

GoogleNet's intricate architecture, utilizing multiple inception modules with varying kernel sizes, can increase training time and slow convergence in the learning curve compared to simpler models. Figure 7 shows the learning curves that illustrate these drawbacks.

Fig. 7

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.5 MobileNetV2 architecture

MobileNet architectures are lightweight and efficient, reducing the number of parameters for resource-constrained devices. However, this can limit their ability to handle complex tasks such as coffee classification accurately. Figure 8 shows the learning curves that illustrate these drawbacks.

Fig. 8

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.6 ResNet-50 architecture

ResNet-50's deep architecture demands significant computational resources for training, potentially leading to extended training times and high memory usage, making it less practical for applications with limited computational capabilities. Figure 9 shows the learning curves that illustrate these drawbacks.

Fig. 9

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.7 VGG architecture

VGG-19's deep stack of layers and very large number of parameters demand significant computational resources and memory for training, potentially leading to extended training times and making it less practical for applications with limited computational capabilities. Figure 10 shows the learning curves that illustrate these drawbacks.

Fig. 10

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.8 EfficientNet architecture

EfficientNet models with fewer parameters can be efficient in resource-constrained environments but may struggle with complex coffee image datasets because of their limited learning capacity. Figure 11 shows the learning curves that illustrate these drawbacks.

Fig. 11

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.9 Darknet architecture

Darknet's deep architecture and capacity can require significant computational resources for training and inference, making it less practical for resource-constrained environments or real-time applications. Figure 12 shows the learning curves that illustrate these drawbacks.

Fig. 12

a BoxPlot, b violin curve, c KDI curve, d learning curve

4.1.10 DenseNet architecture

DenseNet's architecture, featuring dense skip connections, can be challenging to understand and optimize because of its complex hyperparameter configuration and debugging, and it can show a slower learning curve. Figure 13 shows the learning curves that illustrate these drawbacks.

Fig. 13

a BoxPlot, b violin curve, c KDI curve, d learning curve

The evaluation of deep learning models for coffee classification reveals AlexNet, LeNet, HRNet, GoogleNet, MobileNetV2, ResNet, VGG, EfficientNet, Darknet, and DenseNet as robust models with high sensitivity, precision, and accuracy, but with moderate F1 Scores and potential computational complexity. These insights help in selecting suitable models for coffee classification, as illustrated in Table 3 and Table 4.

Table 4 Comparison of hyperparameters for the pre-trained architectures used
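The metrics reported for each model (sensitivity, specificity, precision, negative predictive value, accuracy, and F1 score) can be derived from a confusion matrix as in the sketch below, which illustrates the standard per-class definitions rather than the exact evaluation code used in this study.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes=4):
    """Compute sensitivity, specificity, precision, NPV, accuracy, and F1 per class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true class, columns: predicted class

    results, total = {}, cm.sum()
    for c in range(num_classes):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp
        fp = cm[:, c].sum() - tp
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        npv = tn / (tn + fn) if tn + fn else 0.0
        accuracy = (tp + tn) / total
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if precision + sensitivity else 0.0)
        results[c] = dict(sensitivity=sensitivity, specificity=specificity,
                          precision=precision, npv=npv, accuracy=accuracy, f1=f1)
    return results
```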

The evaluation results provide a nuanced understanding of each architecture's advantages and drawbacks. While some models excel in specific aspects, considerations of computational complexity and task-specific requirements are crucial for informed model selection. These insights contribute valuable guidance for practitioners seeking optimal deep learning models for classification tasks.

5 Conclusion

The global coffee industry, a vital sector for numerous nations, is currently grappling with economic instability due to fluctuations in global coffee prices. The study utilized deep learning techniques, specifically pre-trained models, to accurately predict coffee types. The selection of the best pre-trained model is crucial due to the growing demand for specialty coffee and the need for precise classification. In this study, we focus on evaluating the effectiveness of various pre-trained models through deep learning techniques. The motivation behind opting for transfer learning lies in the recognition that leveraging knowledge gained from models trained on large datasets can significantly boost the performance of models trained on smaller datasets. The increasing popularity of specialty coffee amplifies the need for accurate classification, making the selection of an optimal pre-trained model crucial. In our comprehensive comparison, we assess several well-known pre-trained models, including AlexNet, LeNet, HRNet, GoogleNet, MobileNetV2, ResNet-50, VGG, EfficientNet, Darknet, and DenseNet. Through transfer learning and fine-tuning, we gauge the models' ability to generalize to the coffee classification task, as illustrated in Table 5 and Fig. 14. We reveal the pivotal role of the pre-trained model choice in influencing performance, with specific models demonstrating higher accuracy and faster convergence than conventional alternatives. By employing key evaluation metrics such as sensitivity, specificity, precision, negative predictive value, accuracy, and F1 score, we provide nuanced insights into the complex landscape of pre-trained models. This strategic use of transfer learning and fine-tuning not only enhances the accuracy of coffee classification but also contributes to addressing economic challenges associated with global price fluctuations in coffee bean production. In addition to advancing our understanding of coffee bean production dynamics, our study has tangible implications for real-world problem-solving.

Table 5 The state-of-the-art comparison between the proposed work models
Fig. 14

The chart with standard deviations for the proposed model attributes

The challenges faced by coffee production in the face of global price fluctuations directly impact the economic stability of countries reliant on this industry. As the specialty coffee market continues to grow, precise classification becomes paramount. Our comprehensive evaluation, leveraging transfer learning and fine-tuning, ensures that the selected pre-trained models can be practically applied in the coffee industry for more accurate and efficient classification. In essence, our study transcends the theoretical realm by offering a valuable resource for coffee producers, processors, and distributors. The nuances uncovered through our evaluation metrics, including sensitivity, specificity, precision, negative predictive value, accuracy, and F1 score, provide actionable insights into the practical use of pre-trained models in addressing challenges faced by the coffee industry. This research not only enhances our understanding of the intricate landscape of pre-trained models but also contributes to the development of tools that can make a real impact in the coffee production sector. In the future, we will apply this approach to the real-time classification of coffee bean images, which could potentially lead to improved efficiency in sorting and categorizing coffee beans for commercial purposes.