Introduction

Longjing tea is a famous green tea loved by consumers all over the world. After thousands of years of development, it has become one of the most popular teas. Longjing tea is mainly planted in Zhejiang Province, China, and its production areas can be divided into three main geographical origins: the West Lake Zone, the Qiantang Zone, and the Yuezhou Zone. Different geographical origins produce different subtypes of Longjing tea [1, 2]. According to differences in growth environment, processing technique, and plucking time, the Longjing tea from each geographical origin can also be classified into different quality levels [3]. From a marketability perspective, high quality equates to high prices and high profits. Therefore, it is of great significance to identify the quality of the different subtypes of Longjing tea.

Traditional manual identification of tea quality is typically labor intensive, time consuming, and subjective. In recent years, image-based computerized systems using image processing and machine learning techniques have been developed to overcome these problems. Popular classifiers, including K-nearest neighbor (KNN), random forest (RF), artificial neural network (ANN), and support vector machine (SVM), have been used for quality identification of different kinds of tea and achieved excellent results [4,5,6]. It is known that accurate identification also relies on effective hand-designed features, such as color, texture, and shape [7, 8]. However, different qualities of tea tend to have only minor differences in appearance, resulting in low identification accuracy for hand-designed features combined with classical machine learning methods [9].

Deep learning models have achieved great success in many computer vision tasks [10]. The most commonly used tools in deep learning are convolutional neural networks (CNNs), which are both expressive and efficient. CNNs have high discrimination ability and have proven to provide good results in precision agriculture [11, 12]. A CNN model provides an end-to-end solution that extracts features and classifies them with a high degree of automation. In addition, the self-learned high-level features are powerful enough to deal with many complex and high-similarity problems [13,14,15]. However, training a CNN model requires a large number of labeled images with ground truth. In our situation, collecting and annotating many images for each quality level of Longjing tea and every geographical origin is undoubtedly tedious and expensive.

As a new branch of machine learning, transfer learning can take advantage of the similarities between data, tasks, or models and apply the models and knowledge learned in an existing domain (called the source domain) to a new domain (called the target domain) [16]. Using the similarity between different datasets to achieve transfer is an intuitive idea. Labeled training samples from other available datasets can be incorporated as complementary training data using a transfer learning strategy, so the current dataset may no longer require a large amount of data. In particular, Longjing tea subtypes from different geographical origins all belong to Longjing tea and are highly similar. They share common knowledge and have the potential to save much sample collection and processing work through transfer learning. Hence, combining deep learning and transfer learning can not only take advantage of deep neural networks to extract discriminative semantic features but also reduce data hunger through knowledge transfer. At present, some scholars have applied basic transfer learning strategies, such as fine-tuning pre-trained models or feature extraction, to plant disease and pest detection [11, 17, 18], fruit classification [13, 19], and sheep facial expression classification [20]. The results show that transfer learning is an effective strategy for building high-performance classification models.

However, some problems hinder further development. First, most current deep transfer learning is limited to reusing the knowledge in a pre-trained model by saving and adjusting its parameters. The dataset used to pre-train such models is the general large-scale visual dataset ImageNet, which has good universality but is not targeted at specific tasks, so the transfer effect is limited. Second, due to the gap in data distribution, not every sample in the source domain is suitable for transfer learning. In some cases, ‘negative transfer’ may even occur, severely reducing model performance [21]. Deep convolutional neural networks combined with the softmax classifier cannot filter out suitable transfer learning samples at the instance level.

To solve the problems mentioned above, we propose an instance-based deep transfer learning method for the quality identification of Longjing tea in this paper. First, the MobileNet V2 model is trained using the hybrid training dataset containing all labeled samples from source and target domains. The trained MobileNet V2 model is used as a feature extractor instead of directly using the pre-trained model. Then the multiclass TrAdaBoost algorithm is proposed for instance-based transfer learning, and valuable samples in the source domain are given higher weights to improve transferability. With the help of Longjing tea images from other geographical origins, the proposed method can accurately identify the quality of Longjing tea in the current geographical origin with limited samples.

The contributions of this paper are as follows:

  • According to the common demands of image-based tea quality identification, we build three novel Longjing tea quality datasets. Longjing tea images from the three geographical origins of West Lake, Qiantang, and Yuezhou are collected, and the tea from each geographical origin contains four grades. The Longjing tea from West Lake, which contains more labeled samples, is regarded as the source domain, and the Longjing tea from the other two geographical origins, each with only very limited labeled samples, is regarded as the target domains. The tasks of all domains are the same, i.e., to realize the quality identification of tea. The constructed datasets can be used to verify the intra-domain and cross-domain classification performance of the model. As far as we know, there are few cross-domain classification datasets in the agricultural field.

  • The feature extraction capabilities of four common lightweight CNN architectures constructed using different training strategies are compared. The results show that the MobileNet V2 model trained with hybrid training datasets containing all labeled samples from source and target domains has the best feature extraction ability. The trained MobileNet V2 model is used as a feature extractor.

  • We propose a novel multiclass TrAdaBoost algorithm. It extends the original TrAdaBoost to the multiclass classification problem, maintains low computational complexity, and avoids class imbalance. The multiclass TrAdaBoost is trained with the deep features extracted from the MobileNet V2 model.

  • We explore the effect of the proposed instance-based deep transfer learning method on the performance of Longjing tea quality classification based on extensive experiments. The effectiveness of the proposed method is validated in detail.

The remainder of this paper is structured as follows: Sect. “Related works” introduces the research related to this study. Sect. “Materials and methods” provides a detailed description of the materials used and the methods proposed in this paper. Sect. “Results and discussion” presents the comparison results and discussions. Sect. “Conclusion” summarizes the conclusions.

Related works

In this section, some related research on image-based tea quality identification, transfer learning, and boosting algorithms is introduced. Some limitations of the current study are also summarized.

Image-based tea quality identification

Tea appearance is an important attribute that can directly reflect tea quality [5]. Generally, computer vision systems (CVSs), which mimic the human vision process, are designed to measure the appearance of samples. Compared with other non-destructive testing technologies (such as the electronic nose, electronic tongue, and near-infrared spectroscopy), a computer vision system for image collection is easy to establish, the collection speed is fast (a sample takes only a few seconds), and the amount of information in images is rich. Many image-based studies have been carried out, and the effectiveness of image-based methods has been validated [22]. Gill et al. [23] discriminated between four different grades of made black tea by texture features and multilayer perceptron (MLP) techniques, achieving 82.33% classification accuracy. Bakhshipour et al. [24] used two common heuristic feature selection methods, correlation-based feature selection (CFS) and principal component analysis (PCA), to select the most significant features. The results showed that an ANN with a 7-10-4 topology developed from the CFS-selected features provided the best classifier, with a classification rate of 96.25%.

In recent years, deep learning has provided powerful tools for research related to the tea industry. Liu et al. [25] compared the quality identification results of Chinese chrysanthemum tea products obtained with multivariate classification models and deep learning methods. The results showed that the classification performance of a self-designed simple deep neural network significantly outperforms the other multivariate classification models. Zhang et al. [26] built a 12-layer CNN for the classification of 3 kinds of tea. Data augmentation and stochastic gradient descent with momentum (SGDM) were used in the training phase. The experiments showed that the 12-layer CNN gives good results: the sensitivities for oolong, green, and black tea are 99.5%, 97.5%, and 98.0%, respectively. Chen et al. [27] developed a CNN model named LeafNet to extract the features of tea plant diseases from images and constructed SVM and MLP classifiers. The results show that LeafNet was superior to the MLP and SVM algorithms in the recognition of tea leaf diseases. Kimutai et al. [28] proposed a deep learning model named TeaNet to detect the optimum fermentation of tea. The experimental results showed that TeaNet was superior in the classification tasks to other machine learning techniques, including KNN, SVM, RF, and linear discriminant analysis (LDA). Kimutai et al. also explored the use of the internet of things (IoT) and CNNs with majority voting techniques in detecting the optimum fermentation of black tea. The deep learner recorded the highest precision and accuracy of 95.89% and 86.46%, respectively, when evaluated on real-time images.

The high-level features automatically extracted by deep CNNs are powerful and representative enough to deal with more challenging situations, such as the high similarity between different tea qualities [10, 29]. However, CNN-based deep learning models have a large number of parameters. For example, the classical residual network ResNet-50 [30] has more than 25 million parameters. Such a large number of parameters requires massive data for training to prevent over-fitting. The task of tea quality identification requires researchers to collect images themselves to construct the dataset, and collecting and labeling a large number of images is undoubtedly very expensive and difficult. Hence, the current obstacle to deep learning in tea quality identification is mainly the contradiction between the massive amount of data required by deep learning and the manual data collection needed for tea quality identification. One possible solution is using transfer learning to learn general knowledge and reduce the amount of learning [31]. By combining the transfer learning strategy with the deep learning model, the transferability of deep learning is brought into play.

Transfer learning overview

Pan and Yang [16] give a classical definition of transfer learning: Given a source domain \(D_{S}\) and learning task \(T_{S}\) and a target domain \(D_{T}\) and learning task \(T_{T}\), transfer learning aims to help improve the learning of the target predictive function \(f_{T} (.)\) in \(D_{T}\) using the knowledge in \(D_{S}\) and \(T_{S}\), where \(D_{S} \ne D_{T}\), or \(T_{S} \ne T_{T}\). Pan and Yang also divide transfer learning approaches into four categories according to principles: instance-based transfer, feature-based transfer, parameter/model-based transfer, and relation-based transfer. The instance-based transfer learning approach relies on reweighting some labeled data from the source domain for use in the target domain, which is very intuitive, concise, and highly interpretable in theory. Much research work focuses on estimating the distribution ratio of the source domain and the target domain and using it as the weight of the samples [32,33,34,35,36].

Deep learning dramatically expands the scope of transfer learning and provides more possibilities. Experiments have proven that the hierarchical structures of CNNs have scalability and domain transferability [37]. Different fine-tuning methods, including extracting the output features of a particular layer, using pre-trained model parameters as initialization, and freezing or modifying the trainable parameters of particular layers, have achieved good results in many application scenarios [38, 39]. Zhu et al. [40] used the deep features from 12 CNN models to train an SVM classifier for carrot appearance recognition. The deep features of the fully connected layers of three network models (AlexNet, VGG16, VGG19) were also extracted and compared. The results showed that the accuracy of deep features with SVM was superior to that of the fine-tuned models. Arora et al. [41] utilized a pre-trained CNN model to achieve acrylamide identification in potato chips. The learning rate, optimization techniques, and loss function were also compared and discussed. Simulation results demonstrated that MobileNet V2 outperformed the AlexNet, ResNet-34, ResNet-101, VGG-16, and VGG-19 models. Guo et al. [42] proposed the transfer weighted extreme learning machine (TWELM) classifier to solve the class imbalance problem. Experimental results on real-world datasets show that TWELM outperforms existing algorithms in classification accuracy and computation cost. A similar knowledge matching strategy also makes great contributions to solving dynamic multi-objective optimization problems [43]. The deep transfer learning method based on fine-tuning a CNN model inherits the advantages of deep learning: it can obtain high-level, powerful features and has good generalization and robustness. In recent years, the Vision Transformer (ViT), as a new backbone, has made great achievements in various visual tasks. By considering the global information of the image, ViT is more competitive in some visual classification tasks [44, 45]. Scholars have proved that ViT can be used in traffic sign classification [46], plant disease detection [47], and face recognition [48]. ViT-based transfer learning has begun to attract more and more attention [49,50,51].

For our task, the main characteristic is high similarity in two respects: the high similarity between different classes within a domain and the high similarity between domains. Powerful features obtained by fine-tuning CNN models can distinguish different classes with high similarity. At the same time, the high similarity between domains implies a similar data distribution, which is well suited to the instance-based transfer learning approach. Hence, fusing an instance-based transfer learning method with a deep learning model is an intuitive and promising solution. However, the deep learning model contains the characteristic information of a large amount of data, and how to associate deep learning with instance-based transfer learning methods is a major challenge. Current research rarely addresses this issue.

Boosting for classification and transfer learning

Boosting is a general concept of improving a learning algorithm’s performance by combining a group of ‘weak learners’ to generate a ‘strong learner’ [52]. The AdaBoost algorithm is the first and classic boosting method [53]. In each iteration, the relative weights of incorrectly classified samples are increased, and the weights of correctly classified samples are decreased. Essentially, AdaBoost is an instance-based learning algorithm. To extend AdaBoost to multiclass classification problems, scholars have made different improvements. AdaBoost.M1 [54] adjusted the weight update function to adapt to multiclass classification: in each iteration, the weights of the incorrectly classified samples remain unchanged, while the weights of the correctly classified samples decrease. AdaBoost.OC [55] was proposed to solve multiclass classification problems by combining AdaBoost and error-correcting output codes [56]. To improve computational efficiency, Hastie et al. [57] extended the AdaBoost algorithm by stagewise additive modeling (SAMME) using a multiclass exponential loss function. SAMME has shown outstanding performance with low computational complexity and has become the current mainstream version of AdaBoost.

Boosting-based transfer learning algorithms are instance-based transfer learning approaches that utilize labeled data from the source domain to improve classification performance in the target domain through a reweighting strategy. Dai et al. [58] proposed the popular boosting-based transfer learning algorithm TrAdaBoost and adapted it with SVM as the base learner for two-class text classification. The main principle of TrAdaBoost is to utilize available source data that share some similarities with the target data and to characterize data distribution differences by reweighting. Similar samples are screened out, and negative transfer is effectively avoided. To extend TrAdaBoost to multiclass classification, Li et al. [59] adapted the conventional TrAdaBoost for sandstone microscopic image classification by applying the one-vs.-all method. However, extending a binary classification algorithm to multiclass classification directly through the one-vs.-all or one-vs.-one [60] method brings data imbalance or high computational complexity and is not the optimal solution for the multiclass classification problem. A more efficient multiclass TrAdaBoost algorithm is therefore urgently needed. On the other hand, the base learner also greatly influences the performance of the overall TrAdaBoost algorithm. The effect of the base learner is worth in-depth study, yet few studies compare the effects of different base learners on their classification tasks.

Materials and methods

In this section, the details of the dataset construction and proposed methods are illustrated in appropriate subheadings.

Data collection and pre-processing

Three subtypes of Longjing tea images from the geographical origins of the West Lake Zone (Wllj), Qiantang Zone (Qtlj), and Yuezhou Zone (Yzlj) were collected and pre-processed, forming three tea quality datasets. Every subtype of Longjing tea is divided into four grades. According to GB/T 18650-2008, the Longjing tea geographical origins are in the center of Zhejiang Province, primarily confined to 28.87–30.55°N and 118.38–121.22°E. To verify the proposed method, the three datasets have different sample sizes. The Wllj dataset is regarded as the source domain (\(D_{S}\)) with more labeled images with ground truth. The Qtlj dataset and Yzlj dataset are regarded as target domains (\(D_{{T_{1} }}\), \(D_{{T_{2} }}\)) with few labeled images with ground truth. The details of the three datasets are shown in Table 1.

Table 1 Details of three datasets from different Longjing tea geographical origins

The tea image samples were collected in a dark room isolated from outside light. During each collection, 5 g of tea leaves were placed on a white platform (length: 200 mm, width: 200 mm). A CCD camera (OSEECAM H1600ST, Shenzhen Weishenshidai Technology Co., Ltd., Shenzhen, China) with a 5-mm lens was used to capture images from 120 mm above the sample. A ring LED light (produced by Shenzhen Weishenshidai Technology Co., Ltd., Shenzhen, China) was placed at the same level as the lens and coaxial with it, providing uniform and stable white-light illumination for all image samples. The raw images were saved in JPG format with RGB color mode at a resolution of 1920 × 1080. To eliminate the influence of background and possible distortion, a 600 × 600 region of interest (ROI) centered on the central pixel of the raw image was extracted for each sample (shown in Fig. 1), and the ROI image samples are used as input to the proposed model. Examples of the different Longjing tea datasets are shown in Fig. 2. It can be seen that there are only slight differences in the appearance of different qualities of Longjing tea. Good-quality Longjing tea has a flat and smooth appearance, the single leaves are uniform and complete, and there is little slag. As the quality decreases, the color of the tea becomes uneven, the shape of single leaves becomes irregular, sharp edges appear, and more slag appears.
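A minimal sketch of this ROI extraction step is given below (OpenCV-based; the function name and file paths are illustrative assumptions, not the authors' code):

```python
import cv2
import numpy as np

def extract_center_roi(image_path: str, roi_size: int = 600) -> np.ndarray:
    """Read a raw image and return the central roi_size x roi_size crop."""
    img = cv2.imread(image_path)              # BGR array, shape (1080, 1920, 3)
    h, w = img.shape[:2]
    cy, cx = h // 2, w // 2                   # central pixel of the raw image
    half = roi_size // 2
    return img[cy - half:cy + half, cx - half:cx + half]

# Hypothetical usage:
# roi = extract_center_roi("raw/wllj_grade1_001.jpg")
# cv2.imwrite("roi/wllj_grade1_001.jpg", roi)
```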

Fig. 1
figure 1

Schematic diagram of pre-processing extraction of the ROI

Fig. 2
figure 2

Example image samples of three datasets

For every class in each target domain, only 10 labeled image samples with ground truth can participate in model training and validation each time. The remaining 90 samples are used for model generalization performance evaluation. The training samples in the target domain are randomly selected. Considering that manual collection and labeling of samples is time consuming and labor intensive, we keep as few labeled image samples as possible in the target domains. In this way, the significance and value of our research are more prominent.

CNN model building and feature extraction

An end-to-end deep learning classification model is often the first choice for large-scale classification problems. However, building an end-to-end CNN requires a large amount of data; otherwise, the classification model will fail due to over-fitting. The target domain datasets used in this study contain limited annotated data, so they are not suitable for directly building a deep learning classification model. The classification model must therefore be built with the help of the complementary source domain and a transfer learning strategy.

Extracting features from a CNN model to adapt to a specific visual task is a common transfer learning strategy. In this approach, input images are fed directly to a CNN model, the deep features are extracted from a particular layer, and a feature vector is obtained. After deep feature extraction, classical machine learning algorithms are applied to develop the classification model. This approach has been successfully applied to several classification problems [61, 62]. Typically, the CNN models used for feature extraction are trained on the ImageNet dataset, and the activation values of the fully connected layers of those pre-trained CNNs are used as features. However, ImageNet, as a source domain, is very different from the tea quality identification dataset, which is not conducive to improving performance [37, 63]. Therefore, in this study it is necessary to retrain the model on the source and target domains.

We use all labeled samples in the source and target domains and combine them by label to obtain a hybrid training dataset. The CNN model trained with the hybrid training dataset contains information from both the source domain and the target domain, and the extracted deep features can better represent the similarity and difference between classes.
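As one possible realization, the hybrid training set can be assembled by concatenating the two domains with torchvision, assuming each domain is stored as an image folder with identical class sub-directories (the directory names below are assumptions):

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Both folders are assumed to use identical class sub-directory names
# (e.g., grade_1 ... grade_4), so ImageFolder assigns matching label indices.
source = datasets.ImageFolder("data/wllj_train", transform=train_tf)      # source domain
target = datasets.ImageFolder("data/qtlj_train_10", transform=train_tf)   # 10 labeled samples per class

hybrid_train = ConcatDataset([source, target])   # hybrid training dataset
loader = DataLoader(hybrid_train, batch_size=32, shuffle=True)
```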

Multiclass TrAdaBoost

The proposed multiclass TrAdaBoost algorithm extends the original TrAdaBoost proposed by Dai et al. [58] to the multiclass classification problem. Compared to the one-vs.-all method, the proposed algorithm has low computational complexity and avoids class imbalance. The key idea of TrAdaBoost is updating the sample weights of the source domain and target domain separately. The sample weights updating mechanism of TrAdaBoost is as follows:

$$ w_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {w_{i}^{t} \beta^{{\left| {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) - y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} \right|}} ,} & {1 \le i \le m} \\ {w_{i}^{t} \beta_{t}^{{ - \left| {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) - y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} \right|}} ,} & {m + 1 \le i \le m + n} \\ \end{array} } \right. $$
(1)

Here, m is the number of samples in the source domain, and n is the number of samples in the target domain. \(w_{i}^{t}\) is the weight of sample i at iteration t, \({\varvec{x}}_{{\varvec{i}}}\) is the feature vector of sample i extracted from the aforementioned CNN model, \(h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right)\) is the predicted label, and \(y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)\) is the true label. The multiplier for source domain samples is defined as \(\beta = 1/\left( {1 + \sqrt {2\ln m/N} } \right)\), where N is the maximum number of iterations. The multiplier for target domain samples is defined as \(\beta_{t} = \varepsilon_{t} /\left( {1 - \varepsilon_{t} } \right)\), where \(\varepsilon_{t}\) is the overall error of \(h_{t}\) on all target domain samples at iteration t. In the iterative process, the weights of wrongly predicted source domain samples are decreased, and the weights of wrongly predicted target domain samples are increased. In contrast, the weights of correctly predicted samples are kept unchanged. The original TrAdaBoost has shown strong transfer learning ability even when only a few labeled samples are available in the target domain [64,65,66].

Following the key idea of the original TrAdaBoost, we extend TrAdaBoost to multiclass classification by modifying the sample weight updating mechanism for the source domain and the target domain separately. For the target domain, the small amount of labeled data involved in training has the same distribution as the test data. Hence, the SAMME update [57], the same as in the common multiclass AdaBoost, is adopted:

$$ w_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {w_{i}^{t} e^{{ - \frac{K - 1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) = y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & {m + 1 \le i \le m + n} \\ {w_{i}^{t} e^{{\frac{1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) \ne y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & {m + 1 \le i \le m + n} \\ \end{array} } \right. $$
(2)

Here, \(\alpha_{t}\) is the multiclass weight updating parameter based on the exponential loss function, defined as \(\alpha_{t} = \log \left( {\left( {1 - \varepsilon_{t} } \right)/\varepsilon_{t} } \right) + \log \left( {K - 1} \right)\), where K is the number of classes. Comparing \(\alpha_{t}\) with \(\beta_{t}\) above, it can be seen that \(\alpha_{t}\) is computed from the same target domain error \(\varepsilon_{t}\) as \(\beta_{t}\). For the source domain, SAMME is also adopted, as follows:

$$ w_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {w_{i}^{t} e^{{ - \frac{K - 1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) = y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & {1 \le i \le m} \\ {w_{i}^{t} e^{{ - \frac{K - 1}{K}\alpha_{t} }} \cdot \beta ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) \ne y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & {1 \le i \le m} \\ \end{array} } \right. $$
(3)

Al-Stouhi et al. [67] found that in the original TrAdaBoost the weights of correctly predicted source domain samples drop rapidly and proposed the correction factor \(C_{t}\) to alleviate this weight-drift effect:

$$ C_{t} = 2\left( {1 - \varepsilon_{t} } \right) $$
(4)

By combining Eq. (2) for the target domain, Eq. (3) for the source domain, and the correction factor in Eq. (4), the modified sample weight updating mechanism for multiclass TrAdaBoost is obtained:

$$ w_{i}^{t + 1} = \left\{ {\begin{array}{*{20}l@{\quad}l@{\quad}l} {w_{i}^{t} \cdot 2\left( {1 - \varepsilon_{t} } \right) \cdot e^{{ - \frac{K - 1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) = y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} &\quad {1 \le i \le m} \\ {w_{i}^{t} \cdot 2\left( {1 - \varepsilon_{t} } \right) \cdot e^{{ - \frac{K - 1}{K}\alpha_{t} }} \cdot \beta ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) \ne y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & \quad {1 \le i \le m} \\ {w_{i}^{t} \cdot e^{{ - \frac{K - 1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) = y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & \quad {m + 1 \le i \le m + n} \\ {w_{i}^{t} \cdot e^{{\frac{1}{K}\alpha_{t} }} ,} & {h_{t} \left( {{\varvec{x}}_{{\varvec{i}}} } \right) \ne y\left( {{\varvec{x}}_{{\varvec{i}}} } \right)} & \quad {m + 1 \le i \le m + n} \\ \end{array} } \right. $$
(5)

By incorporating the modified sample weight updating mechanism into the original TrAdaBoost, the multiclass TrAdaBoost is obtained as follows:

In Algorithm 1, labeled dataset \(T_{tar}\) and unlabeled dataset S have the same data distribution, and both belong to the target domain. The base classifier Learner can be any simple multiclass machine learning algorithm. The number of samples in \(T_{tar}\) is much less than that in \(T_{src}\) (\(n \ll m\)). In short, the multiclass TrAdaBoost utilizes a small amount of labeled data in the target domain, supplemented by a large amount of relevant source domain data, to achieve instance-based transfer learning. Our contribution is to extend the sample weights updating mechanism with the help of the SAMME algorithm and the correction factor \(C_{t}\). The rest maintains the simplicity and ease of use of the original algorithm.

figure a
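For concreteness, the sketch below condenses the training loop of the multiclass TrAdaBoost, combining the update rules in Eq. (5) with the \(\beta\) and \(\alpha_{t}\) defined above. It is an illustration under our reading of Algorithm 1, not the authors' released code; the variable and function names are our own, and the final prediction (a SAMME-style weighted vote over the learners) is omitted for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def multiclass_tradaboost(Xs, ys, Xt, yt, n_rounds=50, base=None, K=4):
    """Xs/ys: source-domain features/labels; Xt/yt: the few labeled target samples."""
    base = base if base is not None else DecisionTreeClassifier(max_depth=2)
    m, n = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    w = np.ones(m + n) / (m + n)                          # initial sample weights
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(m) / n_rounds))
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = clone(base).fit(X, y, sample_weight=w / w.sum())
        wrong = h.predict(X) != y
        # epsilon_t: weighted error on the target-domain samples only
        eps = np.clip(np.dot(w[m:], wrong[m:]) / w[m:].sum(), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps) + np.log(K - 1)   # SAMME parameter
        C = 2.0 * (1.0 - eps)                             # correction factor, Eq. (4)
        # Source domain, Eq. (5): SAMME decay with correction for all samples,
        # plus an extra beta multiplier for wrongly predicted ones
        w[:m] *= C * np.exp(-(K - 1) / K * alpha)
        w[:m][wrong[:m]] *= beta
        # Target domain, Eq. (5): standard SAMME update
        w[m:][~wrong[m:]] *= np.exp(-(K - 1) / K * alpha)
        w[m:][wrong[m:]] *= np.exp(alpha / K)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```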

The framework of the proposed approach

This research aims to achieve accurate quality identification of Longjing tea with limited training samples. With the power of transfer learning, a complementary dataset with more labeled Longjing tea images from another geographical origin is utilized as the source domain to boost the classification performance. To benefit from the available Longjing tea quality datasets from different geographical origins and minimize the negative effect of distribution dissimilarity, we design a transfer learning framework to incorporate the source domain into building the classification model, as described in Fig. 3. Dataset 1 is the source domain dataset, which acts as a complementary dataset. Dataset 2 is the target domain dataset, splitting into training dataset 2a and testing dataset 2b. The two main steps of the proposed instance-based deep transfer learning method are CNN feature extraction and multiclass TrAdaBoost. The features of each instance are extracted by the CNN model trained by dataset 1 and dataset 2a. The extracted feature vector represents a single instance and is imported into the proposed multiclass TrAdaBoost algorithm for classification. This is how the instance-based deep transfer learning method works.

Fig. 3
figure 3

The framework of the proposed approach. In training phase (a), CNN is trained using samples from the source and target domains (Dataset 1 and Dataset 2a). The feature vectors are extracted and used to train the multiclass TrAdaBoost algorithm. In the testing phase (b), the unknown samples in the target domain Dataset 2b are classified using the trained models

Performance metrics

Tea quality identification is a multiclass classification task, so it is necessary to adopt evaluation metrics that allow a comprehensive comparison across different configurations. The model performance is evaluated by accuracy, precision, recall, and F1 score, defined in Eqs. (6) to (9) [68, 69].

$$ {\text{Accuracy = }}\frac{{\text{TP + TN}}}{{\text{TP + FP + TN + FN}}} $$
(6)
$$ {\text{Precision = }}\frac{{{\text{TP}}}}{{\text{TP + FP}}} $$
(7)
$$ {\text{Recall = }}\frac{{{\text{TP}}}}{{\text{TP + FN}}} $$
(8)
$$ F1{\text{ score = 2}} \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(9)

where TP is the number of true positives, TN true negatives, FP false positives, and FN false negatives. As the equations show, precision describes how many of the positive results predicted by the classifier are true positives, while recall describes how many of the true positives in the test set are picked out by the classifier. The F1 score is the harmonic mean of precision and recall and therefore considers both. Since we have four balanced classes in each dataset, macro-F1 is used for comparison. The classes of the target domain datasets used in our experiments are well balanced (shown in Table 1), so the multiway accuracy is also a good overall measure of performance [70].
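For reference, these metrics can be computed with scikit-learn as sketched below ('macro' averaging corresponds to the macro-F1 described above; the function name is our own):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return overall accuracy plus macro-averaged precision, recall, and F1."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "macro_f1": f1}
```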

All experiments were performed with ten independent runs, and the mean and standard deviation values of the results were recorded. For visualization requirements, a stacked (summed) confusion matrix was presented as a summary.

Results and discussion

Any CNN architecture can be used as the deep feature extractor. Considering the ultimate goal of deploying the whole algorithm on an embedded terminal, the computational cost of the CNN model should be minimized. Exploring the classification and transfer learning performance of deep features extracted by lightweight models therefore has greater academic and application value [71, 72]. Hence, several powerful and efficient lightweight CNN models are used as the backbone feature extractor in our research.

All the experiments were run on Windows 10 on a personal computer with an 8-core Intel Core i5-8500 3.00 GHz CPU, 16 GB of DDR4 SDRAM, and an NVIDIA GeForce GTX 1660 GPU with CUDA 10.1 and 6 GB of memory. For the implementation of the CNN models, we used PyTorch 1.9.0 [73] based on Python 3.6.13. In the model training phase, the training dataset was augmented by rotating the images by an angle randomly selected from {0, 90, 180, 270} degrees. The image samples were resized to 224 × 224; 80% of the dataset was used for training and the rest for validation. At this stage the test set is not used, since the goal is not to measure generalization; the validation set only serves to monitor the convergence of the model. The stochastic gradient descent (SGD) optimizer and the categorical cross-entropy loss function were utilized. After some preliminary experiments, a batch size of 32 and a learning rate of 0.0001 were used as hyper-parameters. Training was stopped and the parameters were frozen after 30 epochs.
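A minimal sketch of this training configuration is given below (PyTorch; the ImageNet initialization, the rotation transform implementation, and the directory names are our assumptions rather than details stated above):

```python
import random

import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms
from torchvision.transforms import functional as TF

class RandomRightAngleRotation:
    """Rotate a PIL image by an angle drawn from {0, 90, 180, 270} degrees."""
    def __call__(self, img):
        return TF.rotate(img, random.choice([0, 90, 180, 270]))

train_tf = transforms.Compose([
    RandomRightAngleRotation(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hybrid training set (see the sketch in the feature-extraction section)
source = datasets.ImageFolder("data/wllj_train", transform=train_tf)
target = datasets.ImageFolder("data/qtlj_train_10", transform=train_tf)
loader = DataLoader(ConcatDataset([source, target]), batch_size=32, shuffle=True)

# Four-grade classification head; starting from ImageNet weights is our assumption.
model = models.mobilenet_v2(pretrained=True)
model.classifier[1] = nn.Linear(model.last_channel, 4)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):                      # training stops after 30 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```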

For the image pre-processing and implementation of the multiclass TrAdaBoost algorithm, the main supporting library includes OpenCV-Python 4.5, Numpy 1.19.2, and Scikit-learn 0.24.1. For the base learner in multiclass TrAdaBoost, some preliminary experiments were carried out to select the optimal hyper-parameters with the help of the Scikit-learn library. The best results are taken as the final results.

Performance of the CNN models with different training datasets

In this experiment, the CNN models were trained either with the source domain only or with samples from both the source and target domains. To verify the effect of the proposed method with minimal labeled samples in the target domains, only ten samples per class in the target domains are used for training. The remaining 90 samples per class are used as the test set to evaluate the generalization ability of the models, that is, their transfer learning ability.

Different classification models and training strategies are shown in Table 2. We introduced four common and powerful lightweight network architectures: MobileNet V2 [74], MobileNet V3-large [75], MnasNet [76], and ShuffleNet V2 [77]. We did not build CNN models using the target domain alone because it contains too few labeled images with ground truth. As a comparison, color and texture features based on the color histogram and the gray-level co-occurrence matrix (GLCM) were extracted, and an SVM classifier, which is suitable for low-shot classification problems, was trained and validated. More details and parameter settings of the comparison method can be found in [78]. The overall accuracy and F1 score of the different methods and training strategies are also shown in Table 2, with the best results in bold.
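The sketch below illustrates one way such hand-engineered color and texture features can be computed; it does not reproduce the exact feature set of [78], and the bin counts, GLCM distances, angles, and properties are our assumptions:

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops   # scikit-image >= 0.19

def color_texture_features(bgr_img: np.ndarray) -> np.ndarray:
    """Concatenate a normalized per-channel color histogram with GLCM statistics."""
    # Color histogram: 32 bins per BGR channel, normalized to sum to one
    hist = np.concatenate(
        [cv2.calcHist([bgr_img], [c], None, [32], [0, 256]).ravel() for c in range(3)])
    hist /= hist.sum() + 1e-12
    # GLCM texture statistics computed on the gray-scale image
    gray = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
    glcm = graycomatrix(gray, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    texture = np.concatenate(
        [graycoprops(glcm, p).ravel()
         for p in ("contrast", "homogeneity", "energy", "correlation")])
    return np.concatenate([hist, texture])
```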

Table 2 The overall accuracy and F1 score for different methods and training strategies

It can be seen that hand-engineered features combined with the SVM classifier are not satisfactory. Because there are too few labeled images with ground truth, the accuracy of the SVM classifier trained only on the target domain datasets is only 78.2% and 74.3%, respectively. If the source domain dataset is directly added as a training supplement, it degrades the generalization of the SVM classifier and yields even worse results. Using a lightweight CNN instead of hand-engineered feature extraction greatly improves the classification accuracy. The classification results of the different CNN models show that when the model is trained with the hybrid datasets, the classification accuracy is much higher than that of the model trained only with source domain data: the capacity of the CNN models can store the classification information from both the source domain and the target domain without damaging generalization. Among the four lightweight CNNs, MobileNet V2 achieves the best overall accuracy and F1 scores. The precision and recall values of every single class are shown in Figs. 4 and 5. Considering the visualization effect, error bars are not displayed in the single-class precision and recall histograms. The overall accuracy values obtained on the Qtlj and Yzlj datasets are 83.3% and 79.5%, respectively. For the MobileNet V2 model trained only with the source domain dataset, the classification task on the target domain is a kind of zero-shot learning (ZSL) problem. Although the overall accuracy is only 64.9% and 55.5% on the two target domains, these values are much higher than the accuracy of random selection (25%), which shows the transfer learning ability of the deep neural network itself. In summary, a CNN model trained with the hybrid datasets from the source and target domains can account for the difference in data distribution between domains and has a robust feature extraction capability.

Fig. 4
figure 4

The precision values of every single class for MobileNet V2 with different training strategies

Fig. 5
figure 5

The recall values of every single class for MobileNet V2 with different training strategies

Identification results of deep features and multiclass TrAdaBoost

In this experiment, the performance of the proposed multiclass TrAdaBoost was evaluated. Conventional AdaBoost was also evaluated to examine the transfer learning ability of the multiclass TrAdaBoost. Considering that the MobileNet V2 model achieved the best results in Sect. “Performance of the CNN models with different training datasets”, it was used as the feature extractor. We slightly modified the MobileNet V2 model to adapt it to our vision task: the classifier of the trained model was removed, and the remaining part was used as a feature extractor for the image samples. The extracted feature dimension is 1280, and the features were used as the input of the subsequent classifiers. As a comparison, color and texture features (the same as those in Sect. “Performance of the CNN models with different training datasets”) were also extracted. We fine-tuned the hyper-parameter settings on the basis of [59, 79]. For both boosting classifiers, the base learner was set to a decision tree (DT) with a maximum depth of 2, the number of trees was set to 50, and the learning rate was set to 1. The overall accuracy and F1 score of the different features combined with AdaBoost and multiclass TrAdaBoost are shown in Table 3. The precision and recall values of every single class for the MobileNet-based methods are shown in Figs. 6 and 7, and the best results are in bold. Considering the visualization effect, error bars are not displayed in the single-class precision and recall histograms.
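One way to obtain these 1280-dimensional deep features is sketched below (the checkpoint name is hypothetical, and replacing the classifier with an identity mapping is our implementation choice rather than a detail stated above):

```python
import torch
import torch.nn as nn
from torchvision import models

# Rebuild the architecture with the 4-class head used during training, then
# drop the classifier so the forward pass returns the pooled 1280-d features.
backbone = models.mobilenet_v2()
backbone.classifier[1] = nn.Linear(backbone.last_channel, 4)
# backbone.load_state_dict(torch.load("mobilenetv2_hybrid.pth"))  # hypothetical checkpoint
backbone.classifier = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (B, 3, 224, 224) image tensor -> (B, 1280) feature vectors."""
    return backbone(batch)
```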

Table 3 The overall accuracy and F1 score of the different features combined with AdaBoost and multiclass TrAdaBoost
Fig. 6
figure 6

The precision values of every single class for MobileNet V2 features with AdaBoost and multiclass TrAdaBoost

Fig. 7
figure 7

The recall values of every single class for MobileNet V2 features with AdaBoost and multiclass TrAdaBoost

It should be noted that training the MobileNet V2 model and training boosting-based classifiers are two separate steps. The ‘training data’ in Table 3 refers to the dataset composition used to build the MobileNet V2 model (the same below), while the training datasets for training the boosting-based classifiers are the combined dataset, which consists of the source domain dataset and target domain dataset (same as \(src + tar\)). At the same time, the samples used for testing are only used to evaluate the algorithm, and they are guaranteed not to participate in any training steps.

The comparative results show that regardless of which trained MobileNet V2 model is used as the feature extractor, the multiclass TrAdaBoost significantly outperforms the conventional AdaBoost algorithm in both overall and single-class performance. The transfer learning ability of the multiclass TrAdaBoost is thus fully reflected and verified. Compared with the SVM classifier without transfer learning ability in Table 2, using color and texture features combined with the proposed multiclass TrAdaBoost algorithm improves the classification accuracy. At the same time, the accuracy obtained using color and texture features is still far lower than that obtained using deep transfer learning, which demonstrates that the deep learning model can overcome the differences between domains to a certain extent and extract domain-invariant deep features. It is worth noting that, compared with solely using the MobileNet V2 model (shown in Table 2), the overall performance was improved after adding the multiclass TrAdaBoost algorithm: the best classification results increased from 83.3% and 79.5% to 88.1% and 83.3%, respectively. It can be concluded that the transfer learning ability of the proposed instance-based deep transfer learning method comes from both the CNN model trained with the hybrid dataset and the multiclass TrAdaBoost, and that the two provide positive effects in synergy.

Identification results of multiclass TrAdaBoost combined with different classifiers

In ensemble learning methods, the classification performance of the ensemble depends not only on the integration strategy but also on the ability of the base learner [80]. As a boosting-based algorithm, the multiclass TrAdaBoost proposed in this paper has similar characteristics. Some scholars have found that the performance of TrAdaBoost fused with different base learners varies considerably on some classification tasks [59, 81]. Hence, it is necessary to investigate the impact of different base learners on the algorithm's performance to obtain optimal results. In addition to the DT mentioned above, three other standard and powerful machine learning classifiers, naïve Bayes (NB), logistic regression (LR), and SVM with a linear kernel, are used as base learners for the multiclass TrAdaBoost. For the boosting algorithm, we use the same settings as in Sect. “Identification results of deep features and multiclass TrAdaBoost”. Different base learners correspond to different hyper-parameter settings, tuned by grid search: for DT, the maximum depth is set to 2; for linear SVM and LR, the penalty coefficient C is set to 0.5; and NB has no hyper-parameters to set. The MobileNet V2 model trained with the hybrid datasets is used as the feature extractor, and the other related parameter settings are the same as those in Sect. “Identification results of deep features and multiclass TrAdaBoost”. The overall accuracy and F1 score of every base learner are shown in Table 4. The precision and recall values of every single class are shown in Figs. 8 and 9, and the best results are in bold. Considering the visualization effect, error bars are not displayed in the single-class precision and recall histograms.
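As an illustration, the four base learners can be plugged into the multiclass TrAdaBoost sketch from Sect. “Multiclass TrAdaBoost” as follows (LinearSVC is used here as one possible linear-kernel SVM; the hyper-parameters follow the settings listed above):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

base_learners = {
    "DT": DecisionTreeClassifier(max_depth=2),
    "NB": GaussianNB(),
    "LR": LogisticRegression(C=0.5, max_iter=1000),
    "SVM": LinearSVC(C=0.5),
}

# Each candidate is passed to the multiclass TrAdaBoost sketch as its base
# learner; Xs/ys and Xt/yt denote the deep features and labels of the two domains.
# for name, base in base_learners.items():
#     learners, alphas = multiclass_tradaboost(Xs, ys, Xt, yt, base=base)
```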

Table 4 The overall accuracy and F1 score of multiclass TrAdaBoost with different base learners
Fig. 8
figure 8

The precision values of every single class for multiclass TrAdaBoost with different base learners

Fig. 9
figure 9

The recall values of every single class for multiclass TrAdaBoost with different base learners

The results of multiclass TrAdaBoost with the different base learners vary considerably. The transfer learning ability of the CNN deep features and the multiclass TrAdaBoost has been proven above, and the extracted features should be informative and discriminative. Therefore, the variation in the results likely lies in the match between the base learner and the features. The single-class and overall performance of multiclass TrAdaBoost with the NB learner is much lower than that of the other methods. NB is a classification model with a simple mechanism and generally sound performance, but it requires strong independence among features, which is difficult to satisfy for the structured, high-dimensional deep features extracted from complex CNN models [82]. LR has a mechanism similar to that of NB but does not depend heavily on feature independence; hence, multiclass TrAdaBoost with LR achieves better results, close to those of multiclass TrAdaBoost with DT. Multiclass TrAdaBoost with the linear SVM learner obtains the best overall and single-class results and outperforms the other methods. SVM performs strongly on high-dimensional, low-shot data [83], which suits our task.

Comparison results summary

Finally, the performance of all the methods mentioned in Sect. “Results and discussion” is compared. As shown in Fig. 10, the F1 score, which reasonably represents overall classification performance, is selected to compare the different methods. To match the data in Tables 2–4, error bars are shown in Fig. 10. The global highest overall F1 scores on the two target domains, Qtlj and Yzlj, are 93.6% and 91.5%, respectively. The corresponding method is the MobileNet V2 feature extractor trained on the hybrid datasets combined with the multiclass TrAdaBoost with the linear SVM learner. The stacked confusion matrix of the test set for the best-performing method is shown in Fig. 11. The lowest recall values of the best method on the two target domains both appear on Grade 2, at 89.0% and 85.6%, respectively; the recall values of the other single classes are all above 93%. The misjudgments are mainly due to the high degree of similarity: the appearance of Grade 2 in Qtlj and Yzlj is very similar to that of the other grades, and even with the naked eye it is difficult to distinguish the difference, which poses significant challenges to the algorithm. In summary, we can conclude that with the help of image samples of Longjing tea from other geographical origins, the proposed instance-based deep transfer learning method can accurately identify the quality of Longjing tea in the current geographical origin with limited samples.

Fig. 10
figure 10

The overall F1 score of all methods

Fig. 11
figure 11

The stacked confusion matrix of the test set with the best performance method (MobileNet V2 features and the multiclass TrAdaBoost with linear SVM learner). Left: Qtlj dataset; Right: Yzlj dataset

Efficiency and performance are equally important from an application perspective, so the inference time of the different methods should be taken into account. The main processes that affect inference efficiency are feature extraction (hand-engineered features or deep features) and classification (SVM, AdaBoost, and multiclass TrAdaBoost). For a fair comparison, the inference process is run on the CPU, and the single-image inference time of each method is shown in Table 5.

Table 5 The comparison results of processing time

Most of the time is spent in the feature extraction process: manually extracting the color and texture features of a single image takes about 59 ms. It is worth noting that extracting the deep features with the MobileNet V2 backbone takes only about 7 ms more, so there is no obvious increase in time overhead while the performance is greatly improved (refer to Fig. 10). This again shows the superiority of the lightweight MobileNet V2 architecture, which exploits the advantages of deep neural networks while ensuring good real-time performance. The weight update strategy of the proposed multiclass TrAdaBoost is highly similar to that of the SAMME-based AdaBoost classifier, so their time consumption is basically the same. In addition, because the base learners differ in complexity, combining them with multiclass TrAdaBoost also brings subtle differences in time consumption. From the experimental results, the difference for a single image is only 0.1–0.2 ms, which is basically negligible. Therefore, the best-performing method (the MobileNet V2 feature extractor trained with the hybrid datasets and the multiclass TrAdaBoost with the linear SVM learner) maintains good real-time performance.
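For completeness, per-image CPU inference time can be measured with a simple loop such as the one below (the timing code is our own sketch, not the authors' benchmarking script):

```python
import time

import torch

def time_single_image(model, n_repeats: int = 100) -> float:
    """Average CPU inference time per image in milliseconds."""
    x = torch.randn(1, 3, 224, 224)            # one ROI-sized input
    model = model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_repeats):
            _ = model(x)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_repeats
```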

Conclusion

Automatic identification of Longjing tea quality is of great significance to consumers and proprietors. In this paper, a novel instance-based deep transfer learning method for Longjing tea quality identification was proposed. MobileNet V2 was modified to match our vision task and trained using the hybrid training dataset containing all labeled samples from source and target domains. The trained model was used as a feature extractor. Then deep features from different domains were extracted by the trained MobileNet V2 model and imported into the proposed multiclass TrAdaBoost algorithm to build a classification model. To validate the proposed method, three Longjing tea quality datasets from three different geographical origins were collected, one of which contains many labeled image samples, and the other two have very limited labeled image samples. The main results are as follows:

  1. (1)

    The MobileNet V2 model is trained using the hybrid training dataset containing all labeled samples from source and target domains. The trained MobileNet V2 model is used as a feature extractor instead of directly using the pre-trained model. Compared with traditional image processing combined with pattern recognition methods and other lightweight CNN models, the MobileNet V2 model trained with the hybrid dataset from the source and target domains has better identification results. The CNN feature extractor can extract high-level features and maintain transferability.

  2. (2)

    Referencing the reweighting idea and SAMME multiclass classification strategy, an instance-based transfer learning algorithm named multiclass TrAdaBoost is proposed. The proposed algorithm can adapt to multiclass classification tasks, has lower computational complexity, and avoids data imbalance. When combined with the deep features extracted from MobileNet V2 model, the overall accuracy values of Qtlj and Yzlj reached 88.3% and 83.1%, respectively. The results show that deep features combined with multiclass TrAdaBoost can achieve a great transfer learning effect on different target domain datasets.

  3. (3)

    The effect of the base learner on the performance of the multiclass TrAdaBoost is also explored. The experimental results demonstrated that the deep features extracted from MobileNet V2 model combined with the multiclass TrAdaBoost with SVM learner obtain 93.6% and 91.5% accuracy for two target domain datasets, which outperforms all other methods and achieves the global optimal identification results. In addition, real-time performance is also well maintained.

With the addition of the deep features, the instance-based multiclass TrAdaBoost shows strong performance and constitutes an instance-based deep transfer learning method. In summary, we can conclude that with the help of image samples of Longjing tea from other geographical origins, the proposed instance-based deep transfer learning method can accurately identify the quality of Longjing tea in the current geographical origin with limited samples. This transfer learning method would substantially shorten the data-collecting time and save human resources. It also provides a reliable vision-based tea quality identification method for relevant personnel, even if the appearance of tea between different grades is very similar.

In future work, we plan to expand the transfer learning-based tea quality identification method. To improve the transfer learning ability, appropriate distance measures can be introduced to minimize the domain distance and achieve higher-order transfer learning (e.g., feature-based transfer learning). From the perspective of the application, we will try to embed the algorithm in a mobile end device for further testing.