A cross-domain fruit classification method based on lightweight attention networks and unsupervised domain adaptation

Image-based fruit classification offers many useful applications in industrial production and daily life, such as self-checkout in the supermarket, automatic fruit sorting and dietary guidance. However, the data distribution of a fruit classification task varies with the application scenario. One feasible solution is domain adaptation, which transfers knowledge from the original training data (source domain) to the new testing data (target domain). In this paper, we propose a novel deep learning-based unsupervised domain adaptation method for cross-domain fruit classification. A hybrid attention module is proposed and added to MobileNet V3 to construct HAM-MobileNet, which can suppress the impact of complex backgrounds and extract more discriminative features. A hybrid loss function combining subdomain alignment and implicit distribution metrics is used to reduce domain discrepancy during model training and improve classification performance. Two fruit classification datasets covering several domains are established to simulate common industrial and daily life application scenarios. We validate the proposed method on our constructed grape classification dataset and general fruit classification dataset. The experimental results show that the proposed method achieves average accuracies of 95.0% and 93.2% on the two datasets, respectively. After domain adaptation, the classification model can largely overcome the domain discrepancy introduced by different fruit classification scenarios. Meanwhile, the proposed datasets and method can serve as a benchmark for future cross-domain fruit classification research.


Jin Wang: dwjcom@zju.edu.cn

Introduction
In the food industry, fruits and vegetables are a major part of fresh produce. Accurate fruit classification is the basis for maximizing economic benefits [1]. In industrial production and daily life, automated image-based fruit classification has many applications, such as self-checkout in the supermarket, automatic fruit packing and transportation in the factory, and dietary guidance in daily life [2]. Hence, fruit classification based on computer vision is a hot topic not only in academic research but also in applications. Early work mainly focused on using image-processing techniques to extract handcrafted features such as color, shape, and texture, and then feeding them into a machine learning classifier [3][4][5]. However, the accuracy of traditional image-processing methods depends heavily on the quality of the handcrafted features. Normally, these features are tailored to each particular problem and lack generalization.
Deep learning provides end-to-end solutions for computer vision tasks and is a highly automated method [6]. Several deep learning methods based on convolutional neural networks (CNNs) have been proposed for fruit classification [7][8][9], and most of them have proven far more effective than traditional methods based on handcrafted features. However, one of the main drawbacks of deep learning is the need for massive amounts of annotated data. In addition, as a statistical learning method, deep learning achieves excellent performance only on the premise that the training and testing datasets are independent and identically distributed. While a trained model is in service, the scene is often not static because of changes in illumination, background, pose, etc., so this assumption is easily violated. These variations, known as the domain shift problem, severely degrade performance when a model trained on the initial images (the source domain) is used to predict the classes of new images captured after such changes (the target domain) [10,11]. Relabeling large numbers of images in each new domain is time-consuming and laborious, and sometimes unrealistic. Hence, domain adaptation can be adopted to facilitate tasks in the new target domain using labeled data from the source domain, which can efficiently improve the robustness and universality of classification models [12]. Domain adaptation can alleviate the domain shift caused by illumination and background changes, thereby reducing the degradation of model performance.
Learning a discriminative model in the presence of a shift between the training and test data distributions is known as domain adaptation or transfer learning [10,13]. Unsupervised domain adaptation (UDA) is a branch of domain adaptation in which the labels of all target domain samples are unavailable. This is also the most common and challenging situation in applications. UDA algorithms have achieved excellent results in some classification tasks, such as object recognition [14,15], fault diagnosis [16,17] and medical image diagnosis [18,19]. The application of UDA algorithms in the agro-food field is still in its infancy and there is little related research. Marino et al. proposed an unsupervised adversarial deep domain adaptation method for potato defect classification. Their experimental results show that domain adaptation is essential: the average F1-score rises from 0.46 without adaptation to 0.84 with the proposed method [20]. Li et al. proposed an unsupervised domain adaptation method for in-field cotton boll status identification. The method was validated on a self-constructed in-field cotton boll dataset with 1600 images, and extensive experiments show that it outperforms other state-of-the-art methods in terms of identification accuracy [21]. Zhao et al. introduced domain adaptive object detection into the aquaculture field to improve the cross-domain robustness of fish detection. Compared with the original Faster RCNN and the domain adaptation model DA-Faster, the proposed method not only saves the cost of manual annotation but also effectively improves detection performance on the unlabeled target domain [22].
Unlike general object recognition and classification, fruit classification presents significant challenges due to inter-class similarities and irregular intra-class characteristics [1]. When applying UDA to fruit classification problems, these fine-grained characteristics need to be considered. To the best of our knowledge, whether at the dataset, algorithm or application level, research on the application of UDA to fruit classification is still lacking. Therefore, it is both interesting and necessary to explore UDA methods for fruit classification.
Ideally, one desires the same low error rates when reapplying models derived from a previous source domain to a new, unlabeled target domain, often referred to as domain adaptation or cross-domain learning [23,24]. In this paper, we address the domain shift problem in fruit classification, extending the fruit classification task from a single domain to multiple domains. A cross-domain fruit classification method based on a lightweight attention network and feature-based unsupervised domain adaptation is proposed. A lightweight attention network based on the MobileNet V3 architecture is designed and used as the backbone. In this architecture, a hybrid attention module (HAM) replaces the squeeze-and-excitation block (SE block) of the original MobileNet V3. A hybrid loss function that combines explicit and implicit minimization of the feature discrepancy between the source domain and the target domain is proposed for model construction. Experiments verify the effectiveness of our proposed UDA method, demonstrate its superiority over state-of-the-art competitors and report highly accurate classification performance. The main contributions are as follows:
• Based on common fruit classification scenarios in industrial production and daily life, two cross-domain fruit classification datasets containing different domains are constructed. The first is a grape classification dataset, containing images of seven different varieties of grapes collected in four scenarios, constituting the four domains of controlled environment, illumination variation, background variation and Internet. The other is a general fruit classification dataset, containing 11 different fruits collected under two domains. We hope that they can serve as benchmarks for future research work (the datasets are available here).
• A hybrid attention module is proposed and added to MobileNet V3 to construct HAM-MobileNet, which can suppress the impact of complex backgrounds and extract more discriminative features. The resulting HAM contains both the channel attention mechanism and the spatial attention mechanism, and does not add any parameters compared to the SE block.
• A hybrid loss function combining subdomain alignment and implicit distribution metrics is used to reduce domain discrepancy during model training and improve model classification performance. The proposed loss function is well suited to the fine-grained characteristics of fruit classification.
• Extensive experiments are carried out to validate the effectiveness and superiority of the proposed method. The results show that the proposed UDA model can effectively resist the domain discrepancy caused by various interferences such as illumination and background, and achieve accurate cross-domain fruit classification.
By implementing the UDA method, the fruit classification model can adapt well to the common illumination and background changes in industrial and daily life scenes without additional manual annotation. Meanwhile, when faced with tasks in specific scenarios (such as laboratories, production lines and supermarket self-checkouts), the UDA classification model can be built with the help of public datasets or labeled images collected on the Internet, without having to manually annotate a large number of images from the specific scenes. This improves the robustness of industrial fruit classification models and saves substantial labor costs.
The remainder of this paper is structured as follows. "Related works" introduces the work related to our study. "Materials and methods" describes in detail the datasets used and the methods proposed in this paper. "Results and discussions" presents the experimental results and discussions. "Conclusions" summarizes the conclusions.

Related works
Convolutional neural networks
Convolutional neural networks are multilayered neural networks that are able to learn task-specific invariant features in a hierarchical manner [6]. Typical CNNs usually consist of convolutional layers, pooling layers and fully connected layers. As an important component of feature extraction, the convolutional layer performs feature mapping layer by layer with the help of a series of learnable filters. The pooling layer applies a preset pooling function that replaces the value at a single point in the feature map with a statistic of its neighboring region, thereby realizing feature selection and information filtering. Finally, the fully connected layer maps the feature space calculated by the previous layers to the label space. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the most sought-after and authoritative academic competitions in machine vision in recent years. The winning network architectures of the competition, including AlexNet [25], VGGNet [26], GoogLeNet [27] and ResNet [28], have become the mainstream architectures for various visual tasks. These powerful network architectures also perform well on fruit recognition tasks [29].
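To make the role of the pooling layer concrete, the following minimal sketch applies 2 × 2 max pooling with stride 2 to a small feature map. The function name, the choice of max pooling and the toy values are illustrative assumptions, not part of the proposed method.

```python
def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 on a 2-D feature map (list of rows)."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


# Toy 4 x 4 feature map (hypothetical values).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

Each output value summarizes one 2 × 2 neighborhood, so the map shrinks from 4 × 4 to 2 × 2 while the strongest responses are kept.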
For engineering applications, accuracy and efficiency are equally important. Designing CNN architectures for the optimal trade-off between accuracy and efficiency has been an active research area in recent years. SqueezeNet [30] extensively uses 1 × 1 convolutions with squeeze-and-expand modules, primarily focusing on reducing the number of parameters. MobileNets [31] are a series of lightweight CNN architectures based on depthwise separable convolutions. MobileNet V2 [32] proposes the inverted residual block, and MobileNet V3 [33] further utilizes neural architecture search (NAS), achieving better performance with lower latency. ShuffleNet [34] introduces a channel shuffle operation to improve the information flow between channel groups, and ShuffleNet V2 [35] further considers the actual speed on target hardware for compact model design. MixNets [36] introduce a novel mixed depthwise convolution and obtain excellent transfer learning performance on different datasets. In recent years, lightweight variants of the Vision Transformer (ViT), such as Mobile-ViT [37], have also begun to appear, providing new ideas for lightweight network architecture design.
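The parameter saving behind depthwise separable convolutions can be checked with simple arithmetic. The sketch below counts weights (biases omitted) for a standard k × k convolution and for its depthwise separable counterpart; the layer sizes are hypothetical and chosen only for illustration.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (hypothetical, not taken from the paper).
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(sep / std, 3))
```

For these sizes the ratio equals 1/c_out + 1/k², i.e. the separable layer needs roughly one ninth of the weights of the standard layer when k = 3.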
For the specific task of fruit classification, Duong et al. [7] explored the performance of two lightweight models MixNet and EfficientNet [38] on the Fruits-360 fruit classification dataset, and the results showed that both models achieved excellent accuracy. However, Duong et al. [7] did not specify specific fruit classification task scenarios (such as supermarkets, production lines, etc.), nor did they consider the fine-grained characteristics of fruits to make targeted improvements. For a more sophisticated and complex situation, directly applying the existing network architecture may not be the optimal solution.

Attention mechanism
Inspired by the human visual system, methods that direct attention to the most important regions of an image and disregard irrelevant parts are called attention mechanisms [39]. After several years of development, attention mechanisms can be classified into the following categories. Channel attention methods generate an attention mask across the channel domain and use it to select important channels; representative works include the aforementioned SENet [40], the Style-based Recalibration Module (SRM) [41] and Efficient Channel Attention (ECANet) [42]. Spatial attention methods generate an attention mask across the spatial domain and use it to select important spatial regions; representative works include Spatial Transformer Networks (STN) [43] and Gather-excite Networks (GENet) [44]. Channel and spatial attention methods, such as the convolutional block attention module (CBAM) [45] and coordinate attention [46], combine the advantages of channel attention and spatial attention and produce a 3-D attention map. Some other methods focus on temporal attention and branch attention [47,48]. Attention mechanisms have greatly improved the performance of convolutional neural networks on various visual tasks. In recent years, vision transformers based on self-attention mechanisms have attracted attention, greatly increasing the diversity of computer vision [49].
Attention mechanisms are often embodied as plug-and-play attention modules that can refine convolutional outputs within a block and enable the whole network to learn more informative features [50]. Some mature CNN architectures have integrated attention modules, such as the SE module added to MobileNet V3, which makes it perform better than MobileNet V1 and MobileNet V2 [33]. In the agro-food field, more and more researchers add attention modules to their self-designed network architectures to deal with challenges such as complex backgrounds and inter-class similarity, for example in leaf disease detection [51][52][53][54] and pest detection [55]. However, attention modules are usually built from a series of complex design choices, e.g., the choice of pooling. These complex operations introduce additional parameters and computational cost, which makes them unfriendly to lightweight network architectures.

Unsupervised domain adaptation
Unsupervised domain adaptation refers to predicting the labels of samples drawn from a target domain, given labeled samples drawn from a source domain and unlabeled samples drawn from the target domain itself. The domain is defined, in this context, as different probability distributions p(x, y) over the same feature-label space pair X × Y , where X and Y represent the feature space and label space, respectively [56]. Early shallow methods include reweighting the training data from the source domain so that they can more closely reflect the data distribution in the target domain [57,58], and finding a transformation in a lower-dimensional manifold that draws the source and target subspaces closer. The latter is also summarized as a feature-based or discrepancy-based method, and well-known methods such as Transfer Component Analysis (TCA) [59], Joint Distribution Adaptation (JDA) [60], Geodesic Flow Kernel (GFK) [61], CORrelation ALignment (CORAL) [62], etc. belong to this category, and achieved good results.
Recent studies have shown that deep CNNs can learn more transferable features for domain adaptation by disentangling the explanatory factors of variation behind domains [63][64][65]. Deep learning has brought revolutionary advancements to domain adaptation. The authors in [66] integrated the maximum mean discrepancy (MMD) metric into a CNN for the first time, which can be regarded as an extension of the TCA method to deep learning. The proposed Deep Domain Confusion (DDC) architecture trained AlexNet [25] with a combination of a classification loss on source images and an MMD loss that minimizes the distance between the source and target domains. An adaptation layer called the "bottleneck adaptation layer" was also introduced. Long et al. [67] proposed a deep adaptation network (DAN) architecture, in which they applied multi-layer adaptation based on a multiple kernel variant of MMD (MK-MMD). Hidden representations of all task-specific layers in the AlexNet architecture are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. Considering that the direct application of the MMD metric can only adapt the marginal distribution of the samples, Long et al. [68] further proposed the joint maximum mean discrepancy (JMMD) metric and embedded it in a joint adaptation network (JAN) architecture. Based on the JMMD criterion, JAN can learn a transfer network by aligning the joint distributions of domain-specific layers across domains instead of aligning only the marginal distribution. In recent years, there have been more and more improvements based on the MMD metric, providing some novel feature matching and alignment methods [69][70][71].
In summary, we can identify a paradigm of deep learning-based UDA: CNN architectures are used as the backbones, and the classification loss and the domain discrepancy loss are combined to jointly train the backbone network and learn a discriminative representation. However, there are still several gaps in the application of UDA to fruit classification.
(1) Dataset construction. There are some public object recognition datasets that serve as recognized standards for evaluating UDA methods, such as Office-31 [61], VisDA [72] and DomainNet [73]. Nevertheless, there is a lack of UDA datasets for fruit classification tasks.
(2) Backbone architecture design. For the consistency of algorithm evaluation results, researchers often use several specific backbone networks, such as AlexNet, ResNet 50 and ResNet 101 [71]. For the application-oriented task of fruit classification, these backbone networks are large, and deploying them on mobile devices greatly increases the burden on the devices.
(3) Fine-grained characteristics. Unlike general object recognition tasks, images of different types of fruit are highly similar and have fine-grained characteristics, but current domain discrepancy measures do not consider this problem.

Materials and methods
In this section, the composition and construction basis of the datasets are introduced in detail. A detailed explanation of our proposed method is also given.

Dataset description
Fruit classification includes two categories of tasks: the classification of varieties of the same kind of fruit, such as the identification of different varieties of litchi [74], and the classification of different kinds of fruit, such as pears, apples, and bananas [2]. Both types of classification tasks are easily affected by small differences in appearance between varieties and by large changes in background and illumination. Hence, we build two datasets: a grape classification dataset of different varieties and a general fruit classification dataset. The grape classification dataset is self-constructed and contains seven common late-ripening varieties of grapes: Crimson Seedless, Manicure Finger, Zuijinxiang, Munag, Summer Black, Red Globe and Shine Muscat. According to common fruit classification scenarios, the dataset contains four domains: controlled environment, illumination variation, background variation and Internet. The details of the grape classification dataset are listed in Table 1.
• Domain-controlled environment (Domain-CE) includes the grape images collected in the controlled environment to simulate fruit classification in an automatic fruit sorting system or in a laboratory. In general, the image acquisition and lighting equipment of the fruit sorting system or laboratory are in fixed positions, so the obtained images are less affected by the background and lighting [75][76][77][78][79]. In this domain, we captured images of different varieties of grapes with a whiteboard background using a stationary camera and light source. The images in this domain have consistent background and lighting, the original image resolution is 4032 × 3024, and all the images are saved in Joint Photographic Experts Group (JPEG) format. Some typical samples in the Domain-CE are shown in Fig. 1a.
• Domain-illumination variation (Domain-IV) includes the grape images collected at different light intensities and illumination angles to simulate changes in light in a controlled environment. While a model is in service, illumination changes are very common, and it is difficult to keep light conditions stable over the long term. In this domain, we removed the fixed light source in the controlled environment of the Domain-CE and cast light from four angles instead. To add variety, some underexposed and overexposed samples were also added. Compared with artificially increasing the brightness of raw images [20], our scheme for constructing illumination variation is closer to the real situation [80]. The original image resolution is 4032 × 3024, and all the images are saved in JPEG format. Some typical samples in the Domain-IV are shown in Fig. 1b.
• Domain-background variation (Domain-BV) includes the grape images collected in different backgrounds to simulate background variations common in fruit classification systems. Unlike the controlled environments that appear on the production line or in the laboratory, the fruit classification system used on the user side often faces more diverse background variations, such as fruit classification on mobile devices or in self-checkout systems in supermarkets. In this domain, we selected some backgrounds that are commonly seen on the user side, including grapes placed on the table, on a plate, in a transparent box, in a carton, and on the hand. All images were captured in natural light, with no deliberate light source set, so as to be closer to the real situation. All images were captured using an iPhone 12 mobile device with a resolution of 4032 × 3024 and saved in JPEG format. Some typical samples in the Domain-BV are shown in Fig. 1c.
• Domain-Internet (Domain-IN) includes the grape images collected from the Internet to simulate a more general fruit classification scenario. Both developers and users expect the fruit classification system to resist as much interference as possible and work in more general scenarios. In this domain, we collected grape images from the Baidu and Bing search engines. The collected images vary widely and may be subject to various interferences such as background, lighting, viewpoint, and text and object occlusion. The position of the target object (grapes) also varies and is not centered as in the other three domains. The resolution of the images varies from 220 × 220 to 4032 × 3024, and all images were saved in JPEG format. Some typical samples in the Domain-IN are shown in Fig. 1d.
The grape classification dataset contains about 100 images per class in each domain. Therefore, each domain contains about 700 grape images, for 2810 images in the entire dataset. According to the definition of UDA, each domain can be used as a labeled source domain as the training set to build
The general fruit classification dataset consists of images of 11 common fruits from public datasets: Apple, Kiwifruit, Lime, Nectarine, Onion, Orange, Pear, Plum, Potato and Watermelon. Currently, some public datasets for fruit classification have been released, and our images are mainly from the public VegFru dataset [81] and Supermarket Produce Dataset [3]. Hence, two domains are available in this dataset: VegFru and Supermarket Produce Dataset. The details of the general fruit classification dataset are listed in Table 2.
• Domain-VegFru contains fruit images from the VegFru dataset. VegFru is a large-scale and novel domain-specific dataset that contains 200 classes of vegetables and 92 classes of fruit. The dataset contains in total more than 160,000 images retrieved by different search engines, and it categorizes vegetables and fruit according to their eating characteristics [81]. Therefore, this dataset is extremely diverse. Domain-VegFru is a subset of the VegFru dataset, for which we selected the classes common to VegFru and Supermarket Produce Dataset. Since the two datasets define classes differently, we modified and merged some classes. For example, the VegFru dataset has three classes containing apples (green apple, wax apple and apple), which are merged into one class in our dataset: apple. The number of images in each class ranges from 400 to 2500, for a total of 15,124 images. Some typical samples in the Domain-VegFru are shown in Fig. 2a.
• Domain-Supermarket Produce Dataset (Domain-SPD) includes fruit images from Supermarket Produce Dataset, which comprises 15 different categories and 2633 images in total. The Supermarket Produce Dataset is the result of 5 months of on-site collecting in a local fruit and vegetable distribution center. The images were captured on a clear background at a resolution of 1024 × 768, using a Canon PowerShot P1 camera. The images were gathered at various times of the day and on different days for the same class, which increases the dataset variability and represents a more realistic scenario [3]. Therefore, Supermarket Produce Dataset is a dataset collected in a semi-controlled environment with a well-defined application scenario (supermarket cashier). Domain-SPD is a subset of Supermarket Produce Dataset, and we applied the same screening and merging operations to this domain as to the Domain-VegFru. The number of images in each class ranges from 75 to 383, for a total of 2278 images. Some typical samples in the Domain-SPD are shown in Fig. 2b.
Similar to the domain adaptation task in the grape classification dataset, two source-target pairs can be constructed: VegFru-SPD and SPD-VegFru. Practically, fruit classification systems generally serve specific application scenarios, such as supermarket self-checkout in Domain-SPD. Therefore, taking Domain-VegFru as the source domain and Domain-SPD as the target domain has more research and application value. The source-target pair of VegFru-SPD will be used for subsequent algorithm validation.
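The construction of source-target pairs can be sketched as follows, assuming that every ordered pair of distinct domains is a valid transfer task. The short domain names and the helper `transfer_tasks` are hypothetical and used only for illustration.

```python
from itertools import permutations

# Short domain names used for illustration.
grape_domains = ["CE", "IV", "BV", "IN"]
fruit_domains = ["VegFru", "SPD"]


def transfer_tasks(domains):
    """All ordered (source, target) pairs with source != target."""
    return [f"{s}-{t}" for s, t in permutations(domains, 2)]


grape_tasks = transfer_tasks(grape_domains)
fruit_tasks = transfer_tasks(fruit_domains)
print(len(grape_tasks), grape_tasks[:3])
print(fruit_tasks)
```

Under this assumption the two domains of the general fruit dataset yield exactly the VegFru-SPD and SPD-VegFru pairs, and four domains would yield twelve ordered pairs.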

Lightweight attention networks
As mentioned above, the backbone networks used for UDA are mainly AlexNet and ResNet, which have many parameters and require a large amount of computation. At the same time, these architectures are oriented toward general vision tasks, and there is no targeted optimization and improvement for fruit classification tasks with fine-grained characteristics. Hence, we build a lightweight attention network HAM-MobileNet, based on the advanced mainstream lightweight network architecture MobileNet V3.
MobileNet V3 inherits the depthwise separable convolution and point-wise convolution of MobileNet V1, and improves the inverted residual structure of MobileNet V2. MobileNet V3 also introduces the swish nonlinearity and the SE block into the basic block, and obtains the overall network architecture through NAS [33]. It is worth noting that by adding the SE block, the network's ability to perceive subtle features can be enhanced and noise interference can be suppressed without significantly increasing parameters and computational cost [33,40]. This idea fits the fine-grained characteristics of fruit classification, especially for highly variable domains such as Domain-IN and Domain-VegFru in our datasets. The main building block of MobileNet V3 is shown in Fig. 3.
The SE block only recalibrates channel-wise feature responses and enhances the network's attention to important channels. It does not further process the spatial distribution inside the feature map; however, some subtle differences in fruit images may be reflected in the spatial dimension, such as the subtle texture and contour differences between different types of fruit [82]. Hence, it is necessary to add a spatial attention mechanism to the network. Directly introducing spatial attention operations, as in CBAM [45], would add extra parameters and computational burden, destroying the lightweight properties of the network itself. Therefore, we design a hybrid attention module (HAM) that is friendly to lightweight networks, as shown in Fig. 4.
First, the input feature map is subjected to a squeeze-and-excitation operation to obtain a channel-weighted feature map. Overall, an SE block $F_{SE}$ (with parameter $\theta$), which takes $X_{SE}$ as input and outputs $Y_{SE}$, can be formulated as:

$$Y_{SE} = F_{SE}(X_{SE}, \theta) = X_{SE} \odot \mathrm{sigmoid}\big(W_2\,\mathrm{ReLU}(W_1\,\mathrm{GAP}(X_{SE}))\big)$$

where ReLU and sigmoid are activations, GAP is short for global average pooling [83], and $W_1$, $W_2$ are learnable kernel weights.
Then, a parameter-free attention module (PAM) is constructed after the SE block. Based on well-known neuroscience theory, the PAM is capable of directly estimating 3-D weights covering both the channel and spatial parts [84]. The authors in [84] propose that a neuron which shows suppression effects should be emphasized and define an energy function as:

$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$$

where $t$ is the target neuron, $\hat{\mu}$ is the mean of the input feature and $\hat{\sigma}^2$ is its variance. $\lambda$ is a hyper-parameter; following the settings in [84], $\lambda$ is set to 0.0001 in our experiments. A sigmoid function is used to control the output range of the attention vector. The whole refinement phase of PAM is:

$$Y_{PAM} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X_{PAM}$$

where $E$ groups all $e_t^{*}$ across the channel and spatial dimensions. It should be noted that the output of the SE block, $Y_{SE}$, is regarded as the input of the PAM module, $X_{PAM}$.
The resulting HAM contains both the channel attention mechanism and the spatial attention mechanism and does not add any parameters compared to the SE block.
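A minimal NumPy sketch of the HAM forward pass is given below. It assumes the standard SE formulation (global average pooling, two fully connected layers, sigmoid gating) and the closed-form minimal energy of [84], e* = 4(σ² + λ) / ((t − μ)² + 2σ² + 2λ), computed per channel. The reduction ratio, the random weights and the use of the population variance are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_block(x, w1, w2):
    """Channel attention: GAP -> FC -> ReLU -> FC -> sigmoid -> rescale.

    x: feature map of shape (C, H, W); w1: (C // r, C); w2: (C, C // r).
    """
    squeezed = x.mean(axis=(1, 2))             # global average pooling, shape (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # ReLU
    scale = sigmoid(w2 @ hidden)               # per-channel weights in (0, 1)
    return x * scale[:, None, None]


def pam(x, lam=1e-4):
    """Parameter-free 3-D attention based on the minimal energy of each neuron."""
    mu = x.mean(axis=(1, 2), keepdims=True)    # per-channel mean
    var = x.var(axis=(1, 2), keepdims=True)    # per-channel (population) variance
    energy = 4.0 * (var + lam) / ((x - mu) ** 2 + 2.0 * var + 2.0 * lam)
    return x * sigmoid(1.0 / energy)           # lower energy -> larger weight


def ham(x, w1, w2, lam=1e-4):
    """Hybrid attention: SE output is fed directly into the PAM."""
    return pam(se_block(x, w1, w2), lam)


# Toy input: 8 channels, 5 x 5 spatial size, reduction ratio r = 4 (hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))
w1 = rng.normal(size=(2, 8))
w2 = rng.normal(size=(8, 2))
y = ham(x, w1, w2)
print(y.shape)
```

The only learnable arrays are `w1` and `w2`, which belong to the SE part; `pam` computes its weights from feature statistics alone, which is why stacking it on the SE block leaves the parameter count unchanged.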
The HAM-MobileNet is based on MobileNet V3-large and uses the proposed HAM to replace the SE block. In theory, the HAM architecture can better extract subtle feature differences, suppress the impact of noise, and fit the fine-grained characteristics of fruit images. To adapt to our cross-domain classification task, we also adjust the tail of the architecture. The number of output channels of the first 1 × 1 convolutional layer after GAP is set to 256 instead of 1280 as in the original MobileNet V3-large. It has been shown that a lower dimensional layer can be used to regularize the training of the classifier on the source domain and prevent overfitting to the particular nuances of the source domain distribution [66]. Finally, the last 1 × 1 convolutional layer maps the learned distributed feature representations to the label space, and a Softmax layer implements classification. The building block and overall architecture of the proposed HAM-MobileNet are shown in Fig. 5.
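The effect of the narrower tail on model size can be estimated with a short calculation. The sketch below assumes that the pooled feature entering the tail has 960 channels, as in the published MobileNet V3-large design, that both 1 × 1 layers carry biases, and that the classifier has seven outputs as in the grape dataset; these values are assumptions for illustration.

```python
def conv1x1_params(c_in, c_out, bias=True):
    """Parameters of a 1 x 1 convolution applied to a pooled feature vector."""
    return c_in * c_out + (c_out if bias else 0)


def tail_params(c_pooled, c_bottleneck, n_classes):
    """Parameters of the two 1 x 1 layers that follow global average pooling."""
    return (conv1x1_params(c_pooled, c_bottleneck)
            + conv1x1_params(c_bottleneck, n_classes))


C_POOLED = 960   # assumption: channel width after GAP in MobileNet V3-large
N_CLASSES = 7    # seven grape varieties

original = tail_params(C_POOLED, 1280, N_CLASSES)
reduced = tail_params(C_POOLED, 256, N_CLASSES)
print(original, reduced)
```

Under these assumptions the tail shrinks from about 1.24 million to about 0.25 million parameters, in addition to the regularizing effect of the lower-dimensional layer.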

Hybrid loss function design
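Following the paradigm of deep learning-based UDA, the loss function is shown in Eq. (6):
$$L = L_{CL}(X_S, y) + \lambda L_{DL}(X_S, X_T) \tag{6}$$
where $L_{CL}(X_S, y)$ denotes the classification loss on the labeled source data $X_S$ with ground truth $y$, $L_{DL}(X_S, X_T)$ denotes the domain discrepancy loss between the source data $X_S$ and the target data $X_T$, and $\lambda$ ($\lambda > 0$) is the trade-off parameter between the two. The domain discrepancy loss is the focus of UDA. MMD is a common domain discrepancy metric, which defines the distance between two domain distributions through their mean embeddings in the reproducing kernel Hilbert space (RKHS). Let $\mathcal{H}$ be the RKHS with a characteristic kernel $k$, and let $p$ and $q$ be the source and target domain probability distributions; the formulation of the MMD is shown in Eq. (7):
$$d_{\mathcal{H}}(p, q) \triangleq \left\| E_p[\phi(x^s)] - E_q[\phi(x^t)] \right\|_{\mathcal{H}}^2 \tag{7}$$
where $x^s$ and $x^t$ are data samples from $X_S$ and $X_T$, and $\phi(\cdot)$ denotes the feature map that maps the original samples to the RKHS. As the MMD loss decreases, the feature representations of the source and target domains move closer together, enabling domain adaptation. In practice, an estimate of the MMD compares the squared distance between the empirical kernel mean embeddings:
$$\hat{d}_{\mathcal{H}}(p, q) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S}\phi(x_i^s) - \frac{1}{n_T}\sum_{j=1}^{n_T}\phi(x_j^t) \right\|_{\mathcal{H}}^2 = \frac{1}{n_S^2}\sum_{i=1}^{n_S}\sum_{j=1}^{n_S}k(x_i^s, x_j^s) + \frac{1}{n_T^2}\sum_{i=1}^{n_T}\sum_{j=1}^{n_T}k(x_i^t, x_j^t) - \frac{2}{n_S n_T}\sum_{i=1}^{n_S}\sum_{j=1}^{n_T}k(x_i^s, x_j^t) \tag{8}$$
where $n_S$ and $n_T$ are the numbers of source and target samples. The kernel is $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, where $\langle \cdot, \cdot \rangle$ represents the inner product of vectors.
However, the MMD metric mainly focuses on the alignment of the global source and target distributions and ignores the relationships between subdomains of the same category. The local maximum mean discrepancy (LMMD) instead aligns the distributions of each category, which for class $c$ are $p^{(c)}$ and $q^{(c)}$ in the source and target domains. The formulation of the LMMD and its estimator are:
$$d_{\mathcal{H}}(p, q) \triangleq E_c \left\| E_{p^{(c)}}[\phi(x^s)] - E_{q^{(c)}}[\phi(x^t)] \right\|_{\mathcal{H}}^2 \tag{9}$$
$$\hat{d}_{\mathcal{H}}(p, q) = \frac{1}{C}\sum_{c=1}^{C}\left\| \sum_{x_i^s \in X_S}\omega_i^{sc}\phi(x_i^s) - \sum_{x_j^t \in X_T}\omega_j^{tc}\phi(x_j^t) \right\|_{\mathcal{H}}^2 \tag{10}$$
where $C$ is the number of classes, and $\omega_i^{sc}$ and $\omega_j^{tc}$ denote the weights of $x_i^s$ and $x_j^t$ belonging to class $c$, respectively.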
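The kernel form of the MMD estimate and its class-weighted LMMD counterpart can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code: it assumes a single Gaussian kernel with a hand-picked bandwidth, and it assumes that the class weights are normalized so that each class column sums to 1 (in practice the source weights would typically come from one-hot labels and the target weights from predicted class probabilities, which is an assumption not stated above). All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / bandwidth) for all row pairs."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / bandwidth)

def mmd2(xs, xt, bandwidth=1.0):
    """Squared MMD between empirical kernel mean embeddings."""
    k_ss = gaussian_kernel(xs, xs, bandwidth).mean()
    k_tt = gaussian_kernel(xt, xt, bandwidth).mean()
    k_st = gaussian_kernel(xs, xt, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

def lmmd2(xs, xt, ws, wt, bandwidth=1.0):
    """Class-wise weighted MMD. ws: (n_s, C) and wt: (n_t, C) class weights."""
    ws = ws / np.maximum(ws.sum(0, keepdims=True), 1e-12)  # normalize per class
    wt = wt / np.maximum(wt.sum(0, keepdims=True), 1e-12)
    k_ss = gaussian_kernel(xs, xs, bandwidth)
    k_tt = gaussian_kernel(xt, xt, bandwidth)
    k_st = gaussian_kernel(xs, xt, bandwidth)
    total = 0.0
    for c in range(ws.shape[1]):
        a, b = ws[:, c], wt[:, c]
        total += a @ k_ss @ a + b @ k_tt @ b - 2.0 * a @ k_st @ b
    return total / ws.shape[1]

rng = np.random.default_rng(0)
source = rng.normal(size=(20, 3))
target = rng.normal(size=(15, 3)) + 2.0   # shifted target domain
```

With a single class and uniform weights, `lmmd2` reduces to `mmd2`, which is a convenient sanity check on the weighting.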
The distributions of relevant subdomains within the same category of the source and target domains can be accurately aligned using the LMMD loss metric. Being derived from MMD, LMMD is also an explicit measure of distributional differences. However, for dynamically changing data in neural networks, a pre-defined explicit metric distance may not be sufficient to characterize the difference between distributions [62]. Therefore, a simple and effective implicit metric is needed, and CORAL is introduced. The CORAL loss is defined as the distance between the second-order statistics (covariances) of the source and target domain features [85]:
$$L_{CORAL} = \frac{1}{4d^2}\left\| C_S - C_T \right\|_F^2 \tag{11}$$
where $\|\cdot\|_F^2$ is the squared matrix Frobenius norm, $C_S$ is the source covariance matrix, $C_T$ is the target covariance matrix and $d$ is the feature dimension of the data. The covariance matrices of the source and target domain data are given by Eq. (12) and Eq. (13):
$$C_S = \frac{1}{n_S - 1}\left( D_S^{\top} D_S - \frac{1}{n_S}\left(\mathbf{1}^{\top} D_S\right)^{\top}\left(\mathbf{1}^{\top} D_S\right) \right) \tag{12}$$
$$C_T = \frac{1}{n_T - 1}\left( D_T^{\top} D_T - \frac{1}{n_T}\left(\mathbf{1}^{\top} D_T\right)^{\top}\left(\mathbf{1}^{\top} D_T\right) \right) \tag{13}$$
where $D_S$ and $D_T$ are the source and target feature matrices with one sample per row, $\mathbf{1}$ is a column vector with all elements equal to 1, and $n_S$ and $n_T$ are the total numbers of samples in the source and target domains, respectively.
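The CORAL computation described above can be checked numerically with a short NumPy sketch. It assumes feature matrices with one sample per row; the function names and the synthetic data are illustrative only.

```python
import numpy as np

def covariance(d):
    """(D^T D - (1^T D)^T (1^T D) / n) / (n - 1) for a feature matrix D."""
    n = d.shape[0]
    col_sum = d.sum(axis=0, keepdims=True)   # 1^T D, shape (1, dim)
    return (d.T @ d - col_sum.T @ col_sum / n) / (n - 1)

def coral_loss(ds, dt):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    dim = ds.shape[1]
    diff = covariance(ds) - covariance(dt)
    return (diff ** 2).sum() / (4.0 * dim ** 2)

rng = np.random.default_rng(0)
source_feats = rng.normal(size=(32, 4))
target_feats = 2.0 * rng.normal(size=(32, 4))   # different second-order statistics
loss = coral_loss(source_feats, target_feats)
```

The `covariance` helper agrees with `np.cov(d, rowvar=False)`, so the loss depends only on second-order statistics and is unchanged by shifting either domain by a constant.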
The final form of the domain discrepancy loss, Eq. (14), combines the LMMD loss and the CORAL loss, with $\gamma$ as the trade-off parameter between the two. Then, the domain discrepancy loss is combined with the classification loss, generally the cross-entropy loss $L_{CE}$, to form the total loss function of the UDA model, $L = L_{CE}(X_S, y) + \lambda L_{DL}(X_S, X_T)$. The hybrid loss function enables the UDA model to capture fine-grained information through subdomain alignment during training, and aligns the domain discrepancy both implicitly and explicitly to maximize domain adaptation, thus meeting the needs of cross-domain fruit classification.
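As a rough illustration of how the three terms fit together, the sketch below combines a cross-entropy value with the two discrepancy terms. The exact weighting inside the domain discrepancy loss is not recoverable from the text here, so the form $L_{LMMD} + \gamma L_{CORAL}$ used in the code is an assumption, as are the function names and the example numbers.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def total_loss(ce, lmmd, coral, lam, gamma=0.2):
    """Assumed form: L = L_CE + lam * (L_LMMD + gamma * L_CORAL)."""
    return ce + lam * (lmmd + gamma * coral)

ce = cross_entropy([2.0, 0.5, -1.0], label=0)   # hypothetical source sample
warmup = total_loss(ce, lmmd=0.5, coral=0.5, lam=0.0)   # early training
full = total_loss(ce, lmmd=0.5, coral=0.5, lam=1.0)     # late training
```

With `lam = 0` the model is trained on the source classification loss alone, and the discrepancy terms are phased in as `lam` grows.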

The framework of the proposed approach
The overall framework of the proposed approach is shown in Fig. 6. The approach comprises the HAM-MobileNet backbone as the feature extractor and a classification part, where additional losses are introduced to enhance the learning process and the overall domain adaptation performance. The methodological contributions of this paper are shown in the red dashed boxes. The workflow of our approach is as follows: (1) The labeled source samples and unlabeled target samples are fed into the backbone network. Features of the image samples in each mini-batch are extracted layer by layer through the convolutional layers, and the output of the last convolutional layer is used as the representation. In each mini-batch, we sample the same number of source domain and target domain images to eliminate the bias caused by domain size [86].
(2) The feature representations of the labeled samples are used to compute the classification loss, and the feature representations of the source and target domains are used to compute the domain discrepancy loss. (3) All losses are minimized by back-propagation, as shown by the dashed line with an arrow. As the model iterates, the labels predicted for the target samples usually become more accurate, and the predicted label of each target sample is output to realize cross-domain fruit classification.
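The balanced sampling in step (1) can be sketched as a small generator that always pairs equally sized source and target batches. This is a simplified illustration: the function name is hypothetical, `batch_size` is taken per domain (an assumption), and leftover samples at the end of an epoch are simply dropped.

```python
import random

def paired_batches(source, target, batch_size, seed=0):
    """Yield (source_batch, target_batch) pairs of identical size."""
    rng = random.Random(seed)
    src, tgt = list(source), list(target)
    rng.shuffle(src)
    rng.shuffle(tgt)
    # The smaller domain bounds the number of steps per epoch.
    steps = min(len(src), len(tgt)) // batch_size
    for i in range(steps):
        lo, hi = i * batch_size, (i + 1) * batch_size
        yield src[lo:hi], tgt[lo:hi]

# Hypothetical sample indices: 10 source images, 7 target images.
batches = list(paired_batches(range(10), range(100, 107), batch_size=3))
```

Each pair would then pass through the backbone together, with the source half contributing to the classification loss and both halves contributing to the domain discrepancy loss.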

Evaluation metrics
Model performance is measured using classification accuracy, which is defined as follows [87]:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The confusion matrix is used to visualize model performance.
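A small worked example of the accuracy definition, together with its multi-class reading from a confusion matrix (correct predictions on the diagonal divided by all predictions). The counts are made up for illustration and the function names are hypothetical.

```python
def binary_accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def accuracy_from_confusion(matrix):
    """Multi-class accuracy: diagonal sum over the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 2-class confusion matrix: rows are true classes, columns predictions.
confusion = [[40, 5],
             [5, 50]]
acc_binary = binary_accuracy(tp=40, tn=50, fp=5, fn=5)
acc_matrix = accuracy_from_confusion(confusion)
```

Both calls describe the same 100 predictions, so they return the same value.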
The effect of domain adaptation is evaluated using the A-distance [88]. Specifically, the proxy A-distance is defined as $d_A = 2(1 - 2\varepsilon)$, where $\varepsilon$ is the generalization error of a linear classifier trained on the binary problem of discriminating the source and target domain data. The A-distance is a simple and effective measure that is often used to quantify the similarity of data in two domains [89]. In our research, a support vector machine (SVM) with a linear kernel is selected as the classifier to distinguish the domains, and the loss of the SVM classifier is used as $\varepsilon$. In addition, t-distributed stochastic neighbor embedding (t-SNE) [90] is used to visualize the network representations. t-SNE is a powerful tool to visually verify the validity of a transfer learning algorithm [63].
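The relation between the domain classifier's error and the proxy A-distance can be made concrete with a short sketch. The training of the linear SVM itself is omitted; the code only assumes that domain predictions are already available, and the function names and example labels are illustrative.

```python
def domain_error(predicted, actual):
    """Fraction of samples whose domain (0 = source, 1 = target) is misclassified."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

def proxy_a_distance(error):
    """d_A = 2 * (1 - 2 * error)."""
    return 2.0 * (1.0 - 2.0 * error)

# Hypothetical held-out domain predictions from a linear classifier.
eps = domain_error(predicted=[0, 1, 1, 0], actual=[0, 1, 0, 0])
distance = proxy_a_distance(eps)
```

A classifier at chance level (error 0.5) gives a distance of 0, meaning the two domains are indistinguishable, while a perfect classifier (error 0) gives the maximum value of 2.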

Setup
For all methods, the HAM-MobileNet described in "Lightweight attention networks" is employed as the backbone. The input images are resized to 224 × 224 and the batch size is set to 32. Data augmentation techniques including rotation, flipping and cropping are randomly applied during training. For the proposed method in all tasks, we use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and the learning rate annealing strategy in [65]. The learning rate is not selected by a grid search due to the high computational cost; instead, it is adjusted during SGD using the following formula:
$$\eta_\theta = \frac{\eta_0}{(1 + \alpha\theta)^{\beta}}$$
where $\theta$ is the training progress changing linearly from 0 to 1, $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$. To suppress noisy activations at the early stages of training, instead of fixing the adaptation factor $\lambda$, we gradually change it from 0 to 1 by a progressive schedule:
$$\lambda_\theta = \frac{2}{1 + \exp(-10\,\theta)} - 1$$
where the constant 10 is fixed throughout the experiments [65]. Another adaptation factor, $\gamma$, is used to balance the LMMD and CORAL losses. Since the two domain discrepancy losses are of the same order of magnitude, the balance factor $\gamma$ has little effect on the final result [91], and $\gamma = 0.2$ is fixed throughout the experiments.
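The two schedules can be written as plain functions of the training progress. The sketch below uses the constants quoted above; the function names and the `steepness` argument name are illustrative choices.

```python
import math

def learning_rate(theta, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta0 / (1 + alpha * theta) ** beta."""
    return eta0 / (1.0 + alpha * theta) ** beta

def adaptation_factor(theta, steepness=10.0):
    """Progressive schedule 2 / (1 + exp(-steepness * theta)) - 1, rising from 0 toward 1."""
    return 2.0 / (1.0 + math.exp(-steepness * theta)) - 1.0

# Training progress theta runs linearly from 0 (start) to 1 (end).
schedule = [(t / 4, learning_rate(t / 4), adaptation_factor(t / 4)) for t in range(5)]
```

At the start of training the learning rate equals 0.01 and the adaptation factor is exactly 0, so the domain discrepancy loss has no influence until the features become less noisy.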
Several state-of-the-art feature-based UDA methods are also applied to cross-domain fruit classification for comparison with the proposed method. JAN [68] learns a transfer network by aligning the joint distributions of multiple domain-specific layers across domains based on a JMMD criterion. The dynamic distribution adaptation network (DDAN) [70] is a deep transfer learning architecture that dynamically and quantitatively adapts the marginal and conditional distributions with an adaptive factor. Batch nuclear-norm maximization (BNM) [92] is a learning paradigm that maximizes the batch nuclear norm to ensure higher prediction discriminability and diversity. The multi-representation adaptation network (MRAN) [86] aligns the distributions of multiple representations extracted by a hybrid structure named the Inception Adaptation Module (IAM). These methods extend and improve the classic deep UDA model and have achieved good results, so they are used as the baselines.
For all MMD-based methods, we adopt a Gaussian kernel with the bandwidth set to the median pairwise squared distance on the training data [93]. For other settings in the baseline methods, we follow their original settings for fair comparison. The average classification accuracy over three random experiments is reported for comparison. We implement all methods in PyTorch with Python 3.7.10, and all experiments were run on Windows 10, on a personal computer with an 8-core CPU (Intel Core i7-10875H at 2.30 GHz), 16 GB of DDR4 RAM and an NVIDIA GeForce RTX 2060 GPU with CUDA 10.1 and 6 GB of memory.
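The median heuristic for the kernel bandwidth can be sketched as follows. This is an illustrative NumPy version that computes the median over all distinct pairs of rows in one feature matrix; whether the median is taken over source, target or pooled features is not specified above, so the pooling choice is left to the caller. The function name is hypothetical.

```python
import numpy as np

def median_bandwidth(x):
    """Median of squared Euclidean distances over all distinct pairs of rows."""
    sq_norms = (x ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    upper = np.triu_indices(x.shape[0], k=1)   # each unordered pair once
    return float(np.median(np.maximum(sq[upper], 0.0)))

# Three 1-D points with pairwise squared distances 1, 9 and 4.
points = np.array([[0.0], [1.0], [3.0]])
bandwidth = median_bandwidth(points)
```

Setting the bandwidth from the data in this way keeps the Gaussian kernel values away from the degenerate regimes where all pairs look identical or all pairs look unrelated.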

Comparative results of different backbones
To demonstrate the effectiveness of the proposed HAM-MobileNet, we first perform classification directly on the Domain-IN and Domain-VegFru datasets, since these two domains have diverse image sources and complex scene distributions with the most obvious fine-grained characteristics. Several other powerful networks, including AlexNet, ResNet 50, VGG 16 [26], DenseNet 121 [94], MnasNet [95], ShuffleNet V2 [35] and EfficientNet-B0 & B4 [38], are introduced for comparison. Among them, MnasNet, EfficientNet-B0 and ShuffleNet are lightweight network architectures similar to the MobileNet used in our research. Multiply-adds (MAdds) and the number of parameters (Params) are used to measure resource usage in the same way as in [33]. For model training, the ratio of the training set, validation set and test set is 6:2:2. All models converge to a stable level after several epochs, and the model with the highest accuracy on the validation set is selected for testing. The accuracy and resource cost are shown in Table 3.
Overall, the EfficientNet-B4 architecture achieves the best results, illustrating the potential of the compound model scaling used in the EfficientNets. The HAM-MobileNet achieves results comparable to those of the ResNet 50 architecture, while its Params and MAdds are only 12.6% and 5.8% of those of ResNet 50, which greatly reduces the consumption of computing resources. Although there are more lightweight networks such as ShuffleNet V2 and MobileNet V3-small, they do not perform as well as the HAM-MobileNet. The HAM-MobileNet can achieve a balance between performance and efficiency. Compared with the original MobileNet V3-large, the HAM-MobileNet has improved accuracy on both datasets, which demonstrates that the proposed hybrid attention module can better explore the subtle features of fruit datasets and suppress complex background distraction. At the same time, the number of parameters and the computational cost of the HAM-MobileNet hardly increase.
Further, we explore the performance of HAM-MobileNet on the UDA tasks. UDA tasks with Domain-IN and Domain-VegFru as source domains are selected for validation, and the model is optimized using the proposed hybrid loss function. The results are shown in Table 4. The UDA model with HAM-MobileNet as the backbone network achieves the third-best classification performance, and its average accuracy is only 0.2% lower than that of the best model, which uses ResNet 50 as the backbone, and 0.1% lower than that of the model with the EfficientNet-B4 backbone. Compared with the original MobileNet V3-large, the HAM-MobileNet architecture brings performance gains on almost all UDA tasks in Table 4. This demonstrates that HAM-MobileNet can extract rich and powerful fine-grained features to empower UDA tasks.

Comparative results of different losses
In this part, the backbone network is fixed as HAM-MobileNet. The classification results on the grape classification dataset and the general fruit classification dataset are shown in Tables 5 and 6, and Figs. 7, 8 and 9 show the confusion matrices as visual results. The experimental results reveal some insightful observations: (1) As shown in Tables 5 and 6, without applying UDA, the average accuracy of the backbone network on the grape classification dataset and the general fruit classification dataset is only 73.5% and 70.5%, respectively. Domain drift due to interference [...] idea for fruit classification application: when faced with tasks in specific scenarios (such as laboratories, production lines and supermarket self-checkouts), the UDA classification model can be built with the help of public datasets or labeled images collected on the Internet, without having to manually annotate a large number of images from the specific scenes. Obviously, this can save a lot of manual annotation costs, and the source domain dataset has the versatility to extend to various target domains [20,97].
[...] (Figs. 10a-12a), the target features are very scattered, and some are mixed together and indistinguishable. Features are more discriminative when UDA methods are applied. We can also observe that even when the visualizations of UDA methods are similar, the features of some classes are more compact with the proposed method, as reflected especially in Fig. 10b and c.
We use the A-distance described in "Evaluation metrics" to quantify the domain distribution discrepancy. Figure 13 shows the A-distance values on CE-IN, IN-CE and VegFru-SPD for the representations of the backbone only, the backbone with MMD and the proposed method. The A-distance value of the proposed method is the smallest on all three tasks, which shows that, among the compared settings, the proposed method reduces the gap between domains the most.

Conclusions
In this paper, a feature-based unsupervised domain adaptation method for cross-domain fruit classification is proposed. Two fruit classification datasets covering different domains are established to explore the stability of the UDA model under differences in fruit classification scenarios. The novel HAM-MobileNet model is used as the backbone network, and a hybrid domain discrepancy loss based on subdomain alignment and implicit distribution metrics is proposed to build the cross-domain classification models. Experimental results show that the proposed method outperforms some state-of-the-art methods on both datasets, verifying its effectiveness. The main conclusions are as follows: (1) With the help of the proposed HAM, the HAM-MobileNet is able to extract more expressive features, adapted to fruit images with fine-grained characteristics, while hardly increasing the model complexity. The experimental results show that HAM-MobileNet can extract rich and powerful fine-grained features to empower UDA tasks. (2) The proposed hybrid loss function combining subdomain alignment and implicit distribution metrics can characterize the domain discrepancy in the fruit classification task well, with average accuracies of 95.0% and 93.2% on the grape classification dataset and the general fruit classification dataset, respectively. (3) Visualization results and the A-distance indicate that the proposed method reduces the gaps between domains more than the compared methods. By leveraging the knowledge of the annotated source domain images, manual labeling of the new target domain images can be avoided, saving considerable labor and time. (4) By implementing the UDA method, the fruit classification model can adapt well to the common illumination and background changes in industrial and daily life scenes without additional manual annotation.
Meanwhile, when faced with tasks in specific scenarios (such as laboratories, production lines and supermarket self-checkouts), the UDA classification model can be built with the help of public datasets or labeled images collected on the Internet, without having to manually annotate a large number of images from the specific scenes. Obviously, this can save a lot of manual annotation costs, and the source domain dataset has the versatility to extend to various target domains. In future extensions of our research, we plan to further compress the UDA model and deploy it to smart devices, for example by developing a fruit classification application on a smartphone. At the same time, we will continue to expand the application scenarios and explore how to build a fruit classification model when the target domain is unseen, which may involve domain generalization techniques.

Availability of data and materials
The datasets will be available at https://github.com/op99pp/fruit_classification_zjurobotics.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.