Survey on cross-domain clothing image retrieval

This paper summarizes research progress on critical region recognition and deep metric learning for accurate clothing image retrieval in cross-domain situations. Critical region recognition is of great value for clothing feature extraction and effectively improves retrieval accuracy. Accuracy decreases, however, on difficult samples with similar features but different categories. Deep metric learning is currently an effective way to address this problem: it strengthens the discrimination of clothing features by optimizing different loss functions and by ensembling networks. By comparing the experimental results of different algorithms and analyzing the accuracy of cross-domain clothing retrieval, we show that future improvements in retrieval accuracy will depend mainly on extracting the important features of clothing and on making those features more discriminative.


Introduction
Clothing image retrieval is the task of recognizing a given clothing image and recommending clothing images with similar styles. It is widely used on e-commerce platforms and in search, by services such as Taobao, Jingdong, Baidu, and Google. People like to take photos in daily life and then look for their favorite clothing on the Internet, and cross-domain clothing retrieval technology helps them find similar styles quickly and accurately. This not only meets the needs of daily life and improves quality of life, but also promotes consumption in the clothing industry. According to fashion industry surveys, the domestic and foreign fashion markets are growing steadily: by 2024, the domestic fashion market is expected to grow by 8.8% over its 2020 size, reaching 26.288 million US dollars [1,2]. (Correspondence: Chen Ning, chennvictor@gmail.com; Yang Di, 1564773884@qq.com; Li Menglu, 1035712318@qq.com; Xi'an Polytechnic University, Xi'an, Shaanxi, China.) Although clothing image retrieval technology has made great progress in the past 10 years, retrieval in cross-domain situations still faces great challenges. Cross-domain clothing retrieval means that the query image and the retrieval database come from two different scene domains; typically, clothing images in online shopping malls are retrieved from daily street photos based on their similarity. Recent surveys identify two main difficulties. (1) Clothes are flexible items, and their appearance can differ greatly under different shooting angles or on different body types. The query image provided by the user may be taken under complex conditions, with a cluttered background, varied shooting angles and lighting, and even occlusion.
In contrast, most shop images have clean backgrounds, good lighting, and frontal angles. (2) Large intra-class variance and small inter-class variance are inherent characteristics of clothing images. For example, two dresses from different categories may be very similar in color and design but differ subtly in the shape of the neckline, one V-shaped and the other U-shaped. Given a user image of a V-neck dress, returning a U-neck dress is not considered a correct result by the retrieval system.

Motivation
Clothing image retrieval in cross-domain situations is widely used in daily life. In online shopping, when a user provides a photo from daily life, the retrieval system can return clothing images with the same or similar characteristics, which reduces the system's dependence on text and lets the desired clothing images be retrieved more directly and accurately. Cross-domain clothing image retrieval is also widely used in physical stores. To purchase stock accurately and avoid inventory backlog, store owners need to understand the clothing preferences of people in nearby blocks. The traditional approach is to manually count and classify the clothing styles of consumers around the store. If a program can instead automatically photograph nearby pedestrians and analyze the attributes of their clothes, the number of observed consumers can be greatly increased while time and labor costs are reduced, providing a stronger basis for purchasing decisions. Therefore, research on cross-domain clothing image retrieval has far-reaching significance for both individuals and society.
In practice, on the major domestic e-commerce platforms, clothing images are mainly retrieved through keywords or text, i.e. searching for pictures by text. This technique requires that clothing images be finely classified and labeled accordingly. With the explosive growth of clothing images, the shortcomings of this method become more and more obvious. First, keywords can only describe easy-to-extract, abstract semantic features and cannot fully reflect the visual features of clothing images, especially fine, hard-to-describe features. Second, because of the huge number of clothing images, manual labeling consumes a great deal of human and material resources and is prone to subjective bias. Finally, if the keywords entered by the user are not accurate enough, it is difficult to retrieve the desired product. Therefore, this paper studies content-based cross-domain clothing image retrieval, summarizing and evaluating it from different technical perspectives, hoping to inspire researchers and identify new research hotspots for future work.

Previous work
With the development of deep learning [3][4][5][6], the framework of cross-domain clothing image retrieval has taken the shape shown in Fig. 1. It mainly includes two key steps: feature extraction and similarity measurement. For feature extraction, critical region recognition methods are generally used to identify the important areas of clothing. New similarity measurement methods have also appeared; at present, the best results are obtained with deep metric learning. As shown in Table 1, different methods from past studies are summarized based on the latest research survey. In cross-domain clothing retrieval, deformation, occlusion, complex backgrounds, and other phenomena occur, which challenge retrieval accuracy. At present, most critical region recognition methods detect foreground objects before extracting features. The main purpose is to suppress background differences, enhance the identification of relevant local details, and provide more discriminative features during feature extraction, making it easier to distinguish different types of objects. Another major challenge of cross-domain clothing image retrieval is to distinguish similar images of different categories and to cluster images of the same category that differ greatly. Deep metric learning maps images to feature vectors through deep neural networks; in the resulting space, Euclidean or cosine distance can be used directly as the distance between two points. The contribution of many deep metric learning algorithms is to design a loss function that learns more discriminative features. A large body of work therefore studies deep metric learning and its loss functions, including the contrastive loss, the triplet loss and its more complex variants, and ensemble methods that combine the outputs of multiple networks.

Bounding box method
The bounding box method uses detection methods to identify the clothing regions in an image and marks them with rectangular boxes, as shown in Fig. 2. The purpose is to separate the clothing from the complex background and other external environmental factors during retrieval, enhancing the neural network's feature extraction of the clothing itself. Kiapour et al. [7] used a selective search method [8], filtered out any candidate region whose width was less than one-fifth of the image width, and directly used manually labeled clothing bounding boxes to limit the influence of background regions and obtain more accurate retrieval; further steps help reduce some of the variability observed across different online stores and item descriptions. Chen et al. [9] made improvements to clothing detection based on the R-CNN [10] object detection method, using selective search to generate region proposals and a Network-in-Network (NIN) model to extract features from the local regions. Huang et al. [11] then embedded additional semantic information in the tree-structured layers of an attribute-aware network; after obtaining attribute-aware deep features, they used Support Vector Regression (SVR) to predict the overlap ratio of each candidate box, limited the size range and aspect ratio of the bounding box, and discarded inappropriate candidates, thereby improving the localization of the clothing bounding box. In general, with the development of object detection [12], the bounding box method is relatively easy to implement, and both the speed and accuracy of recognition have improved.
However, for more complicated clothing images with cluttered backgrounds, varied human postures, and occlusion, the features extracted from the region set by the bounding box contain many interfering features, so retrieval accuracy decreases.
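As a rough illustration of how such proposal filtering works, the sketch below applies the one-fifth-width heuristic described for Kiapour et al. [7] to a set of hypothetical candidate boxes; the box coordinates and the `filter_proposals`/`crop_region` helpers are our own illustrative assumptions, not code from the cited work.

```python
import numpy as np

def filter_proposals(boxes, image_width, min_frac=0.2):
    """Discard candidate boxes narrower than a fraction of the image width,
    following the one-fifth-width heuristic described for [7]."""
    kept = []
    for (x1, y1, x2, y2) in boxes:
        if (x2 - x1) >= min_frac * image_width:
            kept.append((x1, y1, x2, y2))
    return kept

def crop_region(image, box):
    """Crop the clothing region so later feature extraction sees less background."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

# Toy example: a 100x200 "image" and three hypothetical candidate boxes.
image = np.zeros((100, 200, 3), dtype=np.uint8)
proposals = [(0, 0, 30, 50), (20, 10, 120, 90), (50, 5, 199, 95)]
kept = filter_proposals(proposals, image_width=200)
crops = [crop_region(image, b) for b in kept]
print(len(kept))  # the 30-pixel-wide box is discarded
```

The cropped regions, rather than the whole image, would then be fed to the feature extraction network.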

Human body landmark recognition method
The human body landmark recognition method focuses on the limbs of the person wearing the clothes. As shown in Fig. 3, it identifies the important regions of clothing according to important body parts, and then uses a convolutional neural network to analyze the features of these regions.
One line of research builds on the human body pose estimation method proposed by Marcin Eichner's team [13] to detect important nodes of the human body: the clothing regions are associated with the important body parts and divided into 9 parts, namely the torso, the left and right upper arms, the left and right lower arms, the left and right upper legs, and the left and right lower legs [14,15]. In the recognition process, upper-body detection [16] and face detection [17] are combined to estimate the upper-body region. On this basis, the human body is segmented with the GrabCut algorithm [18], and finally the appearance estimation model of [13] is used to estimate the posture of the human body and further divide the important regions. This method uses the landmarks of the human body to detect the important parts of the clothing, and it can still detect the clothing features of the relevant parts when the posture in the image is complex. However, when some regions are occluded, its expressive power is limited.
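To make the region-division idea concrete, the following sketch derives padded bounding boxes for clothing regions from hypothetical body keypoints. The keypoint coordinates, the `limb_box` helper, and the padding value are all illustrative assumptions, not the actual method of [13-15].

```python
# Hypothetical keypoints (x, y) for one person; in practice they would come
# from a pose estimator such as the model of Eichner et al. [13].
keypoints = {
    "l_shoulder": (40, 30), "r_shoulder": (80, 30),
    "l_elbow": (30, 60), "r_elbow": (90, 60),
    "l_hip": (45, 90), "r_hip": (75, 90),
}

def limb_box(p, q, pad=10):
    """Bounding box around the segment joining two keypoints, padded so that
    the clothing covering that body part is included."""
    (x1, y1), (x2, y2) = p, q
    return (min(x1, x2) - pad, min(y1, y2) - pad,
            max(x1, x2) + pad, max(y1, y2) + pad)

# Torso region from a shoulder-hip pair; upper-arm region from shoulder-elbow.
torso = limb_box(keypoints["l_shoulder"], keypoints["r_hip"])
l_upper_arm = limb_box(keypoints["l_shoulder"], keypoints["l_elbow"])
print(torso, l_upper_arm)
```

Features would then be extracted per region, so that, e.g., sleeve features come from the arm boxes rather than the whole image.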

Clothing landmark recognition method
Clothing landmark recognition directly detects landmarks defined on the clothing itself; it is a newer way to locate the important regions of clothing, as shown in Fig. 4.
The DeepFashion database proposed by Liu et al. [19] defines a set of clothing landmarks corresponding to key points on the clothing structure. For example, the landmarks on the upper body are the left/right collar ends and the left/right cuffs; landmarks for the left/right lower ends, lower-body clothing, and full-body clothing are also defined. Because some landmarks in an image are often occluded, the visibility of each landmark is also annotated. On this dataset, the Deep Fashion Alignment (DFA) network [20] was proposed to detect landmarks. DFA consists of three stages, each taking the output of the previous stage as input, with VGG-16 as the backbone. In the first stage, DFA uses the original image as input to predict rough landmark locations and pseudo-labels, where the pseudo-labels represent properties such as clothing category and posture. In the second stage, the network predicts the offsets of the landmarks, and the pseudo-labels represent the offsets of the local landmarks. The third stage uses two CNN branches with the same input and output; the choice of branch is determined by the pseudo-label of the second stage. DeepFashion also has its shortcomings: each image contains only one piece of clothing, and each clothing category has only 4 to 8 landmarks. To address this, the DeepFashion2 database [21] was proposed, with more landmark labels and annotation information. On this database, the Match R-CNN model was proposed, composed of three parts: a Feature Network (FN), a Perception Network (PN), and a Matching Network (MN). The query image passes through the FN and is input to the PN, where landmark positions are obtained through convolutional and deconvolutional layers; finally, the MN performs clothing retrieval.
These two large clothing databases provide strong support for future clothing retrieval research. The clothing landmark recognition method is also better at handling clothing deformation, occlusion, and details, and retrieval accuracy is greatly improved; however, it requires a large amount of labeling and annotation information, and is very demanding in terms of professional knowledge of the clothing industry.
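A common way to exploit predicted landmarks, sketched below under our own simplifying assumptions (the `landmark_pool` helper and window size are illustrative, not the exact DFA or Match R-CNN design), is to pool convolutional features around each landmark and concatenate them into a local descriptor:

```python
import numpy as np

def landmark_pool(feature_map, landmarks, window=1):
    """Average-pool a small window around each predicted landmark on a CxHxW
    feature map, then concatenate the pooled vectors into one local descriptor.
    Landmarks are (row, col) positions on the feature map grid."""
    c, h, w = feature_map.shape
    pooled = []
    for (r, col) in landmarks:
        r0, r1 = max(r - window, 0), min(r + window + 1, h)
        c0, c1 = max(col - window, 0), min(col + window + 1, w)
        pooled.append(feature_map[:, r0:r1, c0:c1].mean(axis=(1, 2)))
    return np.concatenate(pooled)

fmap = np.random.rand(8, 14, 14)           # e.g. a conv feature map
lmks = [(3, 4), (3, 9), (10, 4), (10, 9)]  # e.g. collar ends and cuffs
desc = landmark_pool(fmap, lmks)
print(desc.shape)  # (32,) = 8 channels x 4 landmarks
```

The descriptor concentrates on the clothing structure around the landmarks, which is what makes this family of methods robust to deformation.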

Attention map recognition method
The attention map recognition method uses the idea of the attention mechanism to extract the image features of the saliency or visual attention regions from the original image, and usually needs to combine the attribute information of the clothing to complete the clothing retrieval task, as shown in Fig. 5.
Clothing attributes usually refer to semantic attributes of objects or scenes shared across categories. Clothing attributes include color, texture, fabric, and style, so attributes can serve as latent, interpretable connections between image content and abstract labels. By constructing a latent space between fine-grained labels and low-level features, they help models find inter-class and intra-class correlations between clothing categories. The attention map recognition method does not require large numbers of human-labeled bounding boxes or landmark annotations: it extracts an effective image representation from the spatial locations of the salient regions, reducing annotation cost while maintaining the quality of clothing image retrieval.
Most popular attention map recognition algorithms use image attributes as external information to locate attention on the database images, and use the database images as context to infer the attention of the query image. The attention model ignores the noisy background and extracts discriminative features for retrieval. Wang et al. [22] proposed a deep convolutional neural network system, TagCtxYNet, which includes a convolutional layer for image feature extraction and an attention layer for spatial attention modeling; it extracts an effective representation of the image by learning attention weights. Gu et al. [23] proposed a self-learning Visual Attention Model (VAM) to extract attention maps from clothing images. It includes two branch networks: a global branch based on a CNN, which extracts low-level features of the image to produce the feature map, and an attention branch that introduces a Fully Convolutional Network (FCN) [24] to predict the salient regions of the image and produce the attention map. An Impdrop module connects the two branches to obtain the attention feature map, introducing randomness between the attention map and the feature map. This randomness reduces the risk of overfitting and lets the neural network learn more robust features, improving the robustness of the model. Zheng et al. [25] proposed an Attention-Based Region Transfer (ART) module to highlight the importance of the foreground in a coarse, class-agnostic way. The attention mechanism over high-level features extracts the foreground objects of interest and marks them while the feature distributions are aligned; through multi-layer adversarial learning, effective cross-domain retrieval can be achieved without relying on complex detection models.
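The two-branch fusion idea can be sketched as follows; the `attention_fuse` function and its random dropping of attention values are a loose, illustrative approximation of the Impdrop connection in [23], not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_fuse(feature_map, attention_map, drop_p=0.0):
    """Weight a CxHxW feature map by an HxW attention map (values in [0, 1]).
    With drop_p > 0, attention values are randomly dropped, a rough sketch of
    the randomness the Impdrop connection in [23] injects to reduce overfitting."""
    if drop_p > 0:
        keep = rng.random(attention_map.shape) >= drop_p
        attention_map = attention_map * keep
    return feature_map * attention_map[None, :, :]

features = np.ones((4, 6, 6))       # stand-in for the global-branch feature map
attention = np.zeros((6, 6))
attention[2:4, 2:4] = 1.0           # stand-in for a salient clothing region
fused = attention_fuse(features, attention)
fused_drop = attention_fuse(features, attention, drop_p=0.5)
print(fused.sum())  # only the attended 2x2 region survives in each channel
```

Features outside the attended region are suppressed before they reach the retrieval head, which is how background noise is removed.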
Attribute learning models usually treat attribute prediction as a multi-label classification problem, treating each attribute as a category. In fact, each training clothing image is associated with a sequence of attributes, such as "silk pocket shirt", but traditional attribute learning models ignore this sequence information. Although [22,23] combine the attention feature map and the image feature map to find a more effective feature representation and improve retrieval, they lack local information and do not study the contextual connections between different parts of the clothing. Luo et al. [26] proposed an attention-based learning strategy for clothing image retrieval that integrates global and local information, since the two provide complementary mechanisms and together describe clothing images accurately; it uses a Long Short-Term Memory (LSTM) mechanism [27] to model the top-down spatial relationships between different parts of the clothing and obtain more discriminative feature representations. Luo et al. [28] proposed a Deep Multi-task Cross-domain Hashing (DMCH) method that jointly models the sequential correlation between clothing attributes and learns attention-aware visual features of clothing images to further improve cross-domain clothing image retrieval.

Siamese network
Chopra et al. [29] first applied the contrastive loss function to a Siamese network based on deep neural networks. Kiapour et al. [7] used a Siamese network to predict whether two features represent the same category. Sean Bell et al. [30] used the traditional contrastive loss to design an end-to-end Siamese network. Huang et al. [11] proposed a Dual Attribute-aware Ranking Network (DARN) for feature learning based on the Siamese network. All in all, the contrastive loss of the Siamese network is the most widely used pairwise loss in metric learning.
As shown in Fig. 6, the Siamese network architecture has two parallel feature networks, followed by an L2 normalization operation and a contrastive loss. Jia et al. [31] defined the contrastive loss function as shown in Eq. (1):

L(x_1, x_2, y) = y D(f(x_1), f(x_2))^2 + (1 - y) max(0, m - D(f(x_1), f(x_2)))^2    (1)

where f(.) is an embedding function that maps an image to a feature vector, y is a label equal to 1 for a matching pair and 0 otherwise, and D(., .) is the distance between two feature vectors. The margin parameter m forces the distance between images of different categories to increase, which helps the learned ordering. On this basis, Xiong et al. [32] proposed a contrastive loss function with bilateral distance parameters, as shown in Eq. (2):

L(x_1, x_2, y) = y max(0, D(f(x_1), f(x_2)) - PM)^2 + (1 - y) max(0, NM - D(f(x_1), f(x_2)))^2    (2)
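A minimal NumPy version of the contrastive loss of Eq. (1) might look as follows; the convention y = 1 for a matching pair and the margin values are our illustrative assumptions:

```python
import numpy as np

def contrastive_loss(f1, f2, y, m=1.0):
    """Contrastive loss of Eq. (1): y = 1 for a matching pair, 0 otherwise.
    Matching pairs are pulled together; non-matching pairs are pushed apart
    until their distance exceeds the margin m."""
    d = np.linalg.norm(f1 - f2)
    return y * d**2 + (1 - y) * max(0.0, m - d)**2

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
pos_loss = contrastive_loss(a, b, y=1)          # identical positives: 0.0
neg_loss = contrastive_loss(a, c, y=0, m=2.0)   # negatives inside the margin are penalized
print(pos_loss, neg_loss)
```

In training, the two feature vectors would come from the two parallel branches of the Siamese network, and the loss would be averaged over a mini-batch of pairs.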
Here, if the positive margin (PM) parameter equals the negative margin (NM), the loss is called a symmetric double margin; otherwise it is an asymmetric double margin. The positive margin allows matching clothing images to remain somewhat diverse, which is more reasonable than forcing them to be exactly the same. Wang et al. [33] optimized the contrastive loss by adding penalty constraints and proposed a robust contrastive loss function to improve the generalization ability of the learned network.

Triplet network and variants
The triplet loss [34] is widely used in triplet network models and has achieved good results in cross-domain clothing image retrieval. The structure of the triplet network is shown in Fig. 7: three parallel feature networks map images into feature vectors, the vectors are normalized, and they are fed into the triplet loss function. The triplet loss enlarges the distance between images of different clothing and shrinks the distance between images of the same clothing.
Different from the contrastive loss function, which considers the absolute distance of a pair, the triplet loss considers the relative distances from the same reference sample to a positive and to a negative sample, as defined in Eq. (3):

L = sum_i max(0, D(f(a_i), f(p_i))^2 - D(f(a_i), f(n_i))^2 + m)    (3)

where a_i, p_i, and n_i denote the reference (anchor) sample, the positive sample, and the negative sample, respectively; a_i and p_i have the same label, while a_i and n_i have different labels, and m is the margin between the positive and negative pairs.
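The triplet loss of Eq. (3) for a single triplet can be sketched in a few lines; the margin value and the toy vectors are illustrative choices:

```python
import numpy as np

def triplet_loss(a, p, n, m=0.5):
    """Triplet loss of Eq. (3) for one triplet: the anchor-positive distance
    should be smaller than the anchor-negative distance by at least margin m."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, d_ap**2 - d_an**2 + m)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same clothing item
negative = np.array([1.0, 1.0])   # different item
ok_loss = triplet_loss(anchor, positive, negative)   # already well separated: 0.0
bad_loss = triplet_loss(anchor, negative, positive)  # violated ordering is penalized
print(ok_loss, bad_loss)
```

Summing this quantity over all sampled triplets in a mini-batch gives the full objective of Eq. (3).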
Since a triplet contains a reference sample, a positive sample, and a negative sample, N images can generate O(N^3) triplets. Even for a medium number of images it is impossible to consider them all, and not all triplets provide equally useful information for training. Randomly selecting triplets is a very inefficient way to train a deep embedding network, which has inspired much recent work on mining difficult samples for training. Wang et al. [35] randomly selected triplets in the first 10 rounds of training and mined difficult triplets in each mini-batch afterwards. The authors of [36] manually marked difficult negative images among the images assigned high confidence scores in each round. Simo-Serra et al. [37] analyzed the impact of hard positive and hard negative mining and found that combining the two improves discrimination ability. Song et al. [38] designed a mini-batch triplet loss that considers all possible triplet associations within the mini-batch. Liu et al. [39] proposed a cluster-level triplet loss that considers the correlation between the cluster center, the positive sample, and the nearest negative sample. Ge et al. [40] introduced the Hierarchical Triplet Loss (HTL) to solve the random sampling problem during triplet training. These studies address how to mine difficult samples during training. They use more informative triplets, which not only speeds up the convergence of the learning algorithm, but also uses the positive and negative samples of a given reference to learn clearer margins and better improve the global structure of the embedding space.
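A generic batch-level hard-mining heuristic, in the spirit of the work above but not the exact scheme of any one cited paper, can be sketched as follows: for each anchor, take the farthest positive and the closest negative in the mini-batch.

```python
import numpy as np

def batch_hard_triplets(embeddings, labels):
    """For each anchor in a mini-batch, select the farthest positive and the
    closest negative; returns (anchor, positive, negative) index triplets."""
    # Pairwise Euclidean distance matrix over the batch.
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=2)
    triplets = []
    for i, y in enumerate(labels):
        pos = [j for j, yj in enumerate(labels) if yj == y and j != i]
        neg = [j for j, yj in enumerate(labels) if yj != y]
        if pos and neg:
            hard_p = max(pos, key=lambda j: d[i, j])  # hardest positive
            hard_n = min(neg, key=lambda j: d[i, j])  # hardest negative
            triplets.append((i, hard_p, hard_n))
    return triplets

emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.2, 0.0]])
labels = [0, 0, 1, 1]
triplets = batch_hard_triplets(emb, labels)
print(triplets)
```

Only these mined triplets, rather than all O(N^3) combinations, are then fed to the triplet loss.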
However, methods based on difficult sample mining aim to find, among the existing training samples, the triplets that are difficult for the current network. This is essentially a greedy algorithm, which makes the trained feature embedding network vulnerable to bad local optima [41]. Therefore, Zhao et al. [42] sought a method that intentionally generates difficult triplets to optimize the network as a whole, instead of greedily exploring existing samples for the current network only. As shown in Fig. 8, to generate difficult triplets, a Hard Triplet Generation (HTG) network is proposed, optimizing the network's ability to distinguish similar samples of different categories and to group dissimilar samples of the same category.
Chopra et al. [43] proposed a novel Grid Search Network (GSN) to learn feature embeddings for clothing retrieval. Similar to triplet network variants, this method treats the training process as a search problem: it finds matches of a reference sample image in a grid containing positive and negative images. The framework also uses reinforcement learning-based strategies to learn a special feature vector transformation function instead of simply concatenating feature vectors; applied to feature embedding networks, it further improves clothing image retrieval accuracy. Kuang et al. [44] proposed a Graph Reasoning Network (GRNet) with a similarity pyramid, which measures the similarity between query clothing images and those in the database.

Ensemble network
Ensembling is a widely used method that trains multiple learners to obtain a combined model whose performance exceeds that of a single model [45,46]. For deep metric learning, an ensemble network concatenates the feature embeddings learned by multiple learners; under the constraint of the distance between a given image pair, a better embedding space can usually be obtained. A good ensemble depends on the high performance of the individual learners and on the diversity between them. However, in deep metric learning there is not much research on the optimal architecture for generating diverse feature embeddings.
As discussed above, deep metric learning with Siamese or triplet networks has achieved good results: images of the same clothing category are pulled closer, and images of different categories are pushed apart. However, the objective is difficult to optimize directly because of the number of samples. Difficult sample mining is therefore widely used, at the cost of expensive computation on the subset of samples considered difficult. Moreover, difficulty is defined relative to a specific model: a complicated model treats most samples as easy, while a simple model treats most samples as difficult, and neither situation is conducive to training. Since different samples have different difficulty levels, it is hard both to define a model of moderate complexity and to select difficult samples comprehensively. To address these problems, we summarize and analyze the different methods put forward by researchers.
The triplet loss and its variants discussed above mine difficult sample images with a single model and cannot make full use of samples of different difficulty levels. Therefore, Yuan et al. [47] proposed the Hard-Aware Deeply Cascaded (HDC) embedding model, which uses a cascade of increasingly complex models to mine negative samples of different difficulty levels during training. Taking advantage of deeply supervised networks [48,49], they train the lower layers of the network with a contrastive loss to handle easier samples, and the higher layers to handle more difficult samples. Compared with this multi-layer method, the Boosting Independent Embeddings Robustly (BIER) model [50] uses a high-dimensional embedding ensemble: it focuses on reducing correlation within a single layer, dividing the high-dimensional embedding into several learners trained with Online Gradient Boosting (OGB). Successive learners are trained on reweighted samples, which greatly reduces the correlation between learners and within the embedding, improving its robustness. In addition, compared with the HDC model, this method continuously weights samples according to the loss function. Inspired by BIER, Xuan et al. [51] proposed a different way to learn a robust, high-dimensional embedding space, which creates independent output embeddings without reweighting the input samples.
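The embedding-splitting idea behind BIER-style ensembles can be sketched as follows; the group count and the independent L2 normalization are illustrative simplifications, and BIER's actual online gradient boosting training is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_embedding(x, n_learners):
    """Split one high-dimensional embedding into equal groups, one per learner,
    in the spirit of BIER [50]; each group is L2-normalized independently."""
    groups = np.split(x, n_learners)
    return [g / np.linalg.norm(g) for g in groups]

def ensemble_embedding(groups):
    """Concatenate the per-learner embeddings into the final descriptor."""
    return np.concatenate(groups)

x = rng.standard_normal(12)               # stand-in for a 12-dim network output
groups = split_embedding(x, n_learners=3)
final = ensemble_embedding(groups)
print(final.shape)  # (12,)
```

Each group would be trained with its own (reweighted) loss so the learners decorrelate; at retrieval time only the concatenated descriptor is used.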
As an important aspect of ensemble networks, learners should be diverse in their feature embeddings. Kim et al. [52] therefore proposed an Attention-based Ensemble (ABE) model, shown in Fig. 9: (a) ordinary ensemble learning; (b) attention-based ensemble learning. The model ensembles multiple learners, each paired with an attention module, so that different learners attend to different parts of the image and the diversity of the learned embeddings increases.

Clothing databases
Table 2 gives a detailed introduction to the clothing databases. Popular clothing databases from recent years differ in size and in the types of annotations. For example, Street2Shop and DARN contain 425K and 540K clothing images, respectively. They contain two types of images: (1) street images, photos of people actually wearing clothes in uncontrolled everyday conditions; and (2) shop images, clothing images from online clothing stores, shot by professionals in a more controlled environment. Since the clothing category tags are extracted from the metadata of images collected from online shopping sites, these tags contain many errors and much confusion. DeepFashion and ModaNet obtain labels by manually annotating clothing categories. In addition, different types of annotations are provided with these databases. DeepFashion is a large clothing database with comprehensive annotations and four benchmarks; among them, the Consumer-to-shop Benchmark pairs street images with shop images, each clothing item's folder containing one street image and several shop images, for a total of 33,881 clothing items and 239,557 clothing images. Each image has 4 to 8 clothing functional regions (such as "collar") and other related fashion labels. The definitions of these fashion landmarks are shared across all clothing categories, which makes it difficult for them to capture the rich variety of clothing images. In contrast, ModaNet's street images have a single-person mask but no landmarks. Unlike the datasets above, DeepFashion2 contains 491K images with 801K annotations of landmarks, masks, and bounding boxes, and 873,000 image pairs, making it the most comprehensive clothing benchmark.

Experiment preparation
The hardware and software environment used in the experiments is an Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz processor, an NVIDIA GeForce GTX 1070 8 GB graphics card, and 8 GB of memory. The operating system is Ubuntu 16.04, the programming language is Python, and the deep learning framework is PyTorch. As shown in Table 3, the datasets used in this paper are two subsets of the DeepFashion dataset, namely the In-shop Clothes Retrieval Benchmark and the Consumer-to-shop Clothes Retrieval Benchmark.

The evaluation of clothing retrieval
There are many clothing databases, but the commonly used evaluation metrics in clothing retrieval are precision, MAP, and Top-k accuracy.
1. Precision is shown in Eq. (4):

   P = A / B    (4)

   where A is the number of similar clothing items in the returned results and B is the total number of returned results. Precision thus measures the proportion of correct results among all the results returned by the retrieval model.
2. Although precision statistically evaluates the proportion of correct results, it ignores the positions of the correct results in the ranking. The MAP value evaluates this position information, as shown in Eq. (5):

   MAP = (1/Q) * sum_{q=1}^{Q} AP(q)    (5)

   where Q is the number of query clothing images and AP(q) is the average precision of query q as precision varies with recall, i.e. the area under the P-R curve. MAP reflects the overall performance of a retrieval method, but gives little insight into the details of individual retrieval results.
3. The accuracy of cross-domain clothing retrieval is the most commonly used evaluation criterion, generally computed with the Top-k protocol shown in Eq. (6):

   Acc@k = (1/Q) * sum_{q=1}^{Q} hit(q, k)    (6)

   where Q is the number of query clothing images and q denotes a specific query image. If at least one clothing image in the Top-k list matches image q, hit(q, k) is set to 1; otherwise it is set to 0.
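The three metrics can be sketched as follows; the `average_precision` helper shows how the per-query AP averaged in Eq. (5) is computed from a ranked binary relevance list.

```python
def precision(retrieved_relevant, retrieved_total):
    """Eq. (4): A correct results out of B returned results."""
    return retrieved_relevant / retrieved_total

def top_k_accuracy(hits):
    """Eq. (6): fraction of the Q queries with at least one match in the
    Top-k list; `hits` holds hit(q, k) for each query."""
    return sum(hits) / len(hits)

def average_precision(relevance):
    """AP for one query from a binary relevance list ordered by rank;
    averaging AP over all queries gives the MAP of Eq. (5)."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0

p = precision(3, 10)                       # 0.3
acc = top_k_accuracy([1, 0, 1, 1])         # 0.75
ap = average_precision([1, 0, 1, 0])       # (1/1 + 2/3) / 2
print(p, acc, ap)
```

Averaging `average_precision` over all queries yields MAP, and `top_k_accuracy` reproduces the Top-k criterion used in Tables 4 and 5.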

Analysis of the results of the experiment
Summarizing the deep-learning-based clothing image retrieval models of recent years, they mainly address several difficulties of the cross-domain setting. Tables 4 and 5 show models based on the two ideas of clothing critical region recognition and deep metric learning, respectively. In these tables, "Y" and "N" indicate whether the algorithm uses the corresponding attribute and landmark annotations. The tables show that network models based on critical region recognition place higher demands on clothing attribute labeling and generally use supervised learning. Even though the attention mechanism greatly reduces the need for landmark annotations and some weakly supervised networks have been proposed, some clothing attribute annotations are still required. In contrast, most network models based on deep metric learning need neither landmark nor clothing attribute annotations, because deep metric learning works on the features of the image itself: it mines difficult samples, strengthens the discrimination between extracted clothing features, and thereby extracts the important, discriminative clothing features. At present, most popular clothing retrieval networks are built on the DeepFashion database, which has two subsets, the Consumer-to-shop Benchmark and the In-shop Benchmark, whose application scenarios are cross-domain and same-domain clothing image retrieval, respectively. Figure 10 shows the performance of the critical region recognition idea in solving the cross-domain clothing retrieval problem. The figure shows that different clothing critical region recognition algorithms have a considerable impact on retrieval performance; among them, clothing landmark recognition and attention map recognition achieve higher retrieval accuracy.
However, FashionNet and Match R-CNN, which use clothing landmark recognition, rely too heavily on clothing attribute and landmark annotation information during retrieval. The attention map recognition method handles this problem better: it can still achieve good retrieval accuracy without clothing landmark annotations, which provides new ideas for cross-domain clothing image retrieval. Figure 11 shows the performance of several deep metric learning-based clothing image retrieval models on different datasets. Comparing the performance in (a) and (b), we find that deep metric learning performs better in same-domain clothing image retrieval than in cross-domain retrieval, because same-domain clothing images are less affected by the external environment. Considering mainly the inherent attributes of clothing images, deep metric learning can combine different loss functions with network models to achieve better matching of similar clothing. For cross-domain clothing image retrieval, however, deep metric learning using contrastive loss, triplet loss and its variants, and ensemble learning must consider the influence of background and other factors and must mine difficult samples. At present, ensemble learning is widely used and is very useful: mining samples of different difficulty levels improves the accuracy of cross-domain clothing retrieval.

(Fig. 10: The retrieval results on Consumer-to-shop benchmark)
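To make the metric-learning idea concrete, the following is a minimal sketch of a triplet loss with difficult-sample (hard-negative) mining on embedding vectors. It illustrates the general technique only; it is not the implementation of any specific model surveyed here, and the function names and margin value are our own illustrative choices.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors (lists of floats)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a,p) - d(a,n) + margin): pulls the anchor toward a positive
    sample (same clothing item) and pushes it away from a negative one
    (different item), until they are separated by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def hardest_negative(anchor, negatives):
    """Difficult-sample mining: the negative embedding closest to the anchor
    yields the most informative (hardest) triplet for training."""
    return min(negatives, key=lambda n: euclidean(anchor, n))
```

A negative that is already far from the anchor contributes zero loss, which is why mining hard negatives, rather than sampling them at random, is what drives the discriminative power of deep metric learning.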
Clothing image retrieval requires feature extraction and similarity matching, and different research methods have different focuses. Figure 12 and Table 6 show the performance of deep network models in cross-domain clothing retrieval in recent years. The overall effect of deep metric learning is not as good as that of clothing critical region recognition in solving cross-domain retrieval problems, indicating that the main problem to be solved in cross-domain clothing retrieval is the recognition of important clothing regions in the image. This is an important step when using convolutional neural networks to extract features; combining the extracted features with clothing attributes can then achieve better retrieval results. Therefore, in the future, critical region recognition and deep metric learning can be combined into a new algorithm that requires no additional annotation information and achieves better cross-domain clothing image retrieval accuracy.

(Fig. 12: The retrieval results on Consumer-to-shop benchmark)

Conclusion
This paper reviews cross-domain clothing retrieval methods. First, it analyzes the common critical region recognition and deep metric learning methods in cross-domain clothing retrieval. The research results show that the attention map recognition method not only saves time and cost but also further improves the effect of clothing retrieval, and that deep metric learning is widely used and has achieved good results in both same-domain and cross-domain clothing retrieval. Finally, we find that critical region recognition can extract more important clothing detail features, while deep metric learning makes the extracted features more discriminative; both affect the effect of cross-domain clothing retrieval.

(Fig. 11: The retrieval results of different models under different databases)
In summary, although cross-domain clothing retrieval has achieved good results using clothing critical region recognition and deep metric learning methods, many issues remain to be solved, mainly including:

1. Attribute labeling problem: Most deep network models need the assistance of clothing attribute labels, i.e., supervised or weakly supervised learning. This requires extensive clothing labeling, which is time-consuming and labor-intensive. How to reduce clothing attribute labels, save costs, and still improve accuracy needs further research.

2. Model complexity problem: In recent years, research on cross-domain clothing image retrieval tasks has mainly focused on ensemble methods. Although better results have been achieved, the long training time and memory cost brought by model ensembles are difficult to resolve. How to reduce the high model complexity of ensemble learning while ensuring the retrieval effect is therefore a big challenge.

3. Clothing databases: At present, clothing databases contain different types of clothing distinguished by category, such as dresses, jeans, and shirts. However, with the development of the fashion industry, different clothing combinations can produce ever-changing clothing styles, such as sporty, Japanese, and punk. The retrieval of clothing styles will be another important research direction in the future.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.