1 Introduction

Fashion constitutes a significant portion of the rapidly growing online shopping and social media markets [2, 9]. With this growth, visual fashion analysis has received considerable research attention [1, 15, 16, 19, 24] and has been successfully deployed by large e-commerce companies and websites [13, 23, 29] such as eBay, Amazon, Pinterest, and Flipkart. One of the most important aspects of visual fashion analysis is fashion search. This paper presents a new deep learning based fashion search framework built on a fusion of multitask and metric learning.

With the recent advances in deep learning, end-to-end metric learning methods for visual similarity measurement have been proposed [3, 22]. The main task is to learn a discriminative feature space for image representation. Particularly for fashion search, the feature space should incorporate various elements of fashion. As the fashion domain exhibits a huge diversity of visual concepts across designs, styles, and brands, there exist tiers of similarity for fashion clothing. Visual fashion similarity can be defined based on various concepts such as categories (e.g. dress, hoodie), brands (e.g. Adidas, Nike), attributes (e.g. color, pattern), or design (e.g. cropped, zippered). Figure 1 illustrates tiers of similarity for clothing images. Clothing (A) is the exact same item (brand, model, category, color, etc.) as the reference clothing (R) and hence lies closest, within the inner circle. Clothing (B) shares the same model as clothing (R) but differs in color, and hence is second nearest to (R). Similarly, clothing items (C), (D), and (E) lie farther away. We aim to learn such a tiered feature space, as this provides the desired retrieval outcome for practical fashion search applications.

Fig. 1. Example of tiers of similarity in feature space for fashion clothing. Different tiers of similarity are denoted by the dotted concentric circles. Distances between the reference clothing (R) and clothing (A)–(E) represent the degrees of visual similarity.

Deep metric learning has demonstrated huge success in learning visual similarity [3, 8, 11, 22, 28]. Siamese networks [5, 8, 28] and triplet networks [22, 27] are the most popular models for metric learning, with the latter reported to be better [10, 22]. Although successful, the existing triplet-based methods [3, 10, 22] have a few limitations. First, they require exact instance/ID-level annotations and do not perform well with weak label annotations, e.g., category labels (shown in Sect. 3). Second, these methods employ hard binary decisions during triplet selection and treat the selected triplets with equal importance, which restricts learning tiers of similarity.

To learn a discriminative feature space, researchers have combined metric learning with auxiliary information using multitask networks, which have achieved better performance for face identification and recognition [6, 22, 30], person re-identification [14, 18], and clothing search [12]. Particularly for fashion representation, multitask learning with attribute information is used in [12, 24]. Where-To-Buy-It (WTBI) [15] used pre-trained features and learned a similarity network using Siamese networks. Recently, FashionNet [19] proposed to jointly optimize classification, attribute prediction, and triplet losses for fashion search. However, these methods do not explore the possible interactions between the tasks and hence do not effectively learn the tiered similarity space required for fashion search.

In view of this, we propose a new attribute-guided metric learning (AGML) framework using multitask learning for fashion search. The proposed framework exploits the interactions between the attribute prediction and triplet networks by jointly training them. This has two major advantages over the existing methods. First, it helps in mining informative triplets, especially when exact anchor-positive pair annotations are not available. Second, training samples are treated according to their importance in a soft manner, which helps in capturing the multiple tiers of similarity required for fashion search. We demonstrate its effectiveness for fashion search using a new BrandFashion dataset. Compared to the existing fashion datasets [4, 7, 19], this dataset is richly annotated with essential elements of fashion, including clothing categories, attributes, and brand information, which capture different tiers of information in fashion.

2 Proposed Method

The architecture of the proposed framework is shown in Fig. 2. It consists of three identical CNNs with shared parameters \(\theta \) and accepts image triplets \(\left\{ x^a, x^p, x^n \right\} \), i.e. an anchor image \((x^a)\), a positive image \((x^p)\) from the same class as the anchor, and a negative image \((x^n)\) from a different class. The last fully connected layer has two branches for learning the feature embedding f(x) and the attribute vector \( \mathbf {v} \). The guiding signal links the two tasks and drives triplet sampling based on the importance of the samples. The network is trained end-to-end using the loss

$$\begin{aligned} L_{total}(\theta ) = L^{G}_{tri}(\theta ) + \lambda L_{attr}(\theta ) \end{aligned}$$
(1)

where \(L^{G}_{tri}(\theta )\) and \(L_{attr}(\theta )\) represent the attribute-guided triplet loss and the attribute loss respectively, and \(\lambda \) balances the contribution of the two losses.
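For concreteness, the following is a minimal PyTorch sketch of the shared-backbone network with its two branches, under stated assumptions: the class name `AGMLNet` and the embedding dimension are illustrative, and the head layout is our reading of Fig. 2 rather than a released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class AGMLNet(nn.Module):
    """Shared-parameter CNN with two heads: the embedding f(x) and the
    attribute vector v. The same module processes x^a, x^p, and x^n."""
    def __init__(self, embed_dim=128, num_attrs=32):  # embed_dim is an assumption
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")   # VGG16 backbone (Sect. 3)
        self.backbone = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1])    # up to the last FC layer
        self.embed_head = nn.Linear(4096, embed_dim)  # branch 1: f(x)
        self.attr_head = nn.Linear(4096, num_attrs)   # branch 2: logits for v

    def forward(self, x):
        h = self.backbone(x)
        f = F.normalize(self.embed_head(h), dim=1)    # L2-normalized feature (Sect. 3)
        v = torch.sigmoid(self.attr_head(h))          # attribute probabilities in [0, 1]
        return f, v
```

The total loss of Eq. (1) then combines the two branch losses, e.g. `loss = l_tri_guided + lam * l_attr`, with \(\lambda = 1\) as in Sect. 3.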

Fig. 2. Architecture of the proposed attribute-guided triplet network.

2.1 Attribute Prediction Network

We use K semantic attributes to describe the image appearance, denoted \( \mathbf {a} = [ a_1, a_2, \dots , a_K ]\), where each element \(a_i \in \left\{ 0,1 \right\} \) indicates the presence or absence of the \(i^{th}\) attribute. The problem of attribute prediction is treated as multilabel classification. We pass the first branch of the last fully connected layer through a sigmoid layer to squash the output to [0, 1] and obtain \( \mathbf {v} \). The attribute prediction is optimized using the binary cross-entropy loss \(L_{attr}(\theta ) = -\sum _{i=1}^K \left[ a_i \log (v_i) + \, (1-a_i) \log (1- v_i) \right] \), where \(a_i\) is the binary target attribute label for image x, and \(v_i\) is a component of the predicted attribute distribution \( \mathbf {v} = [v_1, v_2, \dots , v_K]\).
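A minimal sketch of this loss in PyTorch, assuming per-image sums are averaged over the batch (the paper states the per-image form only):

```python
import torch.nn.functional as F

def attribute_loss(v, a):
    """Binary cross-entropy over K attributes (Sect. 2.1).
    v: predicted probabilities, shape (N, K); a: binary targets, shape (N, K).
    The inner sum matches L_attr; batch averaging is our assumption."""
    return F.binary_cross_entropy(v, a.float(), reduction="sum") / v.size(0)
```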

2.2 Attribute Guided Triplet Training

We use the predicted attribute vectors to guide both triplet mining and triplet loss training.

A. Triplet Mining

Random triplet sampling based on class/ID labels does not ensure the selection of the most informative examples for training. This is especially critical when only category information is available. For effective training, anchor-positive pairs should be reliable. Therefore, we propose to leverage the cosine similarity \(\langle \mathbf {x} , \mathbf {y} \rangle = \frac{ \mathbf {x} ^\intercal \mathbf {y} }{\Vert \mathbf {x} \Vert _2 \Vert \mathbf {y} \Vert _2}\) between the anchor-positive attribute vectors (outputs of the attribute prediction network) to sample better triplets. In particular, we use a threshold \((\varPhi )\) such that only the triplets with \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle > \varPhi \) are selected for training. This ensures that the anchor-positive pairs are similar in attribute space and hence are reliable.
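A minimal sketch of this mining step, assuming batched attribute vectors and \(\varPhi = 0.7\) (the value used in Sect. 3):

```python
import torch.nn.functional as F

def mine_reliable_triplets(v_a, v_p, phi=0.7):
    """Keep only triplets whose anchor-positive attribute vectors satisfy
    <v^a, v^p> > phi (Sect. 2.2-A). v_a, v_p: shape (N, K).
    Returns a boolean mask over the N candidate triplets."""
    return F.cosine_similarity(v_a, v_p, dim=1) > phi
```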

B. Attribute Guided Triplet Training

We propose two ways to guide the triplet metric learning network. The first weights the whole triplet loss, while the second operates on the margin parameter of the loss function. Let \( \left\{ x^a, x^p, x^n \right\} \) be an input triplet sample and \(\left\{ f(x^a), f(x^p), f(x^n)\right\} \) be the corresponding embeddings. The proposed attribute-guided triplet loss is given by

$$\begin{aligned} L^{G}_{tri}(\theta ) = w(a,p,n) \left[ \Vert f(x^a) - f(x^p) \Vert _2^2 - \Vert f(x^a) - f(x^n) \Vert _2^2 + m(a,p,n)\right] _+\,, \end{aligned}$$
(2)

where \(w(a,p,n)\) and \(m(a,p,n)\) are the loss weighting factor and the margin factor, which are functions of the attribute distributions \(\left\{ \mathbf {v} ^a, \mathbf {v} ^p, \mathbf {v} ^n \right\} \), as explained below.
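A minimal sketch of Eq. (2), assuming the hinge is averaged over a batch of mined triplets (the reduction is not specified in the paper):

```python
import torch

def guided_triplet_loss(f_a, f_p, f_n, w, m):
    """Attribute-guided triplet loss of Eq. (2).
    f_*: embeddings, shape (N, D); w, m: per-triplet factors, shape (N,)."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x^a) - f(x^p)||_2^2
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x^a) - f(x^n)||_2^2
    return (w * torch.clamp(d_ap - d_an + m, min=0.0)).mean()  # [.]_+ hinge
```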

B1. Soft-Weighted (SW) Triplet Loss

The SW triplet loss operates on the overall loss through the weight factor \(w(a,p,n)\). We use \(w(a,p,n) = \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \), the product of the similarities between the attribute vectors of the anchor-positive and anchor-negative pairs. This function adaptively alters the magnitude of the triplet loss. When the anchor-positive pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^p \rangle \) is high), the sample is more confident and reliable. Likewise, when the anchor-negative pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \) is high), it forms a hard negative example, i.e. it carries high information. Hence, the triplet is given higher priority and more attention during the network update. This is analogous to hard negative mining [22], but we handle it in a soft manner.
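A minimal sketch of the weight factor; detaching it so that it rescales, but does not itself receive, gradients is our assumption, as the paper does not specify this:

```python
import torch.nn.functional as F

def sw_weight(v_a, v_p, v_n):
    """Soft weight w(a,p,n) = <v^a, v^p> * <v^a, v^n> (Sect. B1).
    The .detach() (treating w as a constant per update) is an assumption."""
    return (F.cosine_similarity(v_a, v_p, dim=1)
            * F.cosine_similarity(v_a, v_n, dim=1)).detach()
```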

B2. Soft-Margin (SM) Triplet Loss

The SM triplet loss operates on the margin parameter through \(m(a,p,n)\). The naive triplet loss uses a constant margin m, which treats all triplets equally and restricts learning the desired tiered similarity. The soft margin is an adaptive margin \( m(a,p,n) = m_0\log (1+ \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle )\), which promotes a tiered similarity space. Similar to the SW triplet loss, when both \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \) and \(\langle \mathbf {v} ^{a}, \mathbf {v} ^n \rangle \) are high, the triplet is more reliable and informative (a hard negative), and hence the effective margin becomes larger. In other words, when the negative image and the anchor image are very similar in attribute space, a larger margin is used to learn the subtle difference and avoid confusion. Hence, both the SW and SM triplet losses exploit the importance of the triplets, which helps in learning a tiered similarity space.
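A minimal sketch of the margin factor, with \(m_0 = 0.8\) as set in Sect. 3:

```python
import torch
import torch.nn.functional as F

def sm_margin(v_a, v_p, v_n, m0=0.8):
    """Soft margin m(a,p,n) = m0 * log(1 + <v^a, v^p><v^a, v^n>) (Sect. B2)."""
    s = (F.cosine_similarity(v_a, v_p, dim=1)
         * F.cosine_similarity(v_a, v_n, dim=1))
    return m0 * torch.log1p(s)  # log1p(s) = log(1 + s)
```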

3 Experiments

We collected a new BrandFashion dataset with about 10K clothing images with distinctive logos from 15 brands. The images are categorized into 16 clothing categories and annotated with 32 semantic attributes. The goal is to demonstrate the tiered similarity space using the category, brand, and attribute annotations. There are 50 query images in the dataset. We evaluated the performance of instance search using mean average precision (mAP) and the performance of tiered similarity search using the normalized discounted cumulative gain, \(NDCG@k = \frac{1}{Z}\sum _{i=1}^k \frac{2^{r(i)}-1}{\log _2(1+i)}\). The relevance score of the \(i^{th}\) ranked image is calculated based on similarity matches over three levels of information, namely category, brand, and attributes, i.e. \(r(i) = r_i^{cat} + r_i^{brand} + r_i^{attr}\), where \(r_i^{cat} \in \left\{ 0,1 \right\} \) and \(r_i^{brand}\in \left\{ 0,1 \right\} \). The attribute match \(r_i^{attr}\) is computed as the ratio of the number of matched attributes to the total number of query attributes [12]. Overall, the relevance score summarizes the tiered similarity search performance.
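A minimal sketch of this metric, where `Z` is the ideal-DCG normalizer computed from the best possible ranking:

```python
import math

def ndcg_at_k(relevance, k, Z):
    """NDCG@k with r(i) = r_cat(i) + r_brand(i) + r_attr(i);
    relevance[i-1] holds r(i) for the i-th ranked image."""
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevance[:k], start=1)) / Z
```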

We used VGG16 [25] as the base CNN, trained using the loss defined in Eq. (1) with SGD (momentum 0.5) and a learning rate of 0.001. We set \(\lambda \) to 1. The value of the constant margin m is experimentally set to 0.5; for the SM triplet loss, \(m_0\) is set to 0.8 such that the effective margin swings around the original value. We set the threshold \(\varPhi \) to 0.7 and observe that the performance is fairly stable for \(\varPhi \in [0.5,0.9]\).

Items from the same category and brand are sampled for the anchor-positive pairs, and items from different categories or brands constitute the negative samples. The \(L_2\)-normalized feature from the last fully connected layer is used as the feature vector. We used PyTorch [20] for the implementation. Similar to [15, 19, 23], we crop out the clothing region prior to feature extraction. We used Faster R-CNN [21] to jointly detect the brand logo and clothing items in the images.

Table 1. Comparison of the proposed method with state-of-the-art methods

Table 1 compares the performance of the different methods in terms of mAP and NDCG@20. In terms of mAP, the naive triplet loss achieves 33.8%, while the multitask network (triplet + attribute) achieves 56.4%. This shows a clear benefit of incorporating auxiliary information through multitask metric learning. The proposed method additionally guides the triplet loss using the predicted attributes. The proposed AGML-SW and AGML-SM achieve mAPs of 63.79% and 63.71% respectively, demonstrating the advantage of the attribute-guided triplet loss. The proposed method clearly outperforms the deep feature encoding based methods [13, 17, 26] and state-of-the-art metric learning methods [15, 19, 23].

A similar trend can be observed for NDCG in Table 1. The proposed attribute-guided SW and SM triplet networks achieve NDCG@20 of 83.66% and 85.12% respectively. Our method clearly outperforms the other state-of-the-art methods, which demonstrates the advantage of learning a tiered similarity space. We further take advantage of logo detection to re-rank the retrieval results. The proposed method achieves mAP \(\approx \) 71% and NDCG@20 \(\approx \) 96% with re-ranking based on detected brand logo information. Figure 3 shows example search results obtained using WTBI [15], FashionNet [19], and the proposed method, further demonstrating the advantage of the proposed method.

Fig. 3. Sample search results with query images and top-5 retrieved images. Exact same-instance matches are highlighted with green borders. Best viewed in color.

4 Conclusions

We presented a new deep attribute-guided triplet network that exploits the importance of training samples and learns a tiered similarity space. The method uses a multitask CNN that shares mutual information among the tasks to better tune the loss. Using the predicted attributes, the proposed method first mines informative triplets and then uses them to train the triplet loss in a soft manner, which helps in capturing the tiered similarity desirable for fashion search. We believe that tiered similarity search will be appreciated by fashion companies, online retailers, and customers alike.