1 Introduction

Fashion constitutes a significant portion of the rapidly growing online shopping and social media markets [2, 9]. With this growth, visual fashion analysis has received considerable research attention [1, 15, 16, 19, 24] and has been successfully deployed by large e-commerce companies and websites [13, 23, 29] such as eBay, Amazon, Pinterest, and Flipkart. One of the most important aspects of visual fashion analysis is fashion search. This paper presents a new deep learning based fashion search framework built on a fusion of multitask and metric learning.

With the recent advances in deep learning, end-to-end metric learning methods for visual similarity measurement have been proposed [3, 22]. The main task is to learn a discriminative feature space for image representation. Particularly for fashion search, the feature space should incorporate various elements of fashion. As the fashion domain exhibits a huge diversity of visual concepts across designs, styles, and brands, there exist tiers of similarity for fashion clothing. Visual fashion similarity can be defined based on various concepts such as categories (e.g. dress, hoodie), brands (e.g. Adidas, Nike), attributes (e.g. color, pattern), or design (e.g. cropped, zippered). Figure 1 illustrates tiers of similarity for clothing images. Clothing (A) is the exact same item (brand, model, category, color, etc.) as the reference clothing (R) and hence lies closest, within the inner circle. Clothing (B) shares the same model as clothing (R) but differs in color, and hence is second nearest to (R). Similarly, clothing items (C), (D), and (E) lie farther away. We aim to learn such a tiered feature space, as this provides the desired retrieval outcome for practical fashion search applications.

Fig. 1. Example of tiers of similarity in feature space for fashion clothing. Different tiers of similarity are denoted by the dotted concentric circles. Distances between the reference clothing (R) and clothing (A)–(E) represent the degrees of visual similarity.

Deep metric learning has demonstrated huge success in learning visual similarity [3, 8, 11, 22, 28]. Siamese networks [5, 8, 28] and triplet networks [22, 27] are the most popular models for metric learning, with the latter reported to be better [10, 22]. Although successful, the existing triplet-based methods [3, 10, 22] have a few limitations. First, they require exact instance/ID-level annotations and do not perform well with weak label annotations, e.g., category labels (shown in Sect. 3). Second, these methods employ hard binary decisions during triplet selection and treat the selected triplets with equal importance, which restricts learning tiers of similarity.

To learn a discriminative feature space, researchers have combined metric learning with auxiliary information using multitask networks, which have achieved better performance for face identification and recognition [6, 22, 30], person re-identification [14, 18], and clothing search [12]. Particularly for fashion representation, multitask learning with attribute information is used in [12, 24]. Where-To-Buy-It (WTBI) [15] used pre-trained features and learned a similarity network using Siamese networks. Recently, FashionNet [19] proposed to jointly optimize classification, attribute prediction, and triplet losses for fashion search. However, these methods do not explore the possible interactions between the tasks and hence do not effectively learn the tiered similarity space required for fashion search.

In view of this, we propose a new attribute-guided metric learning (AGML) framework using multitask learning for fashion search. The proposed framework exploits the interactions between the attribute prediction and triplet networks by jointly training them. This has two major advantages over the existing methods. First, it helps in mining informative triplets, especially when exact anchor-positive pair annotations are not available. Second, training samples are treated according to their importance in a soft manner, which helps in capturing the multiple tiers of similarity required for fashion search. We demonstrate its effectiveness for fashion search using a new BrandFashion dataset. Compared to the existing fashion datasets [4, 7, 19], this dataset is richly annotated with essential elements of fashion, including clothing categories, attributes, and brand information, which capture different tiers of information in fashion.

2 Proposed Method

The architecture of the proposed framework is shown in Fig. 2. It consists of three identical CNNs with shared parameters \(\theta \) and accepts image triplets \(\left\{ x^a, x^p, x^n \right\} \), i.e. an anchor image \((x^a)\), a positive image \((x^p)\) from the same class as the anchor, and a negative image \((x^n)\) from a different class. The last fully connected layer has two branches for learning the feature embedding f(x) and the attribute vector \( \mathbf {v} \). The guiding signal links the two tasks and drives triplet sampling based on the importance of the samples. The network is trained end-to-end using the loss

$$\begin{aligned} L_{total}(\theta ) = L^{G}_{tri}(\theta ) + \lambda L_{attr}(\theta ) \end{aligned}$$
(1)

where \(L^{G}_{tri}(\theta )\) and \(L_{attr}(\theta )\) represent the attribute-guided triplet loss and the attribute loss respectively, and \(\lambda \) balances the contribution of the two losses.
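For concreteness, the following is a minimal PyTorch sketch of the shared-backbone network with its two branches, under stated assumptions: the class name `AGMLNet` and the embedding dimension are illustrative, and the head layout is our reading of Fig. 2 rather than a released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class AGMLNet(nn.Module):
    """Shared-parameter CNN with two heads: the embedding f(x) and the
    attribute vector v. The same module processes x^a, x^p, and x^n."""
    def __init__(self, embed_dim=128, num_attrs=32):  # embed_dim is an assumption
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")   # VGG16 backbone (Sect. 3)
        self.backbone = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1])    # up to the last FC layer
        self.embed_head = nn.Linear(4096, embed_dim)  # branch 1: f(x)
        self.attr_head = nn.Linear(4096, num_attrs)   # branch 2: logits for v

    def forward(self, x):
        h = self.backbone(x)
        f = F.normalize(self.embed_head(h), dim=1)    # L2-normalized feature (Sect. 3)
        v = torch.sigmoid(self.attr_head(h))          # attribute probabilities in [0, 1]
        return f, v
```

The total loss of Eq. (1) then combines the two branch losses, e.g. `loss = l_tri_guided + lam * l_attr`, with \(\lambda = 1\) as in Sect. 3.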

Fig. 2. Architecture of the proposed attribute-guided triplet network.

2.1 Attribute Prediction Network

We use K semantic attributes to describe the image appearance, denoted \( \mathbf {a} = [ a_1, a_2, \dots , a_K ]\), where each element \(a_i \in \left\{ 0,1 \right\} \) indicates the presence or absence of the \(i^{th}\) attribute. The problem of attribute prediction is treated as multilabel classification. We pass the first branch of the last fully connected layer through a sigmoid layer to squash the output to [0, 1] and obtain \( \mathbf {v} \). The attribute prediction is optimized using the binary cross-entropy loss \(L_{attr}(\theta ) = -\sum _{i=1}^K \left[ a_i \log (v_i) + \, (1-a_i) \log (1- v_i) \right] \), where \(a_i\) is the binary target attribute label for image x, and \(v_i\) is a component of the predicted attribute distribution \( \mathbf {v} = [v_1, v_2, \dots , v_K]\).
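A minimal sketch of this loss in PyTorch, assuming per-image sums are averaged over the batch (the paper states the per-image form only):

```python
import torch.nn.functional as F

def attribute_loss(v, a):
    """Binary cross-entropy over K attributes (Sect. 2.1).
    v: predicted probabilities, shape (N, K); a: binary targets, shape (N, K).
    The inner sum matches L_attr; batch averaging is our assumption."""
    return F.binary_cross_entropy(v, a.float(), reduction="sum") / v.size(0)
```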

2.2 Attribute Guided Triplet Training

We use the predicted attribute vectors to guide both triplet mining and triplet loss training.

A. Triplet Mining

Random triplet sampling based on class/ID labels does not ensure the selection of the most informative examples for training. This is especially critical when only category information is available. For effective training, anchor-positive pairs should be reliable. Therefore, we propose to leverage the cosine similarity \(\langle \mathbf {x} , \mathbf {y} \rangle = \frac{ \mathbf {x} ^\intercal \mathbf {y} }{\Vert \mathbf {x} \Vert _2 \Vert \mathbf {y} \Vert _2}\) between the anchor-positive attribute vectors (outputs of the attribute prediction network) to sample better triplets. In particular, we use a threshold \((\varPhi )\) such that only the triplets with \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle > \varPhi \) are selected for training. This ensures that the anchor-positive pairs are similar in attribute space and hence are reliable.
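A minimal sketch of this mining step, assuming batched attribute vectors and \(\varPhi = 0.7\) (the value used in Sect. 3):

```python
import torch.nn.functional as F

def mine_reliable_triplets(v_a, v_p, phi=0.7):
    """Keep only triplets whose anchor-positive attribute vectors satisfy
    <v^a, v^p> > phi (Sect. 2.2-A). v_a, v_p: shape (N, K).
    Returns a boolean mask over the N candidate triplets."""
    return F.cosine_similarity(v_a, v_p, dim=1) > phi
```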

B. Attribute Guided Triplet Training

We propose two ways to guide the triplet metric learning network. The first weights the whole triplet loss, while the second operates on the margin parameter of the loss function. Let \( \left\{ x^a, x^p, x^n \right\} \) be an input triplet sample and \(\left\{ f(x^a), f(x^p), f(x^n)\right\} \) be the corresponding embeddings. The proposed attribute-guided triplet loss is given by

$$\begin{aligned} L^{G}_{tri}(\theta ) = w(a,p,n) \left[ \Vert f(x^a) - f(x^p) \Vert _2^2 - \Vert f(x^a) - f(x^n) \Vert _2^2 + m(a,p,n)\right] _+\,, \end{aligned}$$
(2)

where \(w(a,p,n)\) and \(m(a,p,n)\) are the loss weighting factor and the margin factor, which are functions of the attribute distributions \(\left\{ \mathbf {v} ^a, \mathbf {v} ^p, \mathbf {v} ^n \right\} \), as explained below.
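A minimal sketch of Eq. (2), assuming the hinge is averaged over a batch of mined triplets (the reduction is not specified in the paper):

```python
import torch

def guided_triplet_loss(f_a, f_p, f_n, w, m):
    """Attribute-guided triplet loss of Eq. (2).
    f_*: embeddings, shape (N, D); w, m: per-triplet factors, shape (N,)."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x^a) - f(x^p)||_2^2
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x^a) - f(x^n)||_2^2
    return (w * torch.clamp(d_ap - d_an + m, min=0.0)).mean()  # [.]_+ hinge
```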

B1. Soft-Weighted (SW) Triplet Loss

The SW triplet loss operates on the overall loss through the weight factor \(w(a,p,n)\). We use \(w(a,p,n) = \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \), the product of the similarities between the attribute vectors of the anchor-positive and anchor-negative pairs. This function adaptively alters the magnitude of the triplet loss. When the anchor-positive pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^p \rangle \) is high), the sample is more confident and reliable. Likewise, when the anchor-negative pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \) is high), it forms a hard negative example, i.e. it carries high information. Hence, the triplet is given higher priority and more attention during the network update. This is analogous to hard negative mining [22], but we handle it in a soft manner.
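A minimal sketch of the weight factor; detaching it so that it rescales, but does not itself receive, gradients is our assumption, as the paper does not specify this:

```python
import torch.nn.functional as F

def sw_weight(v_a, v_p, v_n):
    """Soft weight w(a,p,n) = <v^a, v^p> * <v^a, v^n> (Sect. B1).
    The .detach() (treating w as a constant per update) is an assumption."""
    return (F.cosine_similarity(v_a, v_p, dim=1)
            * F.cosine_similarity(v_a, v_n, dim=1)).detach()
```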

B2. Soft-Margin (SM) Triplet Loss

The SM triplet loss operates on the margin parameter through \(m(a,p,n)\). The naive triplet loss uses a constant margin m, which treats all triplets equally and restricts learning the desired tiered similarity. The soft margin is an adaptive margin \( m(a,p,n) = m_0\log (1+ \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle )\), which promotes a tiered similarity space. Similar to the SW triplet loss, when both \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \) and \(\langle \mathbf {v} ^{a}, \mathbf {v} ^n \rangle \) are high, the triplet is more reliable and informative (a hard negative), and hence the effective margin becomes larger. In other words, when the negative image and the anchor image are very similar in attribute space, a larger margin is used to learn the subtle difference and avoid confusion. Hence, both the SW and SM triplet losses exploit the importance of the triplets, which helps in learning a tiered similarity space.
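A minimal sketch of the margin factor, with \(m_0 = 0.8\) as set in Sect. 3:

```python
import torch
import torch.nn.functional as F

def sm_margin(v_a, v_p, v_n, m0=0.8):
    """Soft margin m(a,p,n) = m0 * log(1 + <v^a, v^p><v^a, v^n>) (Sect. B2)."""
    s = (F.cosine_similarity(v_a, v_p, dim=1)
         * F.cosine_similarity(v_a, v_n, dim=1))
    return m0 * torch.log1p(s)  # log1p(s) = log(1 + s)
```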

3 Experiments

We collected a new BrandFashion dataset with about 10K clothing images with distinctive logos from 15 brands. The images are categorized into 16 clothing categories and annotated with 32 semantic attributes. The goal is to demonstrate the tiered similarity space using the category, brand, and attribute annotations. There are 50 query images in the dataset. We evaluated the performance of instance search using mean average precision (mAP) and the performance of tiered similarity search using the normalized discounted cumulative gain, \(NDCG@k = \frac{1}{Z}\sum _{i=1}^k \frac{2^{r(i)}-1}{\log _2(1+i)}\). The relevance score of the \(i^{th}\) ranked image is calculated based on similarity matches over three levels of information, namely category, brand, and attributes, i.e. \(r(i) = r_i^{cat} + r_i^{brand} + r_i^{attr}\), where \(r_i^{cat} \in \left\{ 0,1 \right\} \) and \(r_i^{brand}\in \left\{ 0,1 \right\} \). The attribute match \(r_i^{attr}\) is computed as the ratio of the number of matched attributes to the total number of query attributes [12]. Overall, the relevance score summarizes the tiered similarity search performance.
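A minimal sketch of this metric, where `Z` is the ideal-DCG normalizer computed from the best possible ranking:

```python
import math

def ndcg_at_k(relevance, k, Z):
    """NDCG@k with r(i) = r_cat(i) + r_brand(i) + r_attr(i);
    relevance[i-1] holds r(i) for the i-th ranked image."""
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevance[:k], start=1)) / Z
```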

We used VGG16 [25] as the base CNN, trained using the loss defined in Eq. (1) with SGD (momentum 0.5) and a learning rate of 0.001. We set \(\lambda \) to 1. The value of the constant margin m is experimentally set to 0.5; for the SM triplet loss, \(m_0\) is set to 0.8 such that the effective margin swings around the original value. We set the threshold \(\varPhi \) to 0.7 and observe that the performance is fairly stable for \(\varPhi \in [0.5,0.9]\).

Items from the same category and brand are sampled for the anchor-positive pairs, and items from different categories or brands constitute the negative samples. The \(L_2\)-normalized feature from the last fully connected layer is used as the feature vector. We used PyTorch [20] for the implementation. Similar to [15, 19, 23], we crop out the clothing region prior to feature extraction. We used Faster R-CNN [21] to jointly detect the brand logo and clothing items in the images.

Table 1. Comparison of the proposed method with state-of-the-art methods

Table 1 compares the performance of the different methods in terms of mAP and NDCG@20. In terms of mAP, the naive triplet loss achieves 33.8%, while the multitask network (triplet + attribute) achieves 56.4%. This shows a clear benefit of incorporating auxiliary information through multitask metric learning. The proposed method additionally guides the triplet loss using the predicted attributes. The proposed AGML-SW and AGML-SM achieve mAPs of 63.79% and 63.71% respectively, demonstrating the advantage of the attribute-guided triplet loss. The proposed method clearly outperforms the deep feature encoding based methods [13, 17, 26] and state-of-the-art metric learning methods [15, 19, 23].

A similar trend can be observed for NDCG in Table 1. The proposed attribute-guided SW and SM triplet networks achieve NDCG@20 of 83.66% and 85.12% respectively. Our method clearly outperforms the other state-of-the-art methods, which demonstrates the advantage of learning a tiered similarity space. We further take advantage of logo detection to re-rank the retrieval results. The proposed method achieves mAP \(\approx \) 71% and NDCG@20 \(\approx \) 96% with re-ranking based on detected brand logo information. Figure 3 shows example search results obtained using WTBI [15], FashionNet [19], and the proposed method, further demonstrating the advantage of the proposed method.

Fig. 3. Sample search results with query images and top-5 retrieved images. Exact same-instance matches are highlighted with green borders. Best viewed in color.

4 Conclusions

We presented a new deep attribute-guided triplet network that exploits the importance of training samples and learns a tiered similarity space. The method uses a multitask CNN that shares mutual information among the tasks to better tune the loss. Using the predicted attributes, the proposed method first mines informative triplets and then uses them to train the triplet loss in a soft manner, which helps in capturing the tiered similarity desirable for fashion search. We believe that tiered similarity search will be appreciated by fashion companies, online retailers, and customers alike.