Abstract
Fashion is an integral part of everyday life in today’s connected society, and people feel more comfortable and confident when they dress well. Outfit compatibility is not only a matter of combining different items; it also depends on aspects such as style, user preferences, and specific occasions. Most existing works address outfit compatibility with respect to style or user preferences only, and disregard occasions. In this paper, we propose an efficient method for both the outfit compatibility and the fill-in-the-blank tasks conditioned on specific occasions. To this end, we use an auxiliary classification branch to learn the features that matter most for specific occasions. In addition, a sequence-to-sequence approach is applied to learn the relationships between different items, together with a visual semantic space that learns the connection between visual features and their semantic representation. To demonstrate the effectiveness of the proposed method, we conduct experiments on our newly collected Shoplook-Occasion dataset. The experimental results indicate that, compared with other approaches to the outfit compatibility problem conditioned on specific occasions, our proposed method improves the AUC metric by 0.02 to 0.15% and the accuracy by 0.5 to 4%.
1 Introduction
Recently, computer vision has been broadly applied to various problems in the fashion industry, for example, fashion attribute prediction Liu and Lu [23]; Zhang et al. [47], item matching Song et al. [34], fashion design Sbai et al. [32]; Yu et al. [46], virtual try-on Han et al. [10]; Ge et al. [8]; Choi et al. [4]; Li et al. [18], trend forecasting Al-Halah et al. [1], fashion captioning Bao et al. [2], fashion style models Simo-Serra and Ishikawa [33]; Takagi et al. [37]; Veit et al. [42]; Hsiao and Grauman [13], and clothing category classification Wang et al. [44]; Liu et al. [26]. Among them, the most popular fashion application is item recommendation Liu et al. [24]; Hu et al. [15]; Panagiotakis et al. [30]; Hou et al. [12], where the objective is to suggest items to customers based on customers’ and/or society’s preferences. Another study focused on automatic capsule wardrobe generation Hsiao and Grauman [14], which generates outfits based on the existing garments in a user’s wardrobe.
Most previous works focused on item-based recommendations Liu et al. [24]; Lin et al. [21]; Hou et al. [12]; Lu et al. [29], or on outfit-based recommendations conditioned on the criteria of compatibility or versatility Han et al. [9]; Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]; Veit et al. [42]; Tangseng and Okatani [39]; Sufang et al. [35]. These works only determine whether the items in an outfit match each other, regardless of other factors such as weather or occasion.
On the other hand, another interesting problem in fashion item recommendation is how to select an outfit that is appropriate for a certain occasion. Not everyone is capable of doing so, because it requires basic knowledge of how well the items in an outfit combine. For example, a wedding party calls for a more formal outfit than casual or weekend occasions do. A red dress with high heels is more closely associated with a wedding party than a red dress with sneakers. Thus, fashion item recommendation should pay attention to the relationships between the items in an outfit. Although several prior works, as mentioned in Liu et al. [24], considered item recommendation based on specific occasions, they focused only on the relation between a pair of items (top and bottom) rather than on the combination of the many items that an outfit usually contains.
To address the shortcomings of the two approaches mentioned above, we propose a method for outfit compatibility based on specific occasions, in which each outfit is a combination of many items rather than only a top and a bottom. In addition, the major challenge of the fashion-item recommendation problem is the visual similarity among occasions, which makes correct recognition and prediction difficult. Casual and weekend outfits usually share several items, and there are ambiguities among occasions such as parties, cocktail parties, and dating. To solve this problem, we propose an auxiliary classifier branch that includes a global average pooling layer, which captures the global visual features of an entire outfit, and a softmax layer, which classifies specific occasions.
Another challenge of the outfit compatibility and fill-in-the-blank problems is that the number of items in an input outfit varies, so the input is a bag of item images of varying size. Therefore, a system is needed that can handle input outfits with different numbers of items and can also be trained end to end. The combination of bidirectional long short-term memory (Bi-LSTM) and visual semantic embedding (VSE) is an efficient way to address this problem.
In addition, training the model requires a dataset that provides sufficient information about the occasion, consisting of item product images and the metadata corresponding to each item in an outfit. However, to the best of our knowledge, no such occasion-oriented fashion dataset exists so far, which is an important reason why the outfit compatibility and fill-in-the-blank problems are more challenging. Thus, the first step was to build a fashion dataset related to occasions. The main contributions of this study are threefold.
First, we collected a fashion dataset according to specific occasions, namely the Shoplook-Occasion dataset. This collected dataset provides essential information related to each outfit, which helps to address the outfit compatibility problem and the fill-in-the-blank problem conditioned on specific occasions.
Second, we adopt an efficient framework for outfit compatibility, inspired by Han et al. [9]. Our framework uses a VSE space to capture the semantic relation between the visual and text embeddings. In addition, the system is combined with a Bi-LSTM to learn the complicated relationships between fashion items in both directions, from top to bottom and vice versa. Moreover, we propose an auxiliary classification branch to address the occasion-based outfit compatibility and fill-in-the-blank problems. To extract visual features, we adopt the residual architecture He et al. [11], which in our experiments provides more essential information for each item than the Inception architecture Szegedy et al. [36]. We use the pre-trained residual model as the image embedding to improve model performance.
Finally, experiments on the collected dataset indicate that the proposed model, which incorporates the auxiliary classification branch into the combination of VSE and Bi-LSTM, outperforms the other methods on both general and occasion-specific outfit compatibility.
The rest of this work is organized as follows: Sect. 2 presents some previous related works. The details of the proposed framework are described in Sect. 3. In the next part (Sect. 4), we conduct experiments on the newly collected Shoplook-Occasion dataset, and a comparison between our proposed method and other approaches is presented. Finally, conclusions and future work are specifically discussed in Sect. 5.
2 Related works
The outfit compatibility prediction and fill-in-the-blank problems have been addressed by different approaches, such as sequence-based approaches Han et al. [9]; Lin et al. [21]; Tangseng and Okatani [39], metric learning-based approaches Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]; Veit et al. [42]; Hou et al. [12], graph-based approaches Cucurull et al. [5]; Cui et al. [6], and other approaches Liu et al. [24]; Tangseng et al. [40]; Hsiao and Grauman [13]; Li et al. [20]; Song et al. [34]; Li and Xu [19]; Lu et al. [28]; Lu et al. [29]. The remainder of this section describes each approach in more detail.
The metric learning-based approach focuses on measuring the similarity between images in a single similarity context. Specifically, the images are projected into a common embedding space, where the similarity between objects is measured by a distance and optimized with a corresponding loss function. Common loss functions include the hinge loss Weinberger et al. [45] and the triplet loss Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]. Vasileva et al. [41] proposed a method for learning image embeddings that respect item type, which uses the Conditional Similarity Networks (CSN) model of Veit et al. [43] to learn type-aware embeddings for outfit compatibility. In their study, a total of 66 conditional subspaces were learned, one for each pair of categories (for example, shoes-tops, bottom-hat, tops-bottoms, and bottoms-shoes). However, one limitation of this approach is that it does not consider different types of visual features. Therefore, several other studies have attempted to overcome this barrier by learning disentangled representations that capture different notions of similarity under the supervision of predefined similarity conditions. Lin et al. [22] introduced a new framework that includes a category-based attention mechanism, which depends only on the item categories, and a new outfit ranking loss function. Tan et al. [38] improved the performance of the model of Vasileva et al. [41] by learning shared subspaces along with the importance of each subspace. Lai et al. [17] extended the outfit compatibility prediction model of Vasileva et al. [41] by proposing a theme-attention model over the category-specific embedding space. Hou et al. [12] introduced a solution that leverages the semantics of visual attributes to train convolutional neural networks (CNNs) that learn a subspace for each attribute to obtain disentangled representations.
The graph-based approach has also been used for outfit compatibility prediction, as in Cucurull et al. [5]; Cui et al. [6]. The idea is to take into account the context of an item, that is, the other items with which it is known to be compatible. A graph neural network (GNN) learns to generate item embeddings conditioned on this context. Specifically, outfit compatibility is treated as an edge prediction problem with an encoder phase and a decoder phase. The encoder computes a new embedding for each item depending on its connections, and the decoder predicts the compatibility score of two items. However, one notable shortcoming of this approach is the high computational cost of leveraging test-time graph connections when new items are introduced into the catalog (i.e., the testing set).
The sequence-based approach is inspired by sequence models, which have been widely applied to natural language processing problems and sequential recommendation Lonjarret et al. [27]. Han et al. [9] treated every item in an input outfit as a word in a sentence and used a Bi-LSTM to learn the relationships between the items of an outfit in both directions. Meanwhile, Lin et al. [21] used a gated recurrent unit (GRU) together with attention mechanisms, including mutual attention and cross-modality attention, to predict compatibility and generate comments.
Other approaches. Several other studies have applied different methods to address compatibility problems Liu et al. [24]; Tangseng et al. [40]; Hsiao and Grauman [13]; Li et al. [20]; Song et al. [34]; Li and Xu [19]. Liu et al. [24] introduced a novel method based on mined occasion-attribute matching rules and mined attribute-attribute matching rules. Hsiao and Grauman [13] considered the compatibility task as a subset selection problem. Specifically, they built submodular objective functions to capture the key ingredients of visual compatibility, versatility, and user preference. Then, an unsupervised approach was used to learn visual compatibility from real-world data. Tangseng et al. [40] proposed a deep learning-based method to compute a grading score of outfit compatibility conditioned on fixed input items. Similarly, Song et al. [34]; Li et al. [20] applied cross-modality inputs to deep learning models to address this problem. Lu et al. [28] proposed a framework that learns binary codes for efficient personalized fashion outfit recommendation. Lu et al. [29] introduced a new solution for personalized outfit recommendation that uses a stacked self-attention mechanism to model the high-order interactions among items.
In this study, we focused on investigating two main approaches that have recently attracted an increasing number of studies: the metric learning-based approach and the sequence-based approach. We consider these two approaches as the baseline methods for comparison with our proposed method, which is presented in more detail in the next section.
3 Proposed method
Recently, most approaches to the outfit compatibility prediction and fill-in-the-blank problems Tangseng et al. [40]; Li et al. [20]; Han et al. [9]; Vasileva et al. [41] have focused mainly on esthetics rather than on specific occasions. In this paper, we introduce an efficient framework that addresses both the outfit compatibility prediction and the fill-in-the-blank problems with respect to specific occasions. The proposed framework comprises three major parts. In the first part, a Bi-LSTM is used to capture the compatibility relationships among the fashion items of an entire outfit. Next, a visual semantic embedding (VSE) space is built to learn the similarity between each fashion item and its text description. The last part is an auxiliary classification branch, which is proposed to score outfit compatibility on specific occasions. In particular, the input items of an outfit are passed into a residual network to extract visual features, each of which has 2048 dimensions. Simultaneously, the descriptions of these items are converted to one-hot vectors. Both outputs are then fed into the VSE space, where each visual feature and the one-hot vectors of its description are embedded with the same dimension of 512. In addition, the 2048-dimensional visual features are passed through a bidirectional LSTM to learn the relationships between the items of an outfit in both directions (top to bottom and vice versa). An auxiliary classifier branch, which consists of one global average pooling layer, one fully connected layer, and one softmax function, is proposed to learn disentangled representations of each occasion. This branch categorizes outfit compatibility with respect to specific occasions. An overview of our proposed method is presented in Fig. 1. The components are described in more detail below.
3.1 Outfit compatibility learning
Inspired by Han et al. [9], we consider an entire outfit, that is, a set of fashion items, as a sentence made of a sequence of words. Treating the input items as a sequence, as in natural language processing, we adopt a Bi-LSTM to capture the complicated relationships among fashion items from top to bottom and in the opposite direction. Bi-LSTM is an extended version of long short-term memory (LSTM). Thanks to the gating structure around the cell state (an input gate, a forget gate, and an output gate), LSTM and Bi-LSTM mitigate the major challenges of sequence modeling, namely the vanishing gradient and long-term dependency problems, which cannot be solved by a traditional RNN. In addition, combining the outputs of the forward and backward LSTM passes plays a vital role in preserving the essential features of the entire outfit. More specifically, a fashion image sequence \(X = \{x_1, x_2,...,x_N\}\) is passed into the first module with Bi-LSTMs, where \(x_t\) is the feature representation, extracted by a CNN model, of the t-th fashion item in the sequence. A softmax function is applied to each hidden state \(h_t\), the output of the LSTM Han et al. [9]. This function calculates the probability of the next fashion item \(p(x_{t+1} \mid x_1, x_2,...,x_t)\) conditioned on the previously seen items, and the probability of the previous fashion item \(p(x_{t} \mid x_N,...,x_{t+1})\) conditioned on the following items, where \(x_0\) and \(x_{N+1}\) are two zero vectors added to X so that the Bi-LSTM can determine when to stop predicting the next item. The objective function of the forward direction in the Bi-LSTM is:
\[\mathcal{L}_{f} = -\frac{1}{N}\sum_{t=1}^{N} \log p(x_{t+1} \mid x_1, x_2,...,x_t) \tag{1}\]
and the objective function of the backward direction in the Bi-LSTM is:
\[\mathcal{L}_{b} = -\frac{1}{N}\sum_{t=0}^{N-1} \log p(x_{t} \mid x_N,...,x_{t+1}) \tag{2}\]
Finally, the Bi-LSTM loss function is calculated by summing up \(\mathcal {L}_{f}\) and \(\mathcal {L}_{b}\) as follows:
\[\mathcal{L}_{\mathrm{lstm}} = \mathcal{L}_{f} + \mathcal{L}_{b} \tag{3}\]
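As an illustration only, the forward and backward objectives described above can be sketched in plain Python. The sketch assumes that some recurrent model has already turned each hidden state \(h_t\) into one raw score per candidate item; the function names (`log_softmax`, `direction_loss`, `bilstm_loss`) and the list-based inputs are hypothetical and are not taken from the authors' implementation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def direction_loss(step_scores, target_ids):
    """Mean negative log-likelihood of the true item at every step.

    step_scores[t] holds one raw score per candidate item (assumed to be
    computed from the hidden state h_t); target_ids[t] is the index of the
    item that actually follows (forward) or precedes (backward) at step t.
    """
    total = 0.0
    for scores, target in zip(step_scores, target_ids):
        total -= log_softmax(scores)[target]
    return total / len(target_ids)


def bilstm_loss(fwd_scores, fwd_targets, bwd_scores, bwd_targets):
    """Sum of the forward and backward objectives."""
    return (direction_loss(fwd_scores, fwd_targets)
            + direction_loss(bwd_scores, bwd_targets))
```

With uniform scores over four candidates, each direction contributes \(\log 4\), which is a quick sanity check that the softmax and the averaging behave as expected.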
3.2 Visual-semantic embedding space
The multimodal embedding space of texts and images, also known as the visual semantic embedding (VSE) space, plays an important role in learning the semantic relations between texts and images. VSE can handle not only computer vision tasks but also natural language processing tasks. To train a VSE space, a fashion item image from an outfit is projected into the space by multiplying the image representation \(x^{2048 \times 1}\) by the image embedding matrix \(W_f^{512 \times 2048}\): \(f = W_fx\). The text description of each fashion item image is also projected into the VSE space. The input text \(S=\{w_1, w_2,...,w_M\}\) is first converted to one-hot vectors \(e = \{e_1, e_2,...,e_M\}\), which are then embedded by the word embedding matrix \(W_T^{512 \times 5180}\). Specifically, the word \(w_i\) is represented by the i-th one-hot vector \(e_i\), which is transformed to \(v_i = W_Te_i\). Finally, the set of embedding vectors \(v = \{v_1, v_2,...,v_M\}\) is encoded as a bag of words by averaging, which yields \(\hat{v}\). The contrastive loss function \(\mathcal {L}_\mathrm{{vse}}\) is used to learn the similarity relation between the image embedding vector and its text embedding vector, and is defined as follows:
\[\mathcal{L}_{\mathrm{vse}} = \sum_{f}\sum_{k}\max \bigl(0, \xi + \delta (f,\hat{v}) - \delta (f,v_k)\bigr) + \sum_{\hat{v}}\sum_{k}\max \bigl(0, \xi + \delta (\hat{v},f) - \delta (\hat{v},f_k)\bigr) \tag{4}\]
where \(\delta (i,j)\) is the cosine similarity distance between image i and description j, \(v_k\) is a description that does not match image f, and \(f_k\) is an image that does not match description \(\hat{v}\). \(\mathcal {L}_\mathrm{{vse}}\) is minimized when the cosine similarity distance between f and its description embedding \(\hat{v}\) is smaller than the distance to any unmatched description \(v_k\) by at least the margin \(\xi\), as in Eq. 4.
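The projection and the margin-based loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it takes \(\delta\) to be one minus the cosine similarity, represents \(W_T\) by a list of its columns so that \(W_Te_i\) becomes a simple lookup, and uses toy dimensions instead of 2048, 512, and 5180. All function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector, e.g. f = W_f x."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def embed_text(word_ids, w_t_columns):
    """Bag-of-words text embedding: average the columns W_T e_i."""
    vectors = [w_t_columns[i] for i in word_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_distance(a, b):
    """Assumed form of delta: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def vse_loss(images, texts, margin=0.2):
    """Hinge loss over matched pairs (images[i], texts[i]).

    Every other text in the batch serves as a non-matching description
    v_k, and every other image as a non-matching image f_k.
    """
    loss = 0.0
    for i, (f, v) in enumerate(zip(images, texts)):
        matched = cosine_distance(f, v)
        for k in range(len(images)):
            if k == i:
                continue
            loss += max(0.0, margin + matched - cosine_distance(f, texts[k]))
            loss += max(0.0, margin + matched - cosine_distance(v, images[k]))
    return loss
```

The loss is zero once every matched pair is closer than all mismatched pairs by at least the margin, and grows when descriptions are swapped between images.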
3.3 Auxiliary classification branch
Most existing models consider outfit compatibility only in terms of esthetics. In this paper, we propose an auxiliary classification branch that addresses outfit compatibility on specific occasions. The auxiliary classification structure consists of three parts. First, a global average pooling layer captures the abstract visual features from the stacked visual features \(X = \{x_1, x_2...x_N\}\), which are extracted by the residual network, to form a global feature vector. Next, the output of this layer is passed through a fully connected layer to obtain the logit values. Last, a softmax function computes the probability of each occasion. To train the auxiliary classification branch, we minimize the cross-entropy loss function as follows:
\[\mathcal{L}_{\mathrm{aux}} = -\sum_{k \in C} y^k \log \hat{y}^k \tag{5}\]
where \(y^k\) is the ground truth label, \(\hat{y}^k\) denotes the predicted probability for occasion k, and C represents the set of the eight occasions in the Shoplook-Occasion dataset: casual, cocktail party, dating, party, school, wedding guest, weekend, and work. The details of this dataset are presented in the next section.
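The three parts of the auxiliary branch (global average pooling, a fully connected layer, and a softmax) can be sketched in a few lines of plain Python. The sketch is illustrative: the weights and bias are passed in explicitly, the feature dimension is tiny rather than 2048, and the function names are hypothetical rather than the authors' implementation.

```python
import math

OCCASIONS = ["casual", "cocktail party", "dating", "party",
             "school", "wedding guest", "weekend", "work"]


def global_average_pool(item_features):
    """Average the stacked item features into one global outfit vector."""
    count = len(item_features)
    return [sum(column) / count for column in zip(*item_features)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def occasion_probabilities(item_features, weights, bias):
    """Pooling, then a fully connected layer, then softmax."""
    pooled = global_average_pool(item_features)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probabilities, true_index):
    """Cross-entropy for a one-hot ground truth label."""
    return -math.log(probabilities[true_index])
```

Because the pooling averages over items, the same branch accepts outfits with any number of items, which matches the variable-size input discussed in the introduction.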
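In the last stage of this framework, we combine the Bi-LSTM with the visual semantic embedding (VSE) model to improve the effectiveness of our proposed model by providing both fashion compatibility information (from the Bi-LSTM) and visual semantic information (from VSE). In addition, the auxiliary classification branch is added to address the outfit compatibility and fill-in-the-blank problems with respect to specific occasions. The final objective function is as follows:
\[\mathcal{L} = \gamma \mathcal{L}_{\mathrm{lstm}} + \lambda \mathcal{L}_{\mathrm{vse}} + \beta \mathcal{L}_{\mathrm{aux}} \tag{6}\]
where \(\mathcal{L}_{\mathrm{lstm}}\), \(\mathcal {L}_\mathrm{{vse}}\) and \(\mathcal{L}_{\mathrm{aux}}\) are the Bi-LSTM, VSE and auxiliary classification losses, and \(\lambda , \beta\) and \(\gamma\) are the hyperparameters of the proposed model that weight the VSE, auxiliary classification and Bi-LSTM terms, respectively.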
4 Experiments
4.1 Shoplook-Occasion dataset
Normally, an outfit consists of about eight common parts: top, bottom, all-body, outerwear, shoes, jewellery, bag, and accessories. We therefore collected the corresponding garments for each outfit, as presented in Table 1. In this study, we collected outfits labeled with specific occasions from Shoplook.io to create a new fashion dataset, called the Shoplook-Occasion dataset. Shoplook.io is a social commerce website that provides important information such as item categories, titles, and many different occasion labels. However, outfits crawled from this website contain a lot of noise, in both the visual and the textual data. Thus, it is necessary to clean the crawled dataset.
The data cleaning process is a two-stage procedure. First, a machine learning model is built to automatically classify the semantic category of each item based on the existing metadata. The purpose is to remove images that do not have a corresponding item category. At the end of this step, the dataset contains many outfits, each with a list of items and their semantic categories and descriptions. Table 1 shows the number of items in each semantic category of the Shoplook-Occasion dataset. However, because an outfit may carry many occasion labels, it is necessary to select the most appropriate occasion for each outfit. In the second stage, the validation step, a number of experienced annotators are asked to label the occasion of each outfit to improve the quality of the dataset obtained from the previous stage. To choose the number of annotators, we referred to previous works Liu et al. [25]; Bourdev et al. [3]; Duran et al. [7]. Liu et al. [25] collected the WOW (What to Wear) dataset for occasion-oriented clothing recommendation with five annotators. Meanwhile, Bourdev et al. [3] built a dataset that describes people with attributes such as gender, hair style, and types of clothes, under the large variation in viewpoint, pose, articulation, and occlusion typical of personal photo album images. Like Liu et al. [25], Bourdev et al. asked five annotators to label all attributes. More recently, Duran et al. [7] aimed to capture both the semantic and the syntactic structure of dialogue, with the labeling task undertaken by 15 novice annotators. Based on these considerations, we asked 20 experienced annotators to choose the right label for each outfit when constructing the Shoplook-Occasion dataset. All of these annotators have prior knowledge of fashion, style, and outfits.
Specifically, we used the tag information from users on the Shoplook.io website to obtain candidate occasions. For each outfit, we asked the annotators to rank these occasions from 1 to 3. Finally, we chose a single occasion label for each outfit, namely the occasion ranked highest across all annotators. This second step follows the automatic classification of each item’s semantic category from the existing metadata. Its purpose is to further improve the quality of the results obtained in the previous stage by exploiting the knowledge and expertise of our experienced annotators. In our view, 20 experienced annotators are sufficient to validate the quality of our dataset when combined with the automatic classification of the previous step.
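The text does not spell out how the rankings of the annotators are merged into one label, so the following is only one plausible reading, given as a sketch. It assumes that the winning occasion is the one ranked first by the most annotators, and that ties are broken by a 3-2-1 point count over the three ranks; both rules and the function name `choose_occasion` are assumptions made for illustration.

```python
from collections import Counter


def choose_occasion(rankings):
    """Pick one occasion label from the annotators' rankings.

    rankings holds one list per annotator, ordered from rank 1 to rank 3.
    Assumed rule: most first-place votes wins; ties are broken by points
    (3 for rank 1, 2 for rank 2, 1 for rank 3), then alphabetically.
    """
    first_choices = Counter(ranking[0] for ranking in rankings)
    points = Counter()
    for ranking in rankings:
        for position, occasion in enumerate(ranking):
            points[occasion] += 3 - position
    return max(first_choices,
               key=lambda occ: (first_choices[occ], points[occ], occ))
```

Other aggregation rules, such as a pure point count, would fit the description equally well; the choice here only keeps the example short.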
At the end of the process, we obtained a total of 8 occasions, 4,752 outfits, and 16,968 unique items. Each outfit in the dataset has exactly one occasion label and a list of corresponding items. In addition, each item is labeled with its semantic category, image, and description. This makes the dataset suitable for the outfit compatibility problem with respect to specific occasions. Figure 3 shows some sample outfits for specific occasion labels in the Shoplook-Occasion dataset.
Next, we split the Shoplook-Occasion dataset into three disjoint subsets (training, validation, and test sets) using a modification of the method of Tangseng et al. [40]. The training and validation sets were used to learn the model and tune its hyperparameters, whereas the test set was used to evaluate the learned model.
A tripartite graph is built as in Algorithm 1, where each item is a vertex and the edges represent the relationships between items in each outfit. This is a disjoint-set sampling algorithm: no item in one set is shared with another set, which helps to remove duplicated items. Finally, we obtained 2,967 outfits for the training set, 889 outfits for the validation set, and 896 outfits for the test set. More details about the split are shown in Table 2, and the distribution of occasion labels in each set is illustrated in Fig. 2.
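Algorithm 1 itself is not reproduced here, so the following sketch only illustrates the general idea of an item-disjoint split and should not be read as the authors' procedure. It assumes a simpler strategy: outfits that share any item are grouped with a union-find structure, and whole groups are then assigned to the training, validation, or test set. The function names and the default ratios are hypothetical.

```python
def group_outfits_by_shared_items(outfits):
    """Group outfit ids so that outfits sharing any item stay together.

    outfits maps an outfit id to the non-empty list of its item ids.
    """
    parent = {}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for items in outfits.values():
        for item in items:
            parent.setdefault(item, item)
        for item in items[1:]:
            parent[find(item)] = find(items[0])

    groups = {}
    for outfit_id, items in outfits.items():
        groups.setdefault(find(items[0]), []).append(outfit_id)
    return list(groups.values())


def split_disjoint(outfits, ratios=(0.6, 0.2, 0.2)):
    """Assign whole groups to (train, val, test); no item crosses a split."""
    targets = [ratio * len(outfits) for ratio in ratios]
    splits = ([], [], [])
    groups = group_outfits_by_shared_items(outfits)
    for group in sorted(groups, key=len, reverse=True):
        # Fill the split that is currently furthest below its target size.
        index = max(range(3), key=lambda i: targets[i] - len(splits[i]))
        splits[index].extend(group)
    return splits
```

On real data a single large group of connected outfits can make the target ratios unreachable; a practical procedure would then have to drop some outfits or items, which this sketch does not attempt.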
4.2 Implementation details
Inspired by previous works Tangseng et al. [40]; Vasileva et al. [41]; Han et al. [9]; Tan et al. [38], we address the fashion outfit compatibility prediction task and the fill-in-the-blank (FITB) task with respect to specific occasions. In the outfit compatibility task, a candidate item is combined with other items, and the resulting outfit is scored for compatibility on specific occasions. In the fill-in-the-blank task, a sequence of fashion items is given with one item missing at a random position, and the appropriate item, compatible with the remaining items for the specific occasion, must be chosen from a list of multiple choices. To the best of our knowledge, no dataset exists for these tasks in occasion compatibility scenarios. Thus, we created a fill-in-the-blank dataset conditioned on occasions from all the outfits in the test set of Shoplook-Occasion. Following the method of Han et al. [9], for each outfit we randomly chose one item and replaced it with a blank. Then, we selected three items of the same semantic category as the replaced item from other outfits to form a set of four answers: the three selected items and the ground truth item. In addition, for the outfit compatibility task, we created 844 incompatible outfits by selecting fashion items from the test set.
In this study, the experimental models were implemented in Python 3.7 with the PyTorch framework. The operating system was Ubuntu 16.04 LTS, running on an Intel Core i7-4790K (4.0 GHz x 8 cores), 32 GB of RAM, and a GeForce GTX 1080 Ti.
To compare with other state-of-the-art methods Han et al. [9]; Tan et al. [38]; Vasileva et al. [41], we conducted the experiments reported in Tables 3 and 4. The results of the fashion compatibility prediction task and the fill-in-the-blank task are shown in Table 3, while Table 4 presents the results of the same tasks with respect to specific occasions. In the fashion compatibility prediction problem, a user creates a new outfit, and the model is expected to determine whether it is compatible with the specific occasion. Performance was evaluated using the multi-class AUC of the ROC curve. In addition, accuracy is used to evaluate the fill-in-the-blank task; it is computed over the option-selection process. We choose the best answer by substituting each candidate item into the blank and computing the score for each occasion. For optimization, we used the stochastic gradient descent optimizer for 500 epochs. The remaining parameters of each experiment are described below.
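The construction of a fill-in-the-blank question described above can be sketched as follows. The data layout (outfits as lists of `(item_id, category)` pairs), the function name `make_fitb_question`, and the returned dictionary keys are assumptions made for the example; only the procedure of one random blank, three same-category distractors from other outfits, and one ground truth item comes from the text.

```python
import random


def make_fitb_question(outfit, other_outfits, rng, num_distractors=3):
    """Build one fill-in-the-blank question from an outfit.

    outfit and each entry of other_outfits are lists of
    (item_id, category) pairs; rng is a random.Random instance.
    Returns None when too few same-category items are available.
    """
    blank_position = rng.randrange(len(outfit))
    answer_id, category = outfit[blank_position]
    # Candidate distractors: same category, taken from other outfits.
    pool = sorted({item_id
                   for other in other_outfits
                   for item_id, cat in other
                   if cat == category and item_id != answer_id})
    if len(pool) < num_distractors:
        return None
    choices = rng.sample(pool, num_distractors) + [answer_id]
    rng.shuffle(choices)
    context = [item for pos, item in enumerate(outfit)
               if pos != blank_position]
    return {"context": context, "blank_position": blank_position,
            "choices": choices, "answer": answer_id}
```

Passing a seeded `random.Random` keeps the generated test questions reproducible across runs.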
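To make the two metrics concrete, the sketch below computes a binary AUC from its rank interpretation and the FITB accuracy from a scoring function. It is a simplified illustration: the multi-class AUC mentioned above would repeat the binary computation per occasion, the candidate is appended to the context rather than inserted at the blank position, and `score_outfit` stands in for whatever trained model is being evaluated. All names are hypothetical.

```python
def roc_auc(scores, labels):
    """Binary AUC: probability that a compatible outfit (label 1)
    receives a higher score than an incompatible one (label 0).
    Ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


def fitb_accuracy(questions, score_outfit):
    """Fraction of questions where the top-scoring choice is the answer.

    Each question is a dict with "context", "choices" and "answer";
    score_outfit maps a list of items to a compatibility score.
    """
    correct = 0
    for question in questions:
        best = max(question["choices"],
                   key=lambda c: score_outfit(question["context"] + [c]))
        correct += best == question["answer"]
    return correct / len(questions)
```

With four answer choices per question, a scorer that guesses at random would reach an accuracy of about 25%, which is a useful reference point when reading FITB results.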
CSN, T1:1 + VSE (HGLMM) + Sim + Metric + Auxiliary Branch. Similar to Vasileva et al. [41], we used an 18-layer ResNet for image embedding, and the text embedding was extracted using a word embedding model. We used the pre-trained HGLMM Fisher vector encoding Klein et al. [16] and reduced it to 600 dimensions using principal component analysis (PCA). VSE is used to learn the compatibility between different categories of fashion items. In addition, a similarity branch is learned to measure the similarity between fashion items of the same category.
In addition, a pairwise type-dependent transformation (following Vasileva et al. [41]) is used to measure the compatibility between two fashion items by projecting them into a type-specific space. A triplet loss function is also adopted by taking the element-wise product of the embedding vectors from the type-specific spaces and passing it into a fully connected layer, as in Vasileva et al. [41]. Moreover, we added an auxiliary classification branch so that the model handles outfit compatibility on specific occasions as well as general outfit compatibility.
CSN, T1:1 + VSE (GloVe) + Sim + Metric + Auxiliary Classification Branch. This experiment follows the previous one. However, to extract features from the textual data, we used the GloVe method Pennington et al. [31], trained on Wikipedia and Gigaword, instead of HGLMM. The reason is that GloVe considers both semantics and context, whereas the Fisher vector-based method encodes the gradients of the log-likelihood of the descriptors. Specifically, the textual data were encoded as 300-dimensional features. The hyperparameters are a learning rate of \(5 \times 10^{-5}\), a batch size of 256, and a margin m of 0.2. For the VSE and \(l_1, l_2\) losses, the parameters are set to \(\lambda _1 = \lambda _2 = 5 \times 10^{-5}\) and \(\lambda _3 = 5 \times 10^{-3}\).
Bi-LSTM + VSE + Auxiliary Classification Branch. We performed an experiment similar to that of Han et al. [9], based on the sequence-based model. In particular, the Inception-v3 model is used as the image embedding to represent each input image as a 2048-dimensional vector. The image feature vector is then passed through one fully connected layer to reduce it to 512 dimensions. The resulting image feature is fed into the Bi-LSTM, which has 512 hidden units and a dropout rate of 0.7. For VSE, the dimension was set to 512, with the image embedding matrix \(W_I \in R^{2048 \times 512}\) and the word embedding matrix \(W_T \in R^{5180 \times 512}\), where 5180 is the size of the vocabulary in the Shoplook-Occasion dataset. The learning rate and margin m are initialized to 0.2, and the learning rate is then changed by a factor of 2 every two epochs, with a batch size of 10. The hyperparameters of VSE (\(\lambda\)) and Bi-LSTM (\(\gamma\)) are set to \(\lambda = 1\) and \(\gamma = 1\), respectively.
However, visual features extracted by the Inception-v3 model tend to lose information between the essential features of the current layer and those of the previous layer. Therefore, we propose to use the ResNet-50 model, which preserves essential information from previous layers through the skip connections in its structure. Moreover, we added an auxiliary classification branch to deal with fashion problems on specific occasions. The hyperparameter \(\beta = 1\) was used to weight the auxiliary classification branch.
Experimental results in Table 3 show that the proposed method outperforms the other methods in both the FITB task and the outfit compatibility prediction task. The first and second rows of Table 3 report the pair-wise type-dependent approach of Vasileva et al. [41] on both tasks (fashion compatibility prediction and FITB). In the first row, we adopted the model of Vasileva et al. [41] and modified it by using Glove instead of HGLMM to extract text information. The results indicate that replacing HGLMM with Glove improves the FITB task by approximately 0.9%. However, on fashion outfit compatibility prediction, the Glove-based method is approximately 0.04% lower than the HGLMM-based one. In addition, we adopted the sequence-based approach of Han et al. [9] and modified its image embedding by using the ResNet model rather than the Inception model. The results are presented in the next two rows of Table 3. This modification improves the performance of the model by 0.04% in AUC and 2.8% in accuracy on the two tasks, respectively. These results also indicate that the sequence-based approach appears to be better than the metric learning-based approach on an occasion dataset such as Shoplook-Occasion. Furthermore, to demonstrate the effectiveness of the auxiliary classifier branch, we conducted experiments in which the branch was added to both the sequence-based approach and the metric learning-based approach. The results show that the auxiliary classifier improves not only fashion outfit compatibility prediction but also the fill-in-the-blank task.
For occasion-based fashion problems, Table 4 shows the overall comparison of the various methods. The Glove-based method obtains AUC scores of 0.69, 0.62 and 0.63 for visual feature metric dimensions of 512-D, 256-D and 128-D, respectively. The Glove-based method is also better than the HGLMM-based method by approximately 1 to 8% in AUC on the fashion outfit compatibility task conditioned on occasions. Meanwhile, the sequence-based approaches also outperform the pairwise type-dependent approach, with AUC scores of 0.75 and 0.77, respectively, as shown in the last two rows. The results in Table 4 demonstrate the effectiveness of our proposed model, which is able to preserve lower-level visual features through the skip connections inside the ResNet architecture. This helps improve the AUC score by approximately 2% compared with the Inception network. In addition, Fig. 4 visually illustrates the performance of our proposed model on each occasion defined in the Shoplook-Occasion dataset. Overall, our proposed method outperformed the other methods on most occasions. However, a comparison between Fig. 4c and 4d shows that the wedding guest label only reaches 0.7 with our proposed method, which is 0.05 lower than the value shown in Fig. 4c. For the fill-in-the-blank task in the occasion-based setting, the best result is obtained by the Glove-based method with a feature dimension of 128-D. Meanwhile, our proposed method and the HGLMM-based method achieved the same performance of 15.2.
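For readers less familiar with the two metrics reported in Tables 3 and 4, the following sketch shows how they are commonly computed: AUC as the fraction of (compatible, incompatible) outfit pairs that are ranked correctly, and FITB accuracy as the fraction of questions whose highest-scoring candidate is the ground truth. The scores and labels are made-up toy values, and the function names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking:
    the share of (positive, negative) pairs where the positive
    scores higher, with ties counted as one half."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def fitb_accuracy(questions):
    """`questions` is a list of (candidate_scores, correct_index) pairs;
    a question counts as correct when the top-scoring candidate
    is the ground-truth item."""
    correct = 0
    for candidate_scores, answer in questions:
        predicted = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
        if predicted == answer:
            correct += 1
    return correct / len(questions)

# Toy compatibility scores for four outfits (1 = compatible, 0 = incompatible).
auc_value = auc([0.9, 0.8, 0.3, 0.6], [1, 0, 0, 1])

# Two toy FITB questions with four candidates each.
fitb_value = fitb_accuracy([
    ([0.1, 0.7, 0.2, 0.0], 1),  # predicted 1, correct
    ([0.4, 0.3, 0.2, 0.1], 2),  # predicted 0, wrong
])
```

In this toy case three of the four positive-negative pairs are ordered correctly and one of the two questions is answered correctly.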
In addition to AUC and accuracy, we evaluated all methods with respect to computational cost. We measured the running time of each method on the training and testing sets of the Shoplook-Occasion dataset; the results are presented in Table 5. Specifically, our proposed method takes 14273.8 s for training and 57.3 s for testing. Although this is longer than the Glove-based and HGLMM-based methods, it is shorter than the method based on Han et al. [9].
Figures 5 and 6 visually illustrate the qualitative results of our method for both tasks. First, our method can predict a set of fashion items that are suitable for a specific occasion, as shown in Fig. 5. The predicted score is 6.18 for the choice of the casual label, as in the first row of Fig. 5a. The green box marks the ground truth answer, and the red box marks the answer predicted by our method. Figure 5a shows successful cases and Fig. 5b shows failed cases of our method. Figure 6 shows the results of our method for the occasion-based fashion outfit compatibility task. Figure 6a presents failed cases: for example, the casual label is wrongly predicted as the cocktail party label. Figure 6b presents cases that were correctly predicted by our method. Overall, the predicted scores show that fashion outfit compatibility prediction and the fill-in-the-blank task are difficult for certain occasions. Some occasions share similar visual characteristics in an outfit, such as casual and weekend, or cocktail party and party. Therefore, the predicted scores are less discriminative among these occasions.
5 Conclusion
In this paper, we proposed a framework to address the outfit compatibility problem and the fill-in-the-blank problem with respect to specific occasions. Our framework adopts visual semantic embedding to capture the semantic relationship between visual data and meta-data based on the semantic metric. Bi-LSTMs are also used to learn the complex relationships between fashion items in both the forward and backward directions of an outfit. We also used an auxiliary classification branch to recognize outfit compatibility in terms of specific occasions. Moreover, we collected an occasion-related fashion dataset from a social website, namely the Shoplook-Occasion dataset, which provides important information about occasion labels. Experiments were conducted on the collected dataset to compare the performance of our proposed model with that of other state-of-the-art methods. The experimental results showed that our method achieved positive results for both the outfit compatibility problem and the fill-in-the-blank problem regarding occasions. Although our work has some encouraging results, it also has some limitations. First, it is time-consuming to learn the parameters of each module. Second, owing to limited human resources, the Shoplook-Occasion dataset is restricted in the number of outfits. Finally, some occasions share similar visual characteristics in an outfit, such as casual and weekend, or cocktail party and party.
In the future, we plan to extend the Shoplook-Occasion dataset by adopting semi-supervised learning to enrich the collected data. Moreover, an attention mechanism will be integrated into our method to capture the relationships between fashion items. Last but not least, we also plan to tackle outfit complementary item retrieval tasks for other complicated and difficult occasions.
Data availability
The original dataset was collected from http://shoplook.io. The new Shoplook-Occasion dataset that supports the findings of this study is not openly available due to its large size, but is available from the corresponding author upon reasonable request.
References
Al-Halah Z, Stiefelhagen R, Grauman K (2017) Fashion forward: Forecasting visual style in fashion. CoRR
Bao NT, Prakash O, Vo AH (2020) Attention Mechanism for Fashion Image Captioning. In: Advances in Intelligent Systems and Computing
Bourdev L, Maji S, Malik J (2011) Describing people: a poselet-based approach to attribute classification. pp 1543–1550, https://doi.org/10.1109/ICCV.2011.6126413
Choi S, Park S, Lee M, et al (2021) Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 14,131–14,140
Cucurull G, Taslakian P, Vazquez D (2019) Context-aware visual compatibility prediction. In: The IEEE conference on computer vision and pattern recognition (CVPR)
Cui Z, Li Z, Wu S, et al (2019) Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks. In: The world wide web conference. ACM, New York, NY, USA, WWW ’19, pp 307–317, https://doi.org/10.1145/3308558.3313444
Duran N, Battle S, Smith J (2022) Inter-annotator agreement using the conversation analysis modelling schema, for dialogue. Commun Methods Meas 16(3):182–214. https://doi.org/10.1080/19312458.2021.2020229
Ge Y, Song Y, Zhang R, et al (2021) Parser-free virtual try-on via distilling appearance flows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 8485–8493
Han X, Wu Z, Jiang Y, et al (2017) Learning fashion compatibility with bidirectional lstms. MM’17
Han X, Wu Z, Wu Z, et al (2018) Viton: an image-based virtual try-on network. In: CVPR
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR), pp 770–778
Hou Y, Vig E, Donoser M, et al (2021) Learning attribute-driven disentangled representations for interactive fashion retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 12,147–12,157
Hsiao W, Grauman K (2017) Learning the latent ’look’: unsupervised discovery of a style-coherent embedding from fashion images. In: IEEE International conference on computer vision (ICCV)
Hsiao W, Grauman K (2018) Creating capsule wardrobes from fashion images. In: 2018 IEEE Conference on computer vision and pattern recognition. IEEE Computer Society, pp 7161–7170
Hu Y, Yi X, Davis LS (2015) Collaborative fashion recommendation: a functional tensor factorization approach. In: ACM Multimedia. ACM, pp 129–138
Klein B, Lev G, Sadeh G, et al (2014) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. CoRR abs/1411.7399
Lai JH, Wu B, Wang X, et al (2019) Theme-matters: fashion compatibility learning via theme attention
Li K, Chong MJ, Zhang J, et al (2021) Toward accurate and realistic outfits visualization with attention to details. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 15,546–15,555
Li W, Xu B (2020) Aspect-based fashion recommendation with attention mechanism. IEEE Access 8:141,814–141,823
Li Y, Cao L, Zhu J, et al (2017) Mining fashion outfit composition using an end-to-end deep learning approach on set data. In: IEEE Transactions on multimedia, pp 1946–1955
Lin Y, Ren P, Chen Z, et al (2019) Explainable outfit recommendation with joint outfit matching and comment generation. IEEE Transactions on knowledge and data engineering
Lin Y, Tran SD, Davis LS (2020) Fashion outfit complementary item retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3311–3319
Liu J, Lu H (2018) Deep fashion analysis with feature map upsampling and landmark-driven attention. In: European conference on computer vision, Springer, pp 30–36
Liu S, Feng J, Song Z, et al (2012a) Hi, magic closet, tell me what to wear! In: Proceedings of the ACM international conference on multimedia, pp 619–628
Liu S, Nguyen T, Feng J, et al (2012b) Hi, magic closet, tell me what to wear! pp 1333–1334, https://doi.org/10.1145/2393347.2396470
Liu Z, Luo P, Qiu S, et al (2016) Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. pp 1096–1104, https://doi.org/10.1109/CVPR.2016.124
Lonjarret C, Auburtin R, Robardet C et al (2021) Aspect-based fashion recommendation with attention mechanism. Data Min Knowl Disc 35(3):1087–1133
Lu Z, Hu Y, Jiang Y, et al (2019) Learning binary code for personalized fashion recommendation. In: CVPR
Lu Z, Hu Y, Chen Y, et al (2021) Personalized outfit recommendation with learnable anchors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12,722–12,731
Panagiotakis C, Papadakis H, Papagrigoriou A et al (2021) Improving recommender systems via a dual training error based correction approach. Expert Syst Appl 183:115386
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: EMNLP
Sbai O, Elhoseiny M, Bordes A, et al (2018) Design: design inspiration from generative networks. In: Proceedings of the European conference on computer vision (ECCV)
Simo-Serra E, Ishikawa H (2016) Fashion Style in 128 Floats: joint ranking and classification using weak data for feature extraction. In: Proceedings of the conference on computer vision and pattern recognition (CVPR)
Song X, Feng F, Liu J, et al (2017) Neurostylist: neuro compatibility modeling for clothing matching. In: Proceedings of the 2017 ACM on multimedia conference, pp 753–761
Sufang LX, Zhu Y et al (2021) Outfit compatibility prediction with multi-layered feature fusion network. Pattern Recogn Lett 147:150–156
Szegedy C, Vanhoucke V, Ioffe S, et al (2016) Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2818–2826
Takagi M, Simo-Serra E, Iizuka S, et al (2017) What makes a style: experimental analysis of fashion prediction. In: The IEEE international conference on computer vision (ICCV) Workshops
Tan R, Vasileva MI, Saenko K, et al (2019) Learning similarity conditions without explicit supervision. In: ICCV
Tangseng P, Okatani T (2020) Toward explainable fashion recommendation. In: Proceedings of the IEEE winter conference on applications of computer vision (WACV), pp 3311–3319
Tangseng P, Yamaguchi K, Okatani T (2017) Recommending outfits from personal closet. In: ICCV
Vasileva MI, Plummer BA, Dusad K, et al (2018) Learning type-aware embeddings for fashion compatibility. CoRR abs/1803.09196
Veit A, Kovacs B, Bell S, et al (2015) Learning visual clothing style with heterogeneous dyadic co-occurrences. In: International conference on computer vision (ICCV), Santiago, Chile
Veit A, Belongie S, Karaletsos T (2017) Conditional similarity networks. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 1781–1789
Wang W, Xu Y, Shen J, et al (2018) Attentive fashion grammar network for fashion landmark detection and clothing category classification. In: IEEE Conference on computer vision and pattern recognition, CVPR 2018. IEEE Computer Society, pp 4271–4280
Weinberger KQ, Blitzer J, Saul LK (2006) Distance metric learning for large margin nearest neighbor classification. In: Weiss Y, Schölkopf B, Platt JC (eds) advances in neural information processing systems 18. MIT Press, pp 1473–1480
Yu C, Hu Y, Chen Y, et al (2019) Personalized fashion design. In: The IEEE International conference on computer vision (ICCV)
Zhang S, Song Z, Cao X, et al (2020) Task-aware attention model for clothing attribute prediction. IEEE Trans Circuits Syst Video Technol 30(4):1051–1064
Acknowledgements
The authors would like to thank Hao C.S Duong, Tuan D. Tran, and Mai X. Phan for their generous help in manually cleaning the Shoplook-Occasion dataset, which played a vital role in the success of this research.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Vo, A.H., Le, T.B.T., Pham, H.V. et al. An efficient framework for outfit compatibility prediction towards occasion. Neural Comput & Applic 35, 14213–14226 (2023). https://doi.org/10.1007/s00521-023-08431-1