1 Introduction

Recently, computer vision has been broadly applied to various problems in the fashion industry, for example, fashion attribute prediction Liu and Lu [23]; Zhang et al. [47], item matching Song et al. [34], fashion design Sbai et al. [32]; Yu et al. [46], virtual try-on Han et al. [10]; Ge et al. [8]; Choi et al. [4]; Li et al. [18], trend forecasting Al-Halah et al. [1], fashion captioning Bao et al. [2], fashion style models Simo-Serra and Ishikawa [33]; Takagi et al. [37]; Veit et al. [42]; Hsiao and Grauman [13], and clothing category classification Wang et al. [44]; Liu et al. [26]. Among them, the most popular fashion application is item recommendation Liu et al. [24]; Hu et al. [15]; Panagiotakis et al. [30]; Hou et al. [12], where the objective is to suggest items to customers based on the customers' and/or society's preferences. Another study focused on automatic capsule wardrobe generation Hsiao and Grauman [14], which generates an outfit from the garments already in a customer's wardrobe.

Most of the previous works focused on addressing item-based recommendation Liu et al. [24]; Lin et al. [21]; Hou et al. [12]; Lu et al. [29], or outfit-based recommendation conditioned on the criteria of compatibility or versatility Han et al. [9]; Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]; Veit et al. [42]; Tangseng and Okatani [39]; Sufang et al. [35]. These works address only whether the items in an outfit match well together, regardless of other factors such as weather or occasion.

On the other hand, another interesting problem in fashion item recommendation is how to select an outfit that is relevant or appropriate for a certain occasion. Not everyone is capable of selecting an appropriate outfit for a particular occasion, because this requires basic knowledge about how well the items in an outfit combine. For example, a wedding party calls for an outfit that looks formal, unlike occasions such as casual wear or weekends; a red dress with high heels is closely related to a wedding party, whereas a red dress with sneakers is not. Thus, the relationships between the items in an outfit deserve attention in fashion item recommendation. Several prior works, as mentioned in Liu et al. [24], have considered item recommendation based on specific occasions; however, they only focused on the relation between each pair of items (top - bottom) rather than the combination of the many items an outfit usually contains.

To deal with the existing shortcomings of the two approaches mentioned above, we propose a method for outfit compatibility based on specific occasions, in which each outfit includes a combination of many items rather than only top and bottom items. In addition, a major challenge of the fashion-item recommendation problem is the similarity in visual characteristics among occasions, which makes correct recognition and prediction difficult. Casual and weekend outfits usually share several similar items, and there are ambiguities among occasions such as parties, cocktail parties, and dating. To solve this problem, we propose an auxiliary classifier branch, including a global average pooling layer to capture the global visual features of an entire outfit and a softmax layer to classify specific occasions.

Another challenge of the outfit compatibility and fill-in-the-blank problems is the variable number of items in an input outfit, i.e., a bag of item images of varying size. Therefore, a system is needed that can handle input outfits with different numbers of items and can also be trained end-to-end. The combination of bidirectional long short-term memory (Bi-LSTM) and visual semantic embedding (VSE) is an efficient way to address this problem.

In addition, in order to train the model, a dataset is needed that provides sufficient occasion information, consisting of item product images and metadata for each item in an outfit. However, to the best of our knowledge, no such occasion-oriented fashion dataset exists, which is an important reason the outfit compatibility and fill-in-the-blank problems remain challenging. Thus, to address this problem, our first step was to build a fashion dataset related to occasions. The main contributions of this study are threefold.

First, we collected a fashion dataset organized by specific occasions, namely the Shoplook-Occasion dataset. This dataset provides the essential information for each outfit needed to address the outfit compatibility problem and the fill-in-the-blank problem conditioned on specific occasions.

Second, we adopt an efficient framework for outfit compatibility, inspired by Han et al. [9]. Our framework utilizes a VSE space to capture the semantics between the visual and text embeddings. In addition, the system is combined with a Bi-LSTM to learn the complicated relationships between fashion items in both directions, from top to bottom and vice versa. Moreover, we propose an auxiliary classification branch to address the occasion-based outfit compatibility and fill-in-the-blank problems. To extract visual features, we adopted the residual architecture He et al. [11], which provides more essential information for each item than the Inception architecture Szegedy et al. [36]. We utilized a pre-trained residual architecture model as the image embedding to improve model performance.

Finally, experiments were conducted on the collected dataset, and the results indicate that the proposed model, combining the auxiliary classification branch with VSE and Bi-LSTM, outperforms the other methods on both general and occasion-specific outfit compatibility.

The rest of this work is organized as follows: Sect. 2 presents some previous related works. The details of the proposed framework are described in Sect. 3. In the next part (Sect. 4), we conduct experiments on the newly collected Shoplook-Occasion dataset, and a comparison between our proposed method and other approaches is presented. Finally, conclusions and future work are specifically discussed in Sect. 5.

2 Related works

The outfit compatibility prediction and fill-in-the-blank problems have been addressed by different approaches, such as sequence-based approaches Han et al. [9]; Lin et al. [21]; Tangseng and Okatani [39], metric learning-based approaches Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]; Veit et al. [42]; Hou et al. [12], graph-based approaches Cucurull et al. [5]; Cui et al. [6], and other approaches Liu et al. [24]; Tangseng et al. [40]; Hsiao and Grauman [13]; Li et al. [20]; Song et al. [34]; Li and Xu [19]; Lu et al. [28]; Lu et al. [29]. The following paragraphs describe each approach in more detail.


The metric learning-based approach focuses on measuring the similarity between images in a single similarity context. Specifically, the images are projected into a common embedding space, where the similarity between objects can be measured with a suitable distance and loss function. Common loss functions include the hinge loss Weinberger et al. [45] and the triplet loss Vasileva et al. [41]; Lai et al. [17]; Tan et al. [38]. Building on the Conditional Similarity Networks (CSN) of Veit et al. [43], Vasileva et al. [41] proposed an image embedding method that respects item type, learning type-aware embeddings for outfit compatibility. In their study, a total of 66 conditional subspaces were learned, one for each category pair (for example, shoes-tops, bottoms-hats, tops-bottoms, bottoms-shoes). However, one limitation of this approach is that it does not consider different types of visual features. Therefore, several other studies have attempted to overcome this barrier by learning disentangled representations that capture different notions of similarity, supervised by predefined similarity conditions. Lin et al. [22] introduced a new framework that includes a category-based attention mechanism, which depends only on the item categories, and a new outfit ranking loss function. Tan et al. [38] improved the model of Vasileva et al. [41] by learning shared subspaces along with the importance of each subspace. Lai et al. [17] extended the outfit compatibility prediction model of Vasileva et al. [41] by proposing a theme-attention model over a category-specific embedding space. Hou et al. [12] introduced a solution that leverages the semantics of visual attributes to train convolutional neural networks (CNNs) that learn an attribute-specific subspace for each attribute, yielding disentangled representations.
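As a concrete illustration of this family of objectives, the following PyTorch sketch implements a generic triplet loss; the function name and toy tensors are our own, and this is not the exact formulation of any cited work.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the compatible (positive) item
    closer to the anchor than the incompatible (negative) item,
    by at least `margin` in Euclidean distance."""
    d_pos = F.pairwise_distance(anchor, positive)  # distance to compatible item
    d_neg = F.pairwise_distance(anchor, negative)  # distance to incompatible item
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: a batch of 16 triplets of 512-D embeddings.
a, p, n = (torch.randn(16, 512) for _ in range(3))
loss = triplet_loss(a, p, n)
```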


A graph-based approach has also been utilized for outfit compatibility prediction, as in Cucurull et al. [5]; Cui et al. [6]. The idea of this approach is to exploit context: items are known to be compatible with other items, and a graph neural network (GNN) learns to generate item embeddings conditioned on that context. Specifically, outfit compatibility is cast as an edge prediction problem consisting of an encoder phase and a decoder phase: the encoder computes a new embedding for each item depending on its connections, and the decoder predicts the compatibility score of two items. However, one notable shortcoming of this approach is the high computational cost of leveraging test-time graph connections when new items are introduced into the catalog (i.e., the testing set).
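The encoder/decoder idea can be sketched as follows; this is a deliberately minimal, hypothetical model (one round of neighbourhood averaging and a dot-product edge score), not the exact architecture of Cucurull et al. [5].

```python
import torch
import torch.nn as nn

class GraphCompatibility(nn.Module):
    """Minimal encoder/decoder sketch: the encoder updates each item
    from the mean of its neighbours; the decoder scores an edge."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj, i, j):
        # Encoder: one round of neighbourhood-mean message passing.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.proj(adj @ x / deg))
        # Decoder: compatibility score of items i and j as a dot product.
        return torch.sigmoid((h[i] * h[j]).sum(-1))

model = GraphCompatibility()
x = torch.randn(10, 512)                  # 10 item embeddings
adj = (torch.rand(10, 10) > 0.7).float()  # toy adjacency matrix
score = model(x, adj, i=0, j=3)
```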


The sequence-based approach draws inspiration from sequence models, which have been widely applied in natural language processing and sequential recommendation Lonjarret et al. [27]. Han et al. [9] considered every item in an input outfit as a word in a sentence and utilized a Bi-LSTM to learn the relationships between the items of an outfit in both directions. Meanwhile, Lin et al. [21] used a gated recurrent unit (GRU) along with an attention mechanism, including mutual attention and cross-modality attention, for predicting compatibility and generating comments.


Other approaches. Several other studies have applied different methods to address compatibility problems Liu et al. [24]; Tangseng et al. [40]; Hsiao and Grauman [13]; Li et al. [20]; Song et al. [34]; Li and Xu [19]. Liu et al. [24] introduced a novel method based on mined occasion-attribute and attribute-attribute matching rules. Hsiao and Grauman [13] considered the compatibility task as a subset selection problem. Specifically, they built submodular objective functions to capture the key ingredients of visual compatibility, versatility, and user preference; an unsupervised approach was then utilized to learn visual compatibility from real-world data. Tangseng et al. [40] proposed a deep learning-based method to compute a grading score of outfit compatibility conditioned on fixed input items. Similarly, Song et al. [34]; Li et al. [20] also used cross-modality inputs for deep learning to address this problem. Lu et al. [28] proposed a framework that learns binary codes for efficient personalized fashion outfit recommendation. Lu et al. [29] introduced a new solution for personalized outfit recommendation using a stacked self-attention mechanism to model the high-order interactions among items.

In this study, we focused on investigating two main approaches that have recently attracted an increasing number of studies: the metric learning-based approach and the sequence-based approach. We consider these two approaches as the baseline methods for comparison with our proposed method, which is presented in more detail in the next section.

3 Proposed method

Fig. 1: The overview of our proposed framework, which is trained end-to-end as a single model. First, the input items of an outfit are passed into a residual network (Resnet) to extract the visual features. At the same time, their descriptions are converted to one-hot vectors. The outputs of this processing are fed into the visual semantic embedding (VSE) space. Simultaneously, the visual features extracted by Resnet are passed through a bidirectional LSTM (Bi-LSTM) to learn the relationships between the items of an outfit in both directions. Moreover, to deal with outfit compatibility with respect to specific occasions, we propose an auxiliary classifier branch to learn disentangled representations of each occasion. The auxiliary classifier has one global average pooling layer, one fully connected layer, and one softmax function to score each occasion

Recently, to deal with the outfit compatibility prediction and fill-in-the-blank problems, most approaches Tangseng et al. [40]; Li et al. [20]; Han et al. [9]; Vasileva et al. [41] have focused mainly on esthetics rather than specific occasions. In this paper, we introduce an efficient framework to address both the outfit compatibility prediction and the fill-in-the-blank problem with respect to specific occasions. The proposed framework comprises three major parts. In the first part, a Bi-LSTM is utilized to capture the compatibility relationships among the fashion items of the entire outfit. Next, a visual semantic embedding space is built to learn the similarity between each fashion item and its text description. The last part is an auxiliary classification branch, which is proposed to score outfit compatibility on specific occasions. In particular, the input items of an outfit are passed into a residual network to extract visual features, where each visual feature has 2048 dimensions. Simultaneously, the descriptions of these items are converted to one-hot vectors. The outputs of this processing are fed into the visual semantic embedding (VSE) space, where each visual feature and the one-hot vector of its description are embedded with the same dimension of 512. In addition, the 2048-dimensional visual features are passed through a bidirectional LSTM to learn the relationships between the items of an outfit in both directions (top-to-bottom and vice versa). An auxiliary classifier branch, which includes one global average pooling layer, one fully connected layer, and one softmax function, is proposed to learn disentangled representations of each occasion. This auxiliary classifier branch is able to categorize outfit compatibility with respect to specific occasions. The overview of our proposed method is presented in Fig. 1. These components are detailed as follows.

3.1 Outfit compatibility learning

Inspired by Han et al. [9], we consider the entire outfit, a set of fashion items, as a sentence with a sequence of words. Treating the input items as a sequence, as in natural language processing, a Bi-LSTM is adopted as a key component to capture the complicated relationships among fashion items in both the top-to-bottom and bottom-to-top directions. Bi-LSTM is an extended version of long short-term memory (LSTM). Thanks to the gated structure of its cell state, with input, forget, and output gates, LSTM, as well as Bi-LSTM, is able to mitigate the major challenges of sequence-dependence problems, including the vanishing gradient and long-term dependency problems, which cannot be solved by a traditional RNN. In addition, combining the outputs of the forward and backward passes of the LSTM plays a vital role in preserving the essential features of the entire outfit. More specifically, a fashion image sequence \(X = \{x_1, x_2,\dots,x_N\}\) is passed into the first module with Bi-LSTMs, where \(x_t\) is the feature representation of the t-th fashion item in the sequence, extracted by a CNN model. On each hidden state \(h_t\), the output of the LSTM Han et al. [9], a softmax function is applied. This function calculates the probability of the next fashion item \(p(x_{t+1} \mid x_1, x_2,\dots,x_t)\) conditioned on the previously seen items, and the probability of the previous fashion item \(p(x_{t} \mid x_N,\dots,x_{t+1})\) conditioned on the following items, where \(x_0\) and \(x_{N+1}\) are two zero vectors added to X so that the Bi-LSTM can determine when to stop predicting the next item. The objective function of the forward direction in the Bi-LSTM is:

$$\begin{aligned} \mathcal {L}_{f} = -\frac{1}{N}\sum _{t=1}^N{\log (p(x_{t+1} \mid x_1,\dots,x_{t}))} \end{aligned}$$
(1)

and the objective function of the backward direction in Bi-LSTM is:

$$\begin{aligned} \mathcal {L}_{b} = -\frac{1}{N}\sum _{t=N-1}^0{\log (p(x_{t} \mid x_N,\dots,x_{t+1}))} \end{aligned}$$
(2)

Finally, the Bi-LSTM loss function is calculated by summing \(\mathcal {L}_{f}\) and \(\mathcal {L}_{b}\) as follows:

$$\begin{aligned} \mathcal {L}_{\text {Bi-LSTM}} = \sum _{X}\left( \mathcal {L}_{f} + \mathcal {L}_{b}\right) \end{aligned}$$
(3)
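To make Eqs. (1)-(3) concrete, the sketch below scores the true next and previous items against a candidate pool with a softmax. It is a minimal illustration in which we assume, for simplicity, that the pool is the outfit's own items plus the zero stop vector; in practice the candidate set can be drawn differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lstm = nn.LSTM(input_size=512, hidden_size=512,
               batch_first=True, bidirectional=True)

def bilstm_loss(feats):
    """Sketch of Eqs. (1)-(3) for one outfit. `feats`: (N, 512) item
    features ordered top-to-bottom. Next/previous items are scored
    against the outfit's own items plus a zero 'stop' vector."""
    N = feats.size(0)
    zero = feats.new_zeros(1, 512)
    x = torch.cat([zero, feats, zero]).unsqueeze(0)   # add x_0 and x_{N+1}
    h, _ = lstm(x)                                    # (1, N+2, 1024)
    h_fwd, h_bwd = h[0, :, :512], h[0, :, 512:]
    pool = torch.cat([feats, zero])                   # candidates; index N = stop
    # Forward (Eq. 1): the state at x_t predicts x_{t+1}.
    logits_f = h_fwd[:-1] @ pool.t()
    tgt_f = torch.arange(N + 1)                       # x_1..x_N, then stop
    # Backward (Eq. 2): the state at x_{t+1} predicts x_t.
    logits_b = h_bwd[1:] @ pool.t()
    tgt_b = torch.cat([torch.tensor([N]), torch.arange(N)])  # stop, then x_1..x_N
    return F.cross_entropy(logits_f, tgt_f) + F.cross_entropy(logits_b, tgt_b)

loss = bilstm_loss(torch.randn(4, 512))               # an outfit with 4 items
```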

3.2 Visual-semantic embedding space

The multi-modal embedding space of texts and images, also known as the visual semantic embedding (VSE) space, plays an important role in learning the semantic relations between texts and images. VSE has the capability to deal with not only computer vision tasks but also natural language processing tasks. To train a VSE space, a fashion item image from an outfit is projected into the VSE space by passing the image representation \(x \in \mathbb {R}^{2048 \times 1}\) through the image embedding matrix \(W_f \in \mathbb {R}^{512 \times 2048}\): \(f = W_fx\). In addition to the image embedding process, the text description of each fashion item image is also projected into the VSE space. The input text \(S=\{w_1, w_2,\dots,w_M\}\) is first converted to one-hot vectors \(e = \{e_1, e_2,\dots,e_M\}\) before being multiplied by the word embedding matrix \(W_T \in \mathbb {R}^{512 \times 5180}\). Specifically, the word \(w_i\) is represented by the i-th one-hot vector \(e_i\) and then transformed to \(v_i = W_Te_i\). Finally, the set of embedding vectors \(v = \{v_1, v_2,\dots,v_M\}\) is aggregated in a bag-of-words fashion to obtain \(\hat{v}\), computed as the average of v. A contrastive loss function \(\mathcal {L}_\mathrm{{vse}}\) is utilized to learn the similarity relation between the image embedding vector and its text embedding vector, defined as follows:

$$\begin{aligned} \mathcal {L}_{\mathrm{vse}} = \sum _{f}\sum _{k}\max (0,\xi - \delta (f,\hat{v}) + \delta (f,v_{k})) + \sum _{v}\sum _{k}\max (0,\xi - \delta (\hat{v},f) + \delta (\hat{v},f_{k})) \end{aligned}$$
(4)

where \(\delta (i,j)\) is the cosine similarity between image i and description j, \(v_k\) is a non-matching description for image f, and \(f_k\) is a non-matching image for description \(\hat{v}\). The \(\mathcal {L}_\mathrm{{vse}}\) function is minimized when the cosine similarity between f and its matched description embedding \(\hat{v}\) exceeds the similarity to any unmatched description \(v_k\) by the margin \(\xi\), as in Eq. 4.
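A minimal sketch of Eq. (4), assuming (for illustration) that the non-matching descriptions \(v_k\) and images \(f_k\) are simply the other examples in a batch:

```python
import torch
import torch.nn.functional as F

def vse_loss(img, txt, margin=0.2):
    """Contrastive loss of Eq. (4) over a batch of matched
    (image, description) embedding pairs, each of shape (B, 512)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                                  # sim[i, j] = cosine(f_i, v_j)
    pos = sim.diag()                                     # matched-pair similarities
    cost_img = F.relu(margin - pos.unsqueeze(1) + sim)   # image vs wrong descriptions
    cost_txt = F.relu(margin - pos.unsqueeze(0) + sim)   # description vs wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool)      # drop the matched pairs
    return cost_img.masked_fill(mask, 0).sum() + cost_txt.masked_fill(mask, 0).sum()

loss = vse_loss(torch.randn(8, 512), torch.randn(8, 512))
```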

3.3 Auxiliary classification branch

Most existing models consider only esthetic outfit compatibility. In this paper, we propose an auxiliary classification branch that addresses outfit compatibility on specific occasions. The auxiliary classification structure consists of three parts. First, a global average pooling layer captures the abstract visual features from the stacked visual features \(X = \{x_1, x_2,\dots,x_N\}\), extracted by the residual network, to form a global feature vector. Next, the output of this layer is passed through a fully connected layer to obtain the logit values. The last part is a softmax function for computing the probability of each occasion. To train the auxiliary classification branch, we minimize the cross-entropy loss function as follows:

$$\begin{aligned} \mathcal {L}_\mathrm{{auxiliary}} = -\sum _{k=1}^{C}{y^k\log (\hat{y}^k)} \end{aligned}$$
(5)

where \(y^k\) is the ground truth label, \(\hat{y}^k\) denotes the predicted probability, and C is the number of occasions, here the eight occasions casual, cocktail party, dating, party, school, wedding guest, weekend, and work from the Shoplook-Occasion dataset. The details of this dataset are presented in the next section.
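A sketch of this branch, with the global average pooling taken over the stacked per-item features; the module name is ours, and the class list follows the Shoplook-Occasion occasions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryBranch(nn.Module):
    """Global average pooling over the outfit's item features,
    then a fully connected layer scoring the eight occasions."""
    OCCASIONS = ["casual", "cocktail party", "dating", "party",
                 "school", "wedding guest", "weekend", "work"]

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, len(self.OCCASIONS))

    def forward(self, item_feats):           # (N, 2048) ResNet features
        pooled = item_feats.mean(dim=0)      # global average pooling over items
        return self.fc(pooled)               # logits; softmax is applied in the loss

branch = AuxiliaryBranch()
logits = branch(torch.randn(5, 2048))        # an outfit with 5 items
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([2]))  # Eq. (5)
```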

In the last stage of this framework, we incorporate the Bi-LSTM and the visual semantic embedding (VSE) model with the aim of improving the effectiveness of our proposed model by providing both fashion compatibility information (from the Bi-LSTM) and visual semantic information (from the VSE). In addition, the auxiliary classification branch is added to address the outfit compatibility problem and the fill-in-the-blank problem with respect to specific occasions. The final objective function is as follows:

$$\begin{aligned} \mathcal {L} = \lambda \mathcal {L}_{\mathrm{vse}} + \beta \mathcal {L}_{\mathrm{auxiliary}} + \gamma \mathcal {L}_{\text {Bi-LSTM}} \end{aligned}$$
(6)

where \(\lambda , \beta\) and \(\gamma\) are hyperparameters that weight the contribution of each loss term.
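In code, the final objective is simply a weighted sum of the three losses sketched above; the default weights below reflect the values reported in Sect. 4.2.

```python
def total_loss(loss_vse, loss_aux, loss_bilstm,
               lam=1.0, beta=1.0, gamma=1.0):
    """Eq. (6): weighted sum of the VSE, auxiliary, and Bi-LSTM
    objectives; the whole model is trained end-to-end on this value."""
    return lam * loss_vse + beta * loss_aux + gamma * loss_bilstm
```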

4 Experiments

4.1 Shoplook-Occasion dataset

Fig. 2: The distribution of the training, validation, and test sets of the Shoplook-Occasion dataset

Fig. 3: Some samples of outfits and corresponding occasion labels in the Shoplook-Occasion dataset

Normally, an outfit consists of about eight common parts: top, bottom, all-body, outerwear, shoes, jewellery, bag, and accessories. Accordingly, we collected the corresponding garments for each outfit, as presented in Table 1. In this study, we collected outfits with specific occasions from Shoplook.io to create a new fashion dataset, called the Shoplook-Occasion dataset. Shoplook.io is a social commerce website that provides important information such as item categories, titles, and many different occasion labels. However, outfits crawled from this website contain considerable noise in both the visual and the textual data. Thus, it is necessary to clean the crawled dataset.

The data cleaning process is a two-stage procedure, as follows. First, a machine learning model is built to automatically classify the semantic category of each item based on the existing meta-data; its purpose is to remove images that do not have corresponding item categories. At the end of this step, the dataset contains many outfits, each with a list of items and their corresponding semantic categories and descriptions. Table 1 shows the number of items in each semantic category of the Shoplook-Occasion dataset. However, since an outfit may carry many occasion labels, it is necessary to select the most appropriate occasion for each outfit. In the second stage, the validation step, a number of experienced annotators are asked to label the occasion of each outfit to improve the quality of the dataset obtained in the previous stage. For choosing the number of annotators, we referred to previous works Liu et al. [25]; Bourdev et al. [3]; Duran et al. [7]. In Liu et al. [25], the authors collected the WOW (What to Wear) dataset for occasion-oriented clothing recommendation with five annotators. Meanwhile, Bourdev et al. [3] built a dataset describing people with attributes such as gender, hair style, and types of clothing, under the large variation in viewpoint, pose, articulation, and occlusion typical of personal photo album images; similar to Liu et al. [25], they also asked five annotators to label all attributes. Recently, Duran et al. [7] aimed to capture both the semantic and syntactic structure of dialogue, with the labeling undertaken by 15 novice annotators. Based on these considerations, to construct the Shoplook-Occasion dataset, we asked 20 experienced annotators to choose the right label for each outfit. All of these annotators have prior knowledge about fashion, style, and outfits. Specifically, we utilized the tag information from users on the Shoplook.io website to obtain candidate occasions, and asked the annotators to rank these occasions from 1 to 3 for each outfit. Finally, we chose a single occasion label for each outfit, namely the occasion ranked highest across all annotators. This validation step further improves the quality of the results obtained from the automatic classification stage by exploiting the knowledge and expertise of the experienced annotators. In our view, 20 experienced annotators are sufficient to validate the quality of our dataset when combined with the automatic classification of the previous step.

Table 1 Statistics of the semantic categories in the Shoplook-Occasion dataset

At the end of the process, we obtained a total of 8 occasions, 4,752 outfits, and 16,968 unique items. The dataset contains many outfits, where each outfit has exactly one occasion label and a list of corresponding items. In addition, each item is labelled with its semantic category, image, and description. This makes the dataset suitable for the outfit compatibility problem conditioned on specific occasions. Figure 3 shows some outfit samples with specific occasion labels in the Shoplook-Occasion dataset.

Next, we split the Shoplook-Occasion dataset into three disjoint subsets (training, validation, and testing sets) based on a modification of the method of Tangseng et al. [40]. The training and validation sets were utilized to learn and optimize the hyperparameters of the trained model, whereas the test set was used to evaluate the learned model.

Table 2 Number of outfits and items in the Shoplook-Occasion dataset, used in our experiments as the training set (first row), validation set (second row), and test set (last row). Note that no item is shared between sets

A tripartite graph is built as in Algorithm 1, where each item is considered a vertex and the edges represent the relationships between the items in each outfit. The disjoint-set sampling algorithm guarantees that no item in one set is shared with another set, which removes duplicated items across the splits. Finally, we obtained 2,967 outfits for the training set, 889 outfits for the validation set, and 896 outfits for the testing set. More details about the dataset split are shown in Table 2, and the distribution of each occasion label in each set is illustrated in Fig. 2.
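The disjoint-set idea can be sketched with a union-find over items: outfits that share any item fall into one connected component, and whole components are assigned to a split. This is our own illustration of the principle, not the authors' Algorithm 1; the greedy assignment and the `ratios` argument are assumptions.

```python
from collections import defaultdict

def disjoint_split(outfits, ratios=(0.6, 0.2, 0.2)):
    """Group outfits into components (outfits sharing any item are
    linked), then assign whole components to train/val/test so that
    no item ever crosses sets. `outfits`: list of item-id lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for items in outfits:                     # link all items of one outfit
        for it in items[1:]:
            union(items[0], it)

    groups = defaultdict(list)                # component root -> outfit indices
    for idx, items in enumerate(outfits):
        groups[find(items[0])].append(idx)

    splits = ([], [], [])
    bounds = (ratios[0], ratios[0] + ratios[1])
    seen = 0
    for comp in groups.values():              # greedily fill each split
        frac = seen / max(len(outfits), 1)
        bucket = 0 if frac < bounds[0] else (1 if frac < bounds[1] else 2)
        splits[bucket].extend(comp)
        seen += len(comp)
    return splits                             # (train_idx, val_idx, test_idx)

train, val, test = disjoint_split([[1, 2], [2, 3], [4, 5], [6]])
```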

4.2 Implementation details

Inspired by previous works Tangseng et al. [40]; Vasileva et al. [41]; Han et al. [9]; Tan et al. [38], we address the fashion outfit compatibility prediction task and the fill-in-the-blank (FITB) task with respect to specific occasions. For the fashion outfit compatibility task, a candidate item is combined with other items, and the combination is scored for compatibility on specific occasions. For the fill-in-the-blank task, a sequence of fashion items is provided with one item missing at a random position; we need to choose, from a list of multiple choices, the item that is compatible with the other existing items for the specific occasion. To the best of our knowledge, no existing dataset supports these tasks in occasion-based compatibility scenarios. Thus, we created a fill-in-the-blank dataset conditioned on occasions using all the outfits in the test set of Shoplook-Occasion. Following the method in Han et al. [9], for each outfit we randomly chose one item and replaced it with a blank. Then, we selected three items from other outfits with the same semantic category as the replaced item, forming a set of four answers: the three selected items and one ground truth item. In addition, we created 844 incompatible outfits by selecting fashion items from the test set for the fashion outfit compatibility task.
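A sketch of this question construction; the item dictionaries and their 'category' field are a hypothetical schema, and we assume the distractor pool holds at least three same-category items.

```python
import random

def make_fitb_question(outfits, idx, n_distractors=3):
    """Blank one random item of outfit `idx` and draw same-category
    distractors from the other outfits."""
    outfit = outfits[idx]
    blank_pos = random.randrange(len(outfit))
    answer = outfit[blank_pos]
    pool = [it for j, o in enumerate(outfits) if j != idx
            for it in o if it["category"] == answer["category"]]
    choices = random.sample(pool, n_distractors) + [answer]
    random.shuffle(choices)
    partial = outfit[:blank_pos] + outfit[blank_pos + 1:]  # outfit minus the blank
    return partial, blank_pos, choices, choices.index(answer)
```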

In this study, the experimental models were implemented in Python 3.7 with the PyTorch framework. The operating system was Ubuntu 16.04 LTS, on a machine with an Intel Core i7-4790K (4.0 GHz x 8 cores), 32 GB of RAM, and a GeForce GTX 1080 Ti.

To compare with other state-of-the-art methods Han et al. [9]; Tan et al. [38]; Vasileva et al. [41], we conducted the experiments reported in Tables 3 and 4. The results of the fashion compatibility prediction task and the fill-in-the-blank task are shown in Table 3, while Table 4 presents the results of the same tasks with respect to specific occasions. In the fashion compatibility prediction problem, a new outfit created by a user must be judged for compatibility with the specific occasion. Performance was evaluated using the AUC of the ROC curve over multiple classes. In addition, the accuracy metric was used to evaluate the fill-in-the-blank task, computed over each option-selection process: we choose the best answer by substituting the candidate items into the blank and computing the score for each occasion. For optimization, all experimental models were trained with the stochastic gradient descent (SGD) optimizer for 500 epochs. The other detailed parameters of each experiment are described below.
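For reference, the multi-class AUC and the option-selection accuracy can be computed as in the following scikit-learn sketch; the arrays are synthetic placeholders for model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: y_true holds occasion indices (0..7), y_score the
# per-occasion scores of a model, normalized to probabilities.
y_true = np.repeat(np.arange(8), 13)[:100]            # all 8 classes present
y_score = np.random.default_rng(0).random((100, 8))
y_score /= y_score.sum(axis=1, keepdims=True)

auc = roc_auc_score(y_true, y_score, multi_class="ovr")  # one-vs-rest AUC
accuracy = (y_score.argmax(axis=1) == y_true).mean()     # option-selection accuracy
```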

Table 3 Comparison of our method (last row) and the other state-of-the-art methods (first 11 rows) for the fashion outfit compatibility prediction problem on the Shoplook-Occasion dataset, with AUC in the second column and fill-in-the-blank accuracy in the last column. Note that Auxiliary is short for Auxiliary Classifier Branch
Table 4 Comparison of our method (last row) and the other state-of-the-art methods (first 7 rows) for the fashion outfit compatibility prediction problem with respect to occasions on the Shoplook-Occasion dataset, with AUC in the second column and fill-in-the-blank accuracy in the last column. Note that Auxiliary is short for Auxiliary Classifier Branch

CSN, T1:1 + VSE (HGLMM) + Sim + Metric + Auxiliary Branch. Following Vasileva et al. [41], we utilized an 18-layer ResNet for image embedding, and text embeddings were extracted using a word embedding model. We used the pre-trained HGLMM Fisher vector encoding Klein et al. [16], reduced to 600 dimensions using Principal Component Analysis (PCA). VSE is used to learn the compatibility between different categories of fashion items. In addition, a similarity branch is also learned to measure the similarity between fashion items of the same category.

In addition, a pairwise type-dependent transformation (following Vasileva et al. [41]) is used to measure the compatibility between two fashion items by projecting them into a type-specific space. A triplet loss function is also adopted by taking the element-wise product of the embedding vectors from the type-specific spaces and passing it into a fully connected layer, as in Vasileva et al. [41]. Moreover, we added an auxiliary classification branch suited to outfit compatibility both in general and on specific occasions.


CSN, T1:1 + VSE (GloVe) + Sim + Metric + Auxiliary Classification Branch. This experiment is similar to the previous one; however, for text feature extraction, we utilized the GloVe method Pennington et al. [31], trained on Wikipedia and Gigaword, instead of the HGLMM method. The reason is that GloVe considers both semantics and context, whereas the Fisher vector-based method encodes the gradients of the log-likelihood of the descriptors. Specifically, the textual data were encoded as 300-dimensional features. The hyperparameters were a learning rate of \(5\times 10^{-5}\), a batch size of 256, and a margin m of 0.2. For the VSE and \(l_1, l_2\) losses, the parameters were set to \(\lambda _1 = \lambda _2 = 5\times 10^{-5}\) and \(\lambda _3 = 5\times 10^{-3}\).


Bi-LSTM + VSE + Auxiliary Classification Branch. We performed an experiment similar to Han et al. [9], based on the sequence-based model. In particular, the Inception-v3 model is utilized as the image embedding to represent each input image as a 2048-dimensional vector. The image feature vector is then passed through one fully connected layer to reduce the feature dimension to 512. The resulting image feature is fed into the Bi-LSTM, with 512 hidden units and a dropout rate of 0.7. For the VSE, the dimension was set to 512, with the image embedding matrix \(W_I \in R^{2048 \times 512}\) and the word embedding matrix \(W_T \in R^{5180 \times 512}\), where 5180 is the vocabulary size of the Shoplook-Occasion dataset. The learning rate and margin m were initialized to 0.2, with the learning rate decayed by a factor of 2 every two epochs and a batch size of 10. The hyperparameters of the VSE (\(\lambda\)) and Bi-LSTM (\(\gamma\)) were set to \(\lambda = 1\) and \(\gamma = 1\), respectively.

However, the visual features extracted by the Inception-v3 model may lose information between the essential features of the current layer and the previous layer. Therefore, we propose to use the ResNet-50 model, which is capable of preserving essential information from the previous layer through the skip connections inside the ResNet-50 network. Moreover, we added an auxiliary classification branch to deal with fashion problems on specific occasions, with the hyperparameter \(\beta = 1\) used to weight the auxiliary classification branch.
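The image-embedding swap can be sketched as follows; the weights are left random here to keep the example self-contained, whereas the actual experiments use a pre-trained model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Drop ResNet-50's final classification layer so that each item image
# becomes a 2048-D feature vector, with the skip connections carrying
# lower-layer information forward.
resnet = models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-1])  # remove the final fc
backbone.eval()

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)        # one (normalized) item image
    feats = backbone(img).flatten(1)         # shape (1, 2048)
```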

The experimental results in Table 3 show that the proposed method outperforms the other methods on both the FITB task and the outfit compatibility prediction task. As shown in the first and second rows of Table 3, we address both tasks (fashion compatibility prediction and FITB) using the pairwise type-dependent approach of Vasileva et al. [41]: the original model with HGLMM text features, and a modified version using GloVe instead. The results indicate that replacing HGLMM with GloVe improves FITB accuracy by approximately 0.9%; however, the GloVe-based method decreases by approximately 0.04% compared with HGLMM on fashion outfit compatibility prediction. In addition, we adopted the sequence-based approach of Han et al. [9] and modified its image embedding by using the ResNet model instead of the Inception model; the results are presented in the next two rows of Table 3. This modification improves the performance of the model by 0.04% in AUC and 2.8% in accuracy on the two tasks, respectively. Moreover, these results indicate that the sequence-based approach performs better than the metric learning-based approach on an occasion dataset such as Shoplook-Occasion. Finally, to demonstrate the effectiveness of the auxiliary classifier branch, we conducted experiments adding it to both the sequence-based and the metric learning-based approaches. The results confirm the effectiveness of the auxiliary classifier, improving not only fashion outfit compatibility prediction but also the fill-in-the-blank task.

For the occasion-based fashion problems, Table 4 shows the overall comparison of the methods. The GloVe-based method obtains AUC scores of 0.69, 0.62, and 0.63 with visual feature dimensions of 512-D, 256-D, and 128-D, respectively. The GloVe-based method is also better than the HGLMM-based method by approximately 1-8% in AUC on the occasion-conditioned fashion outfit compatibility task. Meanwhile, the sequence-based approach also outperforms the pairwise type-dependent approach, with AUCs of 0.75 and 0.77, as shown in the last two rows. The experimental results in Table 4 demonstrate the effectiveness of our proposed model, which preserves lower-level visual features through the skip connections inside the ResNet architecture; this improves the AUC score by approximately 2% compared with the Inception network. In addition, Fig. 4 illustrates the performance of our proposed model on each occasion defined in the Shoplook-Occasion dataset. Overall, our proposed method outperformed the other methods on most occasions. However, a comparison between Fig. 4c and 4d shows that the wedding guest label is recognized at only 0.7 by our proposed method, approximately 0.05 lower than the result in Fig. 4c. For the fill-in-the-blank task with occasion-based methods, the best result is obtained by the GloVe-based method with a feature dimension of 128-D, while our proposed method and the HGLMM-based method achieve the same performance of 15.2.

Fig. 4: Compatibility AUC results for the fashion outfit compatibility prediction problem on the occasions casual, cocktail party, dating, party, school, wedding guest, weekend, and work of the Shoplook-Occasion dataset, using different methods: (a) GloVe + Sim + Metric (128-D) + Auxiliary Classifier Branch, (b) HGLMM + Sim + Metric (512-D) + Auxiliary Classifier Branch, (c) Inception-v3 + Bi-LSTM + VSE + Auxiliary Classifier Branch, and (d) Resnet50 + Bi-LSTM + VSE + Auxiliary Classifier Branch

In addition to AUC and accuracy, we evaluated all methods with respect to computational cost. We measured the running time of each method on the training and testing sets of the Shoplook-Occasion dataset; the results are presented in Table 5. Specifically, our proposed method takes 14,273.8 s for training and 57.3 s for testing; although this is longer than the GloVe-based and HGLMM-based methods, it is faster than the method based on Han et al. [9].

Table 5 The computational time in seconds of our proposed method (last row) and the other methods (rows 1, 2, and 3) on the Shoplook-Occasion dataset (Auxiliary = Auxiliary Classifier Branch)

Figures 5 and 6 illustrate the qualitative results of our method on both tasks. First, our method can predict a set of fashion items suitable for a specific occasion, as shown in Fig. 5; for example, the predicted score is 6.18 for the casual label in the first row of Fig. 5a. The green box shows the ground truth answer, and the red box shows the answer predicted by our method. Figure 5a shows successful cases and Fig. 5b illustrates failure cases of our method. Figure 6 shows the results of our method on the occasion-based fashion outfit compatibility task. In Fig. 6a, we show failure cases: for example, the casual label is wrongly predicted as the cocktail party label. Figure 6b presents cases that were correctly predicted by our method. Overall, the predicted scores show that the fashion outfit compatibility prediction and fill-in-the-blank tasks are difficult for certain occasions. Some occasions present similar visual characteristics in an outfit, such as casual and weekend, or cocktail party and party; therefore, the prediction scores are less discriminative among these occasions.

Fig. 5: Results of our method on the fill-in-the-blank task regarding occasions. The green boxes indicate the correct answers, while the red boxes indicate the predicted ones. The prediction score of the best choice is also presented in (a) and (b)

Fig. 6: Results of our method on the fashion outfit compatibility prediction task regarding occasions. The red text shows the labels predicted by our method, together with the prediction scores. (a) presents the unsuccessful cases and (b) depicts the successful cases

5 Conclusion

In this paper, we proposed a framework to address the outfit compatibility problem and the fill-in-the-blank problem with respect to specific occasions. Our framework adopts visual semantic embedding to capture the semantic relations between visual data and meta-data through a semantic metric. Bi-LSTMs are also utilized to learn the complicated relationships between the fashion items of an outfit in both the forward and backward directions. We further used an auxiliary classification branch to recognize outfit compatibility on specific occasions. Moreover, we collected an occasion-related fashion dataset from a social website, namely the Shoplook-Occasion dataset, which provides important information about occasion labels. Experiments were conducted on the collected dataset to compare the performance of our proposed model with that of other state-of-the-art methods. The experimental results show that our method achieves positive results on both the outfit compatibility problem and the fill-in-the-blank problem regarding occasions. Although our work achieves encouraging results, it also has some limitations. First, learning the parameters of each module is time-consuming. Second, due to limited human resources, the Shoplook-Occasion dataset is restricted in the number of outfits. Finally, some occasions present similar visual characteristics in an outfit, such as casual and weekend, or cocktail party and party.

In the future, we plan to extend the Shoplook-Occasion dataset by adopting semi-supervised learning to enrich the collected data. Moreover, an attention mechanism could be integrated into our method to better capture the relationships between fashion items. Last but not least, we also plan to tackle outfit complementary item retrieval for other, more complicated and difficult occasions.