Unsupervised Qualitative Scoring for Binary Item Features

Binary features, such as categories, keywords, and tags, are widely used to describe product properties. However, these features are incomplete in that they lack several aspects of numerical information. The qualitative score of a tag describes which product is better in terms of the given property. For example, on a restaurant navigation site, properties such as mood, dishes, and location are given numerical values representing the goodness of each aspect. In this paper, we propose a novel approach to estimate qualitative scores from the binary features of products. Based on the natural assumption that an item with a better property is more popular among users who prefer that property, in short, "experts know best," we introduce both discriminative and generative models with which user preferences and item qualitative scores are inferred from user--item interactions. We constrain the space of the item qualitative score by the item binary features so that the score of an item for a tag can be nonzero only when the item has that tag. This approach contributes to resolving the following difficulties: (1) no supervised data for score estimation, (2) implicit user purpose, and (3) irrelevant tag contamination. We evaluate our models using two artificial datasets and two real-world datasets of movie and book ratings. With the artificial datasets, we evaluate the performance of our models under sparse-transaction and noisy-tag settings. We also evaluate our models' resolution for irrelevant tags using the real-world movie-rating dataset and observe that our models outperform a baseline model. Finally, tag rankings obtained from the real-world datasets are compared with those of a baseline model.


Introduction
Keywords, tags, and categories are widely used to describe product properties. Most e-commerce services, such as Amazon, Alibaba, and eBay, use categories and tags for item filtering. In social bookmarking and recommendation services such as Delicious, Last.fm, and MovieLens, tags are sometimes annotated by users. Even without explicit tags, we unconsciously infer item tags from side information, i.e., product names, explanation texts, reviews, and package designs.
However, such binary expression is incomplete because it does not contain three aspects of numerical information: quantity, relevance, and quality. Quantity represents the strength of a tag in a certain unit. For example, tags such as low-calorie, light, and hot lose quantitative information such as 50 kcal, 100 g, and four degrees out of five. With relevance, on the other hand, one can see which tag best describes a product that does not have a clear unit as with quantity. For example, a picture of a dog in a park may have the tags "dog" and "park." However, if that picture includes a passerby in the background, it may also have the "man" tag. In that case, the relevance of "dog" and "park" should be higher than that of "man" because the subject of the picture is "dog in a park." Finally, quality provides information about the goodness of products from the viewpoint of each property. One typical example is the food, location, and mood properties of resort hotels. We often refer to this information in the form of a qualitative score for comparison. We stress the difference between the qualitative score and the other two types of numerical information. Let us consider a protein bar containing high-quality protein. In contrast to the quantitative information, the qualitative score answers "which product contains better protein?," not "how much protein does the product contain?," because the quality of the protein does not directly correspond to its amount. The relevance score is also not directly interpreted as the qualitative score: even for a low-purity protein bar, the tag "protein" is relevant to that product. Qualitative information is particularly useful for manufacturers and promoters, who can learn how attractive their product is to customers compared with their competitors. In this paper, we focus on estimating the qualitative score for each binary item feature.
One difficulty for qualitative score estimation is that supervised data are usually unavailable. The qualitative score can be estimated by comparing similar products, i.e., by asking "which hotel is better from the viewpoint of food?" However, a comprehensive survey is unrealistic. More seriously, since scoring is a much more difficult task than tagging, scoring is always affected by psychological biases and individual differences. From this point of view, we focus on a problem setting of unsupervised estimation.
Another difficulty arises when considering an unsupervised setting, because the purpose of the customer is usually implicit in most customer--product interactions such as ratings, purchases, views, clicks, and bookmarks. If user purposes or needs could be determined explicitly, a qualitative score could be obtained by ranking items for each user purpose. One representative example is query and purchase actions on an e-commerce site. A query represents customer demand for products, and the site's search engine filters products based on the query and lists them. Therefore, by considering the query as an item tag, one can interpret the popularity of each product as its qualitative score in terms of that query. However, in brick-and-mortar shopping, most customer purposes are implicit: we usually do not declare any purpose or query. Even in the e-commerce case, a typical purchase process is that one first filters items by category, query, or stars, reads the description of each item, and finally decides which item to buy. In this process, which aspect the user prefers is implicit.
We also stress that product features in real-world data often contain tags with low relevance, i.e., tags that do not adequately describe the product. One typical case is tagging web sites by keyword extraction. The extraction process always yields a number of irrelevant keywords from links, advertisements, or daily rankings, and one finds only 5-10 keywords that adequately describe the web content. One can also observe much low-relevance tagging by user annotation on social service sites. More interestingly, irrelevant tagging occurs when there is a gap between manufacturers and customers, i.e., when tag information provided by manufacturers does not correctly describe item features from the customer's point of view.
Although label enhancement has recently been addressed, few studies have focused on qualitative score estimation.
In the context of multi-labeled classification, Geng [1] proposed a label enhancement approach for binarized sample labels. Supervised learning for label distribution was developed in the earlier implementation, and a later unsupervised approach was proposed by Shao et al. [2]. They used a similarity assumption between the topology of the input features and that of the binary label distributions. However, these approaches mainly target the item relevance scores discussed above and do not directly compare label quality between items. On the other hand, item ranking is widely used in the context of recommender systems that suggest suitable items for specific users [3][4][5][6][7][8][9][10][11][12][13][14]. However, most studies focused on constructing a personalized item ranking for each user and did not provide information about the qualitative aspect of the apparent item binary features.
In this paper, we propose a novel approach for qualitative score estimation from binary and noisy item features, which contributes to resolving the difficulties mentioned above, i.e., (1) no supervised data, (2) implicit customer purpose, and (3) noisy tags. With this approach, we infer the qualitative score based on the assumption that an item with a better property is more popular among users who prefer that property; in short, "experts know best." One discriminative and two generative models are introduced with this approach, with which user preferences and item qualitative scores are inferred from user--item interactions. One novelty is that we constrain the space of the item qualitative score by the item binary features so that it can only have a numerical value according to the item's tags. We evaluate our models using two artificial and two real-world datasets. In the experiments on the two artificial datasets, we evaluate the performance of our models under sparse-transaction and noisy-tag settings. We also evaluate our models' resolution for irrelevant tags using the real-world dataset of movie ratings and observe that our models outperform a baseline model. Tag rankings obtained with our models from the two real-world datasets are compared with a baseline ranking model.
The organization of this paper is as follows: In Sect. 2, we review recent related work. In Sect. 3, we discuss the proposed approach and models. In Sect. 4, we present the minimization and inference procedures for our models. We present the experimental results in Sect. 5 and finally give a discussion and conclusion in Sect. 6.

Related Work
We infer both qualitative tag scores and user preferences by using user--item interactions such as ratings, purchases, views, clicks, and bookmarks. Collaborative filtering is a conventional approach for latent feature extraction from user--item interactions. With this approach, the user and item latent features are extracted through low-dimensional matrix factorization of the user--item interactions. Earlier studies focused on data with user ratings [3][4][5][6][7][8][9][10][11][12], and later studies developed approaches for implicit data such as clicks, bookmarks, and purchases [13,14]. Probabilistic approaches for feature extraction were also proposed using the framework of the topic model [15][16][17]. The topic model was first introduced as latent semantic analysis (LSA) [15] for document clustering, and a probabilistic extension was later proposed [16]. Latent Dirichlet allocation (LDA) [17] is a Bayesian extension of probabilistic LSA in which a document has a topic distribution and the terms of the document are generated by the term distributions of the topics. These models give interpretable latent topics via the word distribution of each topic. Many extensions have been proposed in terms of document relations [18], topic correlation [19], hierarchical structure of topics [20,21], and incorporating word meaning [22][23][24][25][26][27]. In collaborative LDA, Wang and Blei [28] proposed a model combining a topic model for item documents with matrix factorization for user--item interactions. One significant difference from our models is that their model uses rich item text, such as articles or reviews, to construct the topic distribution for item features. Our approach, in contrast, uses sparse, noisy binary tags for item features.
As mentioned in the introduction, the qualitative score relates to tag relevance. In the context of multi-label classification, label importance was introduced by Geng [1], with which a label distribution is inferred by supervised learning. Later, Shao et al. [2] proposed an unsupervised approach with which a numerical label distribution is inferred under a constraint of given binary labels, assuming similarity between the topology of the input features and that of the label distributions. Tag feature extraction has also been investigated in the context of tag recommendation, in which proper tags are recommended to annotators. The latent feature of a tag was constructed from a tensor decomposition of user--item--tag relations, enabling personalized tag recommendation [29,30].

Model Descriptions
In this section, we introduce our tag scoring models, in which the qualitative tag score is inferred from user--item interactions and binary item tags. We first give the notations and the problem, define the qualitative tag score, and finally describe our models. We propose three models: one is based on matrix factorization, and the other two are probabilistic realizations based on the topic model.

Problem Setting and Notations
We consider user--item interactions composed of a set of users U = {u_1, u_2, …, u_U} and a set of items I = {i_1, i_2, …, i_I}. Here, U and I also denote the numbers of users and items, respectively. We represent user--item interactions in the form of an adjacency matrix. For countable interactions, such as purchases and clicks, we consider Y ∈ ℕ^{U×I}, where Y_{u,i} = n if user u ∈ U interacts with item i ∈ I n times. For interactions with numerical values such as ratings, the adjacency matrix can be represented as Y ∈ ℝ^{U×I}, where Y_{u,i} = r if user u ∈ U gives a rating value of r to item i ∈ I. In both cases, Y_{u,i} = 0 implies a missing sample. For simplicity, we speak of user purchases of items. We also denote the set of items purchased by user u by Y_u = {i_{u1}, i_{u2}, …} and the set of all user--item interactions by Y = {(u, i) | i ∈ Y_u}. We define a set of tags K = {k_1, k_2, …, k_K} and express item features by a binary matrix of tags T ∈ {0, 1}^{K×I}. In our setting, we assume an item has several tags. The problem is to estimate the keyword score Φ ∈ ℝ^{K×I} given the observed Y and T, where Φ_{k,i} represents the qualitative score of item i in terms of tag k. Simple examples are given in Tables 1 and 2, where the item binary feature T and the qualitative score Φ correspond to Tables 1 and 2, respectively.
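As a toy illustration of this notation, the matrices Y, T, and the constrained score Φ can be set up as follows. This is a minimal sketch with arbitrary sizes and random values, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_keys = 4, 5, 3   # toy sizes for U, I, K

# Binary tag matrix T (K x I): T[k, i] = 1 iff item i has tag k.
T = rng.integers(0, 2, size=(n_keys, n_items))

# Count-valued interaction matrix Y (U x I): Y[u, i] = number of purchases.
Y = rng.poisson(0.7, size=(n_users, n_items))

# The unknown to estimate: a score matrix Phi (K x I), constrained so that
# Phi[k, i] can be nonzero only where T[k, i] == 1.
Phi = rng.random((n_keys, n_items)) * T
assert np.all(Phi[T == 0] == 0)
```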

Definition of Qualitative Score and Model Concept
To clarify the model concept, we first define the qualitative score for our approach. Our definition, in a nutshell, is that if an item has a tag with a higher qualitative score, then that item is more popular or attractive for users who prefer that property. In the following, we explain this concept in more detail and introduce two models derived from the definition. As mentioned in the introduction, the simplest definition can be obtained from manual scoring. However, since manual scoring is difficult and affected by psychological bias, scoring based on user--item interactions is preferable. One simple qualitative score is obtained from the popularity of items: by comparing item popularity tag by tag, we can construct tag rankings. However, this scoring gives the same score to every tag of one product. For example, a movie with fantasy and sci-fi tags has the same score for these two tags. We propose a user-preference-based score in which a higher tag score for an item implies it is more popular or attractive for users who prefer that property. Under this definition, the behavior of the experts of each tag determines the tag score; therefore, we describe this concept as the "experts know best" assumption. In the previous example, if a movie is more popular among those who like fantasy, a higher score is assigned to the fantasy tag than to the sci-fi tag. In other words, we assume that item purchases can be approximated by a combination of user preference and tag scoring. This definition naturally enables the derivation of an extension of matrix factorization. In this paper, we call this model the factorized feature scoring model (FFSM).
We can also define the score as a probability. In this definition, Φ_{k,i} represents the probability that item i is chosen in terms of tag k. Table 2 illustrates a simple example. We have three items that have tags of hot, curry, and healthy, and each column is normalized to 1 for the probabilistic definition. The score shows that when one wants something hot, Chili Chicken, Chicken Curry, and Beef Curry are chosen with probabilities of 0.3, 0.3, and 0.4. Similarly, when one wants curry, Chicken Curry and Beef Curry are chosen with probabilities of 0.4 and 0.6. Since Chili Chicken is not a curry, i.e., does not have the curry tag, its probability is zero. Using this definition, we model purchases by extending the topic model, which we call the probabilistic feature scoring model (PFSM). Note that in this definition, the item selection probability for a given tag is inferred from the selection behavior of users who prefer that tag; therefore, it is also based on the "experts know best" assumption.
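The Table 2 example can be reproduced numerically. The raw score values below are hypothetical placeholders chosen so that normalization yields the probabilities quoted above; only the constraint and normalization steps reflect the model's definition:

```python
import numpy as np

# Tags: hot, curry (rows); items: Chili Chicken, Chicken Curry, Beef Curry.
T = np.array([[1, 1, 1],    # all three dishes carry the "hot" tag
              [0, 1, 1]])   # only the two curries carry the "curry" tag

# Raw (unnormalized) attractiveness scores, zeroed outside the tag support.
raw = np.array([[3.0, 3.0, 4.0],
                [0.0, 4.0, 6.0]]) * T

# Probabilistic score: each tag's scores form a distribution over its items.
Phi = raw / raw.sum(axis=1, keepdims=True)
print(Phi[0])  # hot   -> [0.3 0.3 0.4]
print(Phi[1])  # curry -> [0.  0.4 0.6]
```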

Factorized Feature Scoring Model
Based on the first definition of the qualitative score, we develop a matrix factorization model. The basic idea of this model is that the matrix Y can be approximated by a product of low-rank matrices P and Q: Y_{u,i} ≈ Σ_{d=1}^{D} P_{u,d} (Q^T)_{d,i}, where P ∈ ℝ^{U×D}, Q ∈ ℝ^{I×D}, and D ≪ U, I. The rows of P represent the users' latent preferences, and the rows of Q represent the items' latent features. The notation D is the dimension of the latent space. The root mean squared error (RMSE) between Y and PQ^T is minimized under regularization of P and Q, which gives the objective function

L = Σ_{u,i} w_{u,i} ( Y_{u,i} − Σ_{d=1}^{D} P_{u,d} Q_{i,d} )^2 + λ ( Σ_{u,d} P_{u,d}^2 + Σ_{i,d} Q_{i,d}^2 ),

where P_{u,d} (Q_{i,d}) ∈ ℝ denotes the entry of P (Q) in the u(i)th row and dth column, and the hyperparameter λ controls the strength of the regularization. The weight w_{u,i} ∈ [0, 1] is introduced for missing samples and takes w_{u,i} = 1 if user u purchases item i (i.e., Y_{u,i} > 0) and a small value otherwise. We simply adopt the uniform weight [13].
With FFSM, we define a tag-constrained objective function by assuming that a product of the user preference Θ and the qualitative tag score Φ explains the purchase matrix Y:

L = Σ_{u,i} w_{u,i} ( Y_{u,i} − Σ_{k=1}^{K} Θ_{u,k} Φ_{k,i} )^2 + λ ( Σ_{u,k} Θ_{u,k}^2 + Σ_{k,i} Φ_{k,i}^2 ) + Reg.   (1)

In our approach, we minimize L under the nonnegativity constraint Θ_{u,k}, Φ_{k,i} ≥ 0. In contrast to the original matrix factorization, we let the dimension of the latent space D be equal to the dimension of the tag space, i.e., D = K, and introduce the tag constraint Reg:

Reg = c_∞ Σ_{k,i : T_{k,i} = 0} Φ_{k,i}^2.   (2)

Here c_∞ is a large constant that suppresses Φ_{k,i} at T_{k,i} = 0. In this paper, we adopt the hard constraint c_∞ = ∞, i.e., Φ_{k,i} = 0 if item i does not have tag k.
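The constrained objective above can be written as a short function. This is a minimal numpy sketch; `w0` and `lam` are illustrative names for the missing-sample weight and regularization hyperparameters:

```python
import numpy as np

def ffsm_loss(Y, Theta, Phi, T, w0=0.1, lam=0.01):
    """Weighted squared error of Y ~ Theta @ Phi plus L2 regularization,
    under the hard tag constraint (Phi is zero outside the support of T)."""
    W = np.where(Y > 0, 1.0, w0)   # uniform small weight on missing entries
    Phi = Phi * T                  # hard constraint: c_inf = infinity
    err = (W * (Y - Theta @ Phi) ** 2).sum()
    return err + lam * ((Theta ** 2).sum() + (Phi ** 2).sum())
```

As a quick sanity check, an exact constrained factorization attains zero loss when `lam=0`.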

Probabilistic Feature Scoring Model
We develop this model based on the topic model. In the topic model, a purchase is modeled by a product of the probability that a given user preference is active at this purchase and the item distribution with respect to the chosen preference:

P(i | u) = Σ_{z=1}^{D} P(z | Θ_u) P(i | z, Φ) = Σ_{z=1}^{D} Θ_{u,z} Φ_{z,i},   (3)

where Θ ∈ ℝ^{U×D}, Φ ∈ ℝ^{D×I}, and D is the number of topics. The rows of Θ and Φ represent the distributions of a user's latent preferences and an item's latent topics, respectively. The topic index z ∈ {1, 2, …, D} can be interpreted as the purpose of user u for purchasing item i and is sampled from the multinomial distribution of the user preference Θ: P(z | Θ) = Θ_{u,z}, and item i is chosen according to the multinomial distribution given by Φ:

P(i | z = k, Φ) = Φ_{k,i}.   (4)

In this model, we define the matrices Θ and Φ based on the tag space T. First, we let D equal the dimension of the tag space, i.e., D = K. Moreover, we impose the constraint that Φ_{k,i} is nonzero only if T_{k,i} ≠ 0, while satisfying Σ_i Φ_{k,i} = 1. By introducing this constraint from the keyword space, we can interpret each component Φ_{k,i} as the qualitative score defined in the previous subsection.
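The two-step generative draw, a purpose from Θ and then an item from the constrained Φ, can be sketched as follows; the matrices here are toy stand-ins, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_purchase(u, Theta, Phi):
    """Draw one purchase for user u: a purpose z ~ Mult(Theta[u]),
    then an item i ~ Mult(Phi[z])."""
    z = rng.choice(Theta.shape[1], p=Theta[u])
    i = rng.choice(Phi.shape[1], p=Phi[z])
    return z, i

# Toy setup: Phi is supported only where T == 1, rows sum to 1.
T = np.array([[1, 1, 0], [0, 1, 1]])
Phi = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]])
Theta = np.array([[0.8, 0.2]])
for _ in range(200):
    z, i = sample_purchase(0, Theta, Phi)
    assert T[z, i] == 1   # an item is never drawn for a tag it lacks
```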
The posterior probability ∏_{(u,i)∈Y} P(Θ, Φ | Y_{u,i}) is inferred by introducing priors for Θ and Φ. For Θ, we adopt the conjugate prior of the multinomial distribution:

P(Θ_u) = Dir(Θ_u | α_u),   (5)

where Dir is the Dirichlet distribution and α ∈ ℝ^{U×K} is a hyperparameter of Dir. Similarly, we introduce a Dirichlet prior for Φ. In this case, since Φ is constrained by T, the sample space of the prior is also constrained. We denote

P(Φ_k) = Dir(Φ_k | β_k),   (6)

where the hyperparameter β ∈ ℝ^{K×I} takes β_{k,i} = 0 when T_{k,i} = 0. Figure 1 illustrates the graphical model of the generative process.

Correlated Probabilistic Feature Scoring Model
We extend PFSM by incorporating tag--tag correlation. For user preferences, we can naturally consider that preferences with similar meanings are correlated. For example, one with a higher preference for "humor" tends to have a higher value for "comedy." Since each dimension k of the topic matrix Θ corresponds to a tag name, we can use external knowledge for the topic correlation. On the other hand, we conjecture that the correlation derived from word meanings is not always valid for user preferences. Therefore, we use the word correlation matrix as a prior of this extended model, which we call correlated PFSM (CPFSM), and infer the topic correlation with it.
Based on the correlated topic model [19], we introduce a multivariate normal distribution for correlated topics. The generative process of CPFSM is as follows:

• Draw the mean and covariance of the user topics: μ, Σ ∼ NIW(μ_0, Σ_0, λ_0, ν_0). Here NIW is the normal-inverse-Wishart distribution, which is a conjugate prior for the multivariate normal distribution. The prior knowledge μ_0 and Σ_0 controls the mean and covariance matrix of the topic parameter, respectively, and λ_0 and ν_0 are hyperparameters of the distribution.
• Draw a parameter for the user topic distribution from a multivariate normal distribution: η_u ∼ N(μ, Σ). Here N(μ, Σ) represents a normal distribution with mean μ and covariance matrix Σ.
• Convert η to Θ by the softmax function: Θ_{u,k} = exp(η_{u,k}) / Σ_l exp(η_{u,l}).
• Draw the item distribution Φ from the constrained Dirichlet distribution in Eq. (6).
• For each purchase with index c of user u, draw the user attention z_{u,c} from the multinomial distribution of Θ, and draw the purchased item according to Φ.

Similar to PFSM, we impose the constraint that Φ_{k,i} is nonzero only if T_{k,i} ≠ 0. Figure 2 illustrates the graphical model of the generative process.
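The softmax conversion from the Gaussian draw η to the preference simplex Θ is the step that lets the covariance Σ induce correlated preferences. A minimal sketch, with an arbitrarily chosen correlation value:

```python
import numpy as np

rng = np.random.default_rng(2)

def eta_to_theta(eta):
    """Row-wise softmax: map each user's Gaussian vector eta[u]
    onto the probability simplex (the logistic-normal construction)."""
    e = np.exp(eta - eta.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

# Topics 0 and 1 positively correlated (e.g., "humor" and "comedy"),
# topic 2 independent.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
eta = rng.multivariate_normal(mu, Sigma, size=5000)
Theta = eta_to_theta(eta)

# Users with a higher draw for topic 0 tend to have a higher draw for topic 1.
corr = np.corrcoef(eta[:, 0], eta[:, 1])[0, 1]
```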

Minimization for FFSM
For the minimization of Eq. (1), we develop an efficient algorithm for FFSM based on the alternating least squares approach proposed in [31]. In the algorithm, we carry out element-wise minimizations of Θ_{u,k} and Φ_{k,i}. The update equation for Θ_{u,k} is given by

Θ_{u,k} ← [ Σ_i w_{u,i} Φ_{k,i} ( Y_{u,i} − Σ_{k'≠k} Θ_{u,k'} Φ_{k',i} ) ] / [ Σ_i w_{u,i} Φ_{k,i}^2 + λ ]

for any 1 ≤ u ≤ U and 1 ≤ k ≤ K. Similarly, the update equation for Φ_{k,i} is given by

Φ_{k,i} ← [ Σ_u w_{u,i} Θ_{u,k} ( Y_{u,i} − Σ_{k'≠k} Θ_{u,k'} Φ_{k',i} ) ] / [ Σ_u w_{u,i} Θ_{u,k}^2 + λ ]

for T_{k,i} = 1, and Φ_{k,i} = 0 otherwise. To satisfy the nonnegativity constraint on Θ and Φ, we set Θ_{u,k} = 0 and Φ_{k,i} = 0 for negative values at each iteration. Algorithm 1 gives the pseudo-code of this algorithm.
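The element-wise updates can be sketched directly in numpy. This is a naive sweep written for clarity under our reading of the update rule; the paper's Algorithm 1 may organize the residual bookkeeping differently:

```python
import numpy as np

def ffsm_als_step(Y, Theta, Phi, T, w0=0.1, lam=0.01):
    """One sweep of element-wise alternating least squares for the
    tag-constrained objective, maintaining the residual R = Y - Theta @ Phi."""
    n_users, n_items = Y.shape
    n_keys = T.shape[0]
    W = np.where(Y > 0, 1.0, w0)
    R = Y - Theta @ Phi
    for u in range(n_users):
        for k in range(n_keys):
            r = R[u] + Theta[u, k] * Phi[k]      # residual without this term
            num = (W[u] * Phi[k] * r).sum()
            den = (W[u] * Phi[k] ** 2).sum() + lam
            new = max(num / den, 0.0)            # nonnegativity clip
            R[u] += (Theta[u, k] - new) * Phi[k]
            Theta[u, k] = new
    for k in range(n_keys):
        for i in range(n_items):
            if T[k, i] == 0:                     # hard tag constraint
                R[:, i] += Theta[:, k] * Phi[k, i]
                Phi[k, i] = 0.0
                continue
            r = R[:, i] + Theta[:, k] * Phi[k, i]
            num = (W[:, i] * Theta[:, k] * r).sum()
            den = (W[:, i] * Theta[:, k] ** 2).sum() + lam
            new = max(num / den, 0.0)
            R[:, i] += Theta[:, k] * (Phi[k, i] - new)
            Phi[k, i] = new
    return Theta, Phi
```

Each element-wise update exactly minimizes the objective with respect to that single entry, so a sweep never increases the loss.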

Inference for PFSM
In PFSM, most inference techniques developed for vanilla LDA can be applied with a small modification, i.e., by constraining the item distribution Φ according to T. We adopt collapsed Gibbs sampling, in which z_{u,c} is sampled after marginalizing out Φ and Θ:

P(z_{u,c} = k | z_{∖u,c}, Y, T) ∝ ( n_{u,k}^{∖u,c} + α_{u,k} ) · ( n_{k,i}^{∖u,c} + β_{k,i} ) / Σ_{i'} ( n_{k,i'}^{∖u,c} + β_{k,i'} )

for T_{k,i} = 1, and 0 otherwise. Here, n_{k,i} is the total number of times topic k is assigned to item i, n_{u,k} is the total number of times topic k appears in the purchases of user u, and ∖u,c denotes that the counts exclude the cth purchase of user u.
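A single sweep of the constrained collapsed sampler can be sketched as below. For simplicity, `alpha` and `beta` are scalar symmetric hyperparameters here, whereas the text allows entrywise values:

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_sweep(purchases, z, n_uk, n_ki, T, alpha=0.5, beta=0.5):
    """One collapsed-Gibbs sweep for PFSM. purchases is a list of (u, i)
    pairs, z the current topic assignment per purchase; n_uk and n_ki are
    the count tables. Topics with T[k, i] == 0 get probability zero."""
    n_keys = T.shape[0]
    for c, (u, i) in enumerate(purchases):
        k_old = z[c]
        n_uk[u, k_old] -= 1              # exclude purchase c from the counts
        n_ki[k_old, i] -= 1
        # Unnormalized conditional; the beta term is the collapsed Phi,
        # with the prior mass restricted to the support of T.
        p = (n_uk[u] + alpha) * (n_ki[:, i] + beta) \
            / (n_ki.sum(axis=1) + beta * T.sum(axis=1))
        p = p * T[:, i]                  # hard tag constraint
        p = p / p.sum()
        k_new = rng.choice(n_keys, p=p)
        z[c] = k_new
        n_uk[u, k_new] += 1
        n_ki[k_new, i] += 1
    return z
```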
The elements of Φ and Θ can then be estimated as

Φ_{k,i} = ( n_{k,i} + β_{k,i} ) / Σ_{i'} ( n_{k,i'} + β_{k,i'} ),   (11)

Θ_{u,k} = ( n_{u,k} + α_{u,k} ) / Σ_{k'} ( n_{u,k'} + α_{u,k'} ).

Algorithm 2 gives the pseudo-code of this algorithm.

Inference for CPFSM
For CPFSM, we also infer the model parameters by Gibbs sampling. Although P(η) is no longer conjugate to P(z | η), an efficient sampling algorithm for z and η has been proposed in [32]. The sampling probability of z_{u,c} = k is given by the Φ-collapsed posterior

P(z_{u,c} = k | z_{∖u,c}, η, Y, T) ∝ exp(η_{u,k}) · ( n_{k,i}^{∖u,c} + β_{k,i} ) / Σ_{i'} ( n_{k,i'}^{∖u,c} + β_{k,i'} )

for T_{k,i} = 1, and 0 otherwise, and we obtain the tag score by Eq. (11). On the other hand, a scale-mixture representation of P(η_{u,k} | λ_{u,k}) enables the introduction of an auxiliary variable λ_{u,k} (see the study by Chen et al. [32] for more detail). By utilizing this technique, the conditional distribution of η_{u,k} becomes a normal distribution whose mean and variance are determined by the counts n_{u,k}, the auxiliary variable λ_{u,k}, and ζ_{u,k} = ln( Σ_{l≠k} e^{η_{u,l}} ). The sampling distribution for λ_{u,k} is given by PG(n_u, η_{u,k} − ζ_{u,k}), where PG is the Polya--Gamma distribution.
Since n_u is generally large in our case, we can approximate PG(n_u, η_{u,k} − ζ_{u,k}) by a normal distribution with the same mean and standard deviation.
For μ and Σ, owing to the conjugacy between the NIW and multivariate normal distributions, the posterior distribution can be parameterized by an NIW: P(μ, Σ | η) ∝ P(η | μ, Σ) P(μ, Σ) = NIW(μ̄, Σ̄, λ̄, ν̄), where, with the sample mean η̄ = (1/U) Σ_u η_u and scatter matrix S = Σ_u (η_u − η̄)(η_u − η̄)^T, we define

μ̄ = (λ_0 μ_0 + U η̄) / (λ_0 + U), λ̄ = λ_0 + U, ν̄ = ν_0 + U, Σ̄ = Σ_0 + S + [λ_0 U / (λ_0 + U)] (η̄ − μ_0)(η̄ − μ_0)^T.

The MAP inference gives μ = μ̄ and Σ = Σ̄ / (ν̄ + K + 2). Algorithm 3 gives the pseudo-code of this algorithm.

Consistency of Tag Score
We now discuss the consistency of the rankings. In our models, we approximate Y by nonnegative Φ and Θ, which can be interpreted as a realization of nonnegative matrix factorization (NMF). In general, NMF does not always yield the same solution for Φ and Θ. As a result, the obtained tag score may differ from trial to trial. This inconsistency stems from two sources: non-uniqueness and local minima. The non-uniqueness of nonnegative matrix factorization is a well-known problem. Even after removing trivial degrees of freedom, such as permutation of the components or scaling, the solution of NMF is generally not unique. Uniqueness relates to the form of the decomposition, and sufficient and necessary conditions for uniqueness have been investigated [33][34][35][36]. In our model, thanks to the binary tag constraints, the decomposition tends to admit a unique solution: the scaling arbitrariness is removed by the condition Σ_i Φ_{k,i} = 1, and one can easily see that permutation is not allowed unless there exists a pair k and k' ≠ k satisfying T_{k,i} = T_{k',i} for all i ∈ I. Moreover, a recent study observed that if T and the transaction matrix Y are sparse, the decomposition tends to yield a unique solution [36].
Another source of inconsistency comes from the fact that the global minimum search is NP-hard [37], and there exist local minima at which a different list of tag scores would be suggested. One remedy for local minima is to run the algorithm several times with different random seeds and observe the behavior of the list. In this paper, we simply discard tags with unstable rankings and leave further development for future work.
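One simple way to implement the discard rule is to compare top-N lists across seeds. The overlap measure and thresholds below are arbitrary choices for illustration, not the paper's procedure:

```python
import numpy as np

def stable_tags(rankings, top_n=10, min_overlap=0.8):
    """rankings: list over random seeds of {tag: [item ids, best first]}.
    Keep a tag only if its top-N list is sufficiently consistent across
    runs, measured by the mean pairwise overlap of the top-N sets."""
    kept = []
    for k in rankings[0].keys():
        tops = [set(r[k][:top_n]) for r in rankings]
        pairs = [(a, b) for ai, a in enumerate(tops) for b in tops[ai + 1:]]
        overlap = np.mean([len(a & b) / top_n for a, b in pairs])
        if overlap >= min_overlap:
            kept.append(k)
    return kept
```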

Model Performance Evaluation
We first evaluated our models' performance by using two artificial datasets.

Dataset
We generated two artificial datasets. For the first dataset, we assumed that the user--item interactions were generated in a discriminative-model fashion, i.e., via the product of user preferences and item tag scores. In this dataset, we randomly generated Φ and Θ in a range of 0 to 1 and normalized them along the tag direction and the user direction, respectively. The user--item interaction matrix Y was constructed from the product of Φ and Θ. We binarized Φ and Y by imposing thresholds, whose values were determined by the given number of tags and user purchases, respectively. Below, we call this dataset D1. For the second dataset, we assumed that the user--item interactions were generated via the generative process described in Sect. 3. We first generated an item binary tag matrix T under a fixed number of tags for each item (we adopted 5 and 15), then generated Θ and Φ by using Eqs. (5) and (6), respectively, under the constraint of T, setting α = β = 0.5, and finally generated Y by using Eq. (3). Below, we call this dataset D2. We fixed the user, item, and keyword spaces to U = 500, I = 500, and K = 50 for both models. We generated each dataset ten times and computed the average scores and standard deviations.
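The D1 construction can be sketched as follows. The normalization axes and the count-based threshold rule are our reading of the description above:

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_items, n_keys = 500, 500, 50
n_tags, n_buys = 5, 20                       # tags per item, purchases per user

Phi = rng.random((n_keys, n_items))
Theta = rng.random((n_users, n_keys))
Phi /= Phi.sum(axis=1, keepdims=True)        # per-tag normalization over items
Theta /= Theta.sum(axis=1, keepdims=True)    # per-user normalization over tags

S = Theta @ Phi                              # continuous interaction strengths

# Binarize with thresholds set by the target tag / purchase counts.
tag_thr = np.sort(Phi, axis=0)[-n_tags]      # n_tags-th largest score per item
buy_thr = np.sort(S, axis=1)[:, [-n_buys]]   # n_buys-th largest score per user
T = (Phi >= tag_thr).astype(int)
Y = (S >= buy_thr).astype(int)
```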

Evaluation Metrics
We evaluated the models' performance with two standard ranking metrics:

• HitRate (HR). When evaluating the qualitative score, the ranking of items is often more important than the numerical score. To evaluate qualitative ranking performance, we introduce the HR. The HR for the top N items (HR@N) is an intuitive ranking metric that compares the sets of the top N items of the prediction and of the ground truth. The HR is the ratio of the intersection of the two item sets:

HR_k@N = | topN^pred(k) ∩ topN^true(k) | / N.

• Normalized discounted cumulative gain (nDCG). The second metric considers both ranking and score accuracy. Intuitively, score accuracy for higher-ranked items is more important than that for lower-ranked items. DCG is a popular weighted rank metric that places larger weight on higher-ranked items. DCG@N for tag k is defined by

DCG_k@N = Σ_{n=1}^{N} rel(k, i_{k_n}) / log_2(n + 1),

where the model predicts the top N items of tag k, {i_{k_1}, i_{k_2}, …, i_{k_N}}, and rel(k, i) represents the relevance of item i for tag k. In our problem setting, the relevance is rel(k, i) = Φ_{k,i}. The nDCG is defined by nDCG_k@N = DCG_k@N / IDCG_k@N, where IDCG represents the ideal DCG, in which the top N item series i_{k_1}, i_{k_2}, …, i_{k_N} is computed from the ground-truth Φ_{k,i}. As the metric of our experiments, we used the averaged nDCG: nDCG@N = (1/K) Σ_k nDCG_k@N.
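Both metrics can be sketched compactly. In this numpy version, `rel` is assumed to hold the ground-truth Φ row for one tag:

```python
import numpy as np

def hit_rate(pred_top, true_top):
    """HR@N: fraction of the true top-N items recovered in the
    predicted top-N (both lists have length N)."""
    return len(set(pred_top) & set(true_top)) / len(true_top)

def ndcg(pred_top, rel):
    """nDCG@N for one tag. pred_top: predicted top-N item ids;
    rel: ground-truth relevance per item (here rel[i] = Phi[k, i])."""
    gains = rel[np.asarray(pred_top)]
    disc = 1.0 / np.log2(np.arange(2, len(pred_top) + 2))  # 1/log2(n+1)
    dcg = (gains * disc).sum()
    ideal = np.sort(rel)[::-1][: len(pred_top)]
    idcg = (ideal * disc).sum()
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect prediction gives HR@N = 1 and nDCG@N = 1; any mis-ordering of the top items lowers nDCG below 1.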

Experimental Setting
The first experiment observed the relationship between accuracy and the sparsity of a dataset. The transaction data were generated by changing the number of interactions per user in a range from 5 to 100. Performance was measured using the metrics introduced above.
In the second experiment, we evaluated the models' performance under contaminated tag features. We generated Φ from the original T and added noise tags, which were generated with probability proportional to tag frequency. Using the contaminated tag features, we fitted the transactions. We changed the number of contaminations per item from 1 to 30 and obtained the metric scores. In both experiments, we generated a validation dataset for each number of interactions/contaminations and tuned the model hyperparameters by Bayesian search [38,39]. The hyperparameters were determined from the results of 50 trials. We set the search ranges w ∈ [0.001, 1] and λ ∈ [0.001, 100] for FFSM, and α, β ∈ [0.0001, 10] for PFSM.

Baseline Model
We adopted a random prediction and a popularity-based scoring model as the baselines.¹ For each tag k, we define the item set I_k of items that have tag k. We computed the qualitative score of an item i ∈ I_k by accumulating the user interactions with that item. The formula for the popularity score can be written as

S^{pop}_{k,i} = Σ_u Y_{u,i} for i ∈ I_k, and 0 otherwise.   (17)
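The popularity baseline above is a one-liner in numpy:

```python
import numpy as np

def popularity_score(Y, T):
    """Baseline: the score of item i for tag k is the total number of
    interactions with i, restricted to items carrying tag k (zero otherwise)."""
    pop = Y.sum(axis=0)   # interactions per item, shape (I,)
    return T * pop        # broadcast over the K tag rows
```

As noted in the text, this baseline assigns the same value to every tag an item carries.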

Results
Figures 3 and 4 show the results. In both experiments, our models significantly outperformed the baseline models over a wide range of numbers of interactions/contaminations. The performance decreased as the number of tags increased, which can be seen by comparing the scores of #Tags = 5 and #Tags = 15 as well as from the results of the second experiment. This implies that the tag score is more difficult to resolve in a dense tag situation.
Comparing the performances of FFSM and PFSM, we found that FFSM gave competitive performance even on the generative dataset (D2).

Performance Evaluation: Application to Tag Relevance
In this experiment, we evaluated our models' performance using a real-world dataset. Since our feature scoring is a new problem setting, few open datasets are available regarding user behavior and item qualitative scores. Therefore, we instead evaluated our models' performance on relevance scoring, in which we assume tags with low qualitative scores are likely to be irrelevant.

Fig. 3 Performance regarding sparsity. Fig. 4 Model performance regarding tag contamination

¹ Since unsupervised qualitative score estimation is a rather new concept, there are few existing methods. We investigated the performance of top-n ranking methods, such as weighted regularized matrix factorization [13,14] and Bayesian personalized ranking [40], which utilize the structure of the tag matrix to predict the scores. However, we confirmed that their prediction accuracy was not significantly better than the random prediction because the tag matrix was generated randomly.

Dataset
We adopted the MovieLens dataset.² The dataset contains transaction data of user ratings of movies, consisting of 610 users, 9724 items, and 100,836 ratings. We extracted samples rated above 3.5 as positive samples. For the relevance scores of movies, we used the tag genome [41], in which the relevance scores of 1128 tags were accumulated from questionnaires for a small sample of movies and a prediction model was constructed to complete the relevance scores for the other movies. We aimed to demonstrate how precisely our models reconstruct the relevance score without supervised data, using only user--item interactions and contaminated binary tag information.

Experimental Setting
We evaluated the performance of FFSM, PFSM, and predictions combined with the tag frequency; the last are obtained as the product of the tag frequency and the tag scores of FFSM and PFSM. We constructed a binary tag matrix by thresholding the relevance scores so that each movie keeps its top ten most relevant tags; we considered these tags as the relevant tags. We selected items that appeared at least ten times in the transactions. To evaluate our models' resolution, we added artificial irrelevant tags to the original binary features. The irrelevant tags were generated with probability proportional to tag frequency. We changed the number of contaminated tags and evaluated model resolution using HR and nDCG. The metric scores were computed for each item, not for each tag.
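The contamination step, adding frequency-proportional noise tags, can be sketched as follows. This is our reading of the noise model described above:

```python
import numpy as np

rng = np.random.default_rng(4)

def contaminate(T, n_noise):
    """Add n_noise irrelevant tags per item, drawn without replacement
    with probability proportional to overall tag frequency."""
    n_keys, n_items = T.shape
    freq = T.sum(axis=1).astype(float)
    Tn = T.copy()
    for i in range(n_items):
        cand = np.where(Tn[:, i] == 0)[0]     # tags the item does not have
        if len(cand) == 0:
            continue
        p = freq[cand] + 1e-12                # avoid all-zero probabilities
        p /= p.sum()
        add = rng.choice(cand, size=min(n_noise, len(cand)),
                         replace=False, p=p)
        Tn[add, i] = 1
    return Tn
```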

Baseline Model
We adopted a tag-frequency-based relevance score as the baseline model. Based on the intuition that a less frequent tag is likely to be less relevant, we define the score of tag k for item i as S^freq_{ik} = ∑_{i'} T_{i',k}.
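The baseline score above can be computed directly from the binary tag matrix: the score of tag k is its column sum, assigned to every item that carries the tag. A minimal sketch (illustrative names, plain nested lists in place of the paper's matrix notation):

```python
# Tag-frequency baseline: S_freq[i][k] = (# items with tag k) if item i has
# tag k, and 0 otherwise. T is a binary item-by-tag matrix.
def freq_score(T):
    n_items, n_tags = len(T), len(T[0])
    col_sums = [sum(T[i][k] for i in range(n_items)) for k in range(n_tags)]
    return [[T[i][k] * col_sums[k] for k in range(n_tags)]
            for i in range(n_items)]
```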

Results
Figure 5 shows the results. Both FFSM and PFSM scored higher than the baseline in the low-contamination region, while the tag-frequency-weighted prediction of FFSM significantly outperformed the other models above 50 percent tag contamination. Since our models were not optimized for the irrelevant-tag-removal task, other methods that focus on that task may produce better results. We note, however, that combining our method with such approaches may yield better accuracy.

Fig. 5 Model performances regarding relevance score

Exploration for the Qualitative Scores
Finally, we show the item ranking for each tag obtained with our models using two real-world datasets.

Dataset
We adopted the MovieLens dataset and the Goodbooks-10k dataset [42], which provides ratings of 10,000 books from 53,424 users. In the Goodbooks dataset, the counts of user-annotated tags are also provided, and we used them as the item features.

Experimental Setting
We implemented CPFSM and compared the results with those from the baseline ranking models. For the baseline ranking, we adopted the popularity-based ranking introduced in Eq. (17) for MovieLens and used the tag count for Goodbooks. For the binary item features, we used the genre tags for MovieLens and the tag count for Goodbooks.
In the Goodbooks dataset, we constructed the set of tags shown in Fig. 6 by merging tags with orthographic variants or similar meanings, and we filtered the books having at least one of these tags. We treated a series of books as the same title. We filtered the transactions so that they contain books rated above 3.5 at least ten times and users with more than 100 interactions. We constructed the prior correlation matrix Σ₀ from the cosine similarity of the FastText word embeddings [43].
To make Σ consistent with the word relations, we scaled the word similarity matrix S to Σ₀ = (ν − K − 1)S. We fixed μ₀ = 0, ν = U + K, and λ = U.
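The prior construction above can be sketched as follows. This is an illustrative reconstruction under an assumption: the scaling Σ₀ = (ν − K − 1)S matches the standard inverse-Wishart identity E[Σ] = Σ₀ / (ν − K − 1), so that the prior mean of Σ equals the embedding similarity matrix S. Function names are hypothetical.

```python
# Sketch: build the word similarity matrix S from embedding vectors, then
# scale it so the inverse-Wishart prior mean of Sigma equals S.
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities of word vectors (e.g., tag embeddings)."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    return [[cos(a, b) for b in vectors] for a in vectors]

def scale_prior(S, nu, K):
    """Sigma_0 = (nu - K - 1) * S, assuming E[Sigma] = Sigma_0 / (nu - K - 1)."""
    c = nu - K - 1
    return [[c * s for s in row] for row in S]
```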

Results
We show some of the tag rankings in Tables 3 and 4. With the baseline, titles with high popularity were ranked highly across different tags; for example, the Star Wars series was ranked at the top for both the sci-fi and adventure tags in MovieLens. Owing to the decomposition of the user preferences, CPFSM provided different rankings for the two tags: Interstellar, Inception, and Arrival for sci-fi, and the Lord of the Rings series for adventure. Similarly, while Harry Potter was ranked highest for the adventure, fantasy, and children tags by the tag-count ranking in the Goodbooks dataset, CPFSM provided more divergent rankings, in which experts on each tag may prefer the listed books.
We also examined the keyword correlations obtained from the correlated LDA. Figure 6 shows a word-relation graph of the Goodbooks dataset, obtained by thresholding the correlation matrix. The difference between the original prior and the posterior correlations can be seen. For example, while the words "funny" and "paranormal" are similar in the word-embedding space, the weight of this edge decreases after fitting. On the other hand, "children" and "juvenile" are not strongly similar in terms of their word vectors, while the inference shows that they represent a similar notion of preference.

Conclusion
We have proposed an approach to estimate qualitative scores from the binary features of products. Based on the natural assumption that an item with a better property is more popular among users who prefer that property, we have proposed one discriminative and two generative models with which user preferences and item qualitative scores are inferred from user--item interactions. In these models, the space of the item qualitative scores is constrained by the binary item features so that the score of each item and tag can be nonzero only when the item has the corresponding tag.
We have evaluated our models' performance using two artificial and two real-world datasets. In the first experiment, the performance of our models under the sparse-transaction and noisy-tag settings was demonstrated using the two artificial datasets. We have also evaluated our models' resolution for irrelevant tags using the MovieLens dataset and observed that they outperform the baseline model. Finally, we have discussed the tag rankings and tag correlations obtained from the two real-world datasets.

Table 1
Example of binary features

Table 2
Example of probabilistic qualitative scores

Table 3
Tag rankings for the MovieLens dataset (long titles are omitted due to space limitations)