1 Introduction

Keywords, tags, and categories are widely used to describe product properties. Most e-commerce services, such as Amazon, Alibaba, and eBay, use categories and tags for item filtering. In social bookmarking and recommendation services such as Delicious, Last.fm, and MovieLens, tags are sometimes annotated by users. Even without explicit tags, we unconsciously infer item tags from side information, e.g., product names, descriptions, reviews, and package designs.

However, such binary expressions are incomplete because they lack three aspects of numerical information: quantity, relevance, and quality. Quantity represents the strength of a tag in a certain unit. For example, tags such as low-calorie, light, and hot lose quantitative information such as 50 kcal, 100 g, and four degrees out of five. Relevance, on the other hand, indicates which tag best describes a product when there is no clear unit as with quantity. For example, a picture of a dog in a park may carry the tags “dog” and “park.” If that picture also includes a passerby in the background, it may additionally carry the “man” tag. In that case, the relevance of “dog” and “park” should be higher than that of “man” because the subject of the picture is “dog in a park.” Finally, quality provides information about the goodness of products from the viewpoint of each property. One typical example is the food, location, and mood properties of resort hotels. We often refer to this information in the form of a qualitative score for comparison. We stress the difference between the qualitative score and the other two types of numerical information. Consider a protein bar containing high-quality protein. In contrast to the quantitative information, the qualitative score answers “which product contains better protein?,” not “how much protein does the product contain?,” because the quality of the protein does not directly correspond to its amount. The relevance score also cannot be directly interpreted as the qualitative score: even for a low-purity protein bar, the tag “protein” is relevant to that product. Qualitative information is particularly useful for manufacturers and promoters, who can learn how attractive their products are to customers compared with those of their competitors. In this paper, we focus on estimating the qualitative score for each binary item feature.

One difficulty of qualitative score estimation is that supervised data are usually unavailable. The qualitative score could be estimated by comparing similar products, i.e., by asking “which hotel is better from the viewpoint of food?” However, a comprehensive survey is unrealistic. More seriously, since scoring is a much more difficult task than tagging, scoring is always affected by psychological biases and depends heavily on the individual rater. From this point of view, we focus on an unsupervised estimation setting.

Another difficulty arises in the unsupervised setting because the customer’s purpose is usually implicit in most customer--product interactions such as ratings, purchases, views, clicks, and bookmarks. If user purposes or needs were explicit, one could obtain a qualitative score by ranking items for each user purpose. One representative example is query and purchase actions on an e-commerce site. A query represents customer demand for products, and the site’s search engine filters products based on the query and lists them. Therefore, by regarding the query as an item tag, one can interpret the popularity of each product as its qualitative score in terms of that query. In brick-and-mortar shopping, however, most customers’ purposes are implicit: we usually do not declare any purpose or query. Even in the e-commerce case, a typical purchase process is to first filter items by category, query, or star rating, then read the description of each item, and finally decide which item to buy. In this process, which aspect the user prefers remains implicit.

We also stress that product features in real-world data often contain tags with low relevance, i.e., tags that do not adequately describe the product. One typical case is tagging websites by keyword extraction. The extraction process always yields a number of irrelevant keywords from links, advertisements, or daily rankings, and one finds only 5–10 keywords that adequately describe the web content. Many low-relevance tags can also be observed in user annotations on social service sites. More interestingly, irrelevant tagging occurs when there is a gap between manufacturers and customers, i.e., when tag information provided by manufacturers does not correctly describe item features from the customer’s point of view.

Although label enhancement has recently been addressed, few studies have focused on qualitative score estimation. In the context of multi-label classification, Geng [1] proposed a label enhancement approach for binarized sample labels. Supervised learning for label distributions was developed in the earlier implementation, and an unsupervised approach was later proposed by Shao et al. [2], who assumed similarity between the topology of the input features and that of the binary label distributions. However, these approaches mainly target the item relevance scores discussed above and do not directly compare label quality between items. On the other hand, item ranking is widely used in recommender systems, which suggest suitable items for specific users [3,4,5,6,7,8,9,10,11,12,13,14]. However, most studies focus on constructing a personalized item ranking for each user and do not provide information about the qualitative aspect of the apparent binary item features.

In this paper, we propose a novel approach for qualitative score estimation from binary and noisy item features, which resolves the difficulties mentioned above, i.e., (1) no supervised data, (2) implicit customer purposes, and (3) noisy tags. With this approach, we infer the qualitative score based on the assumption that an item with a better property is more popular among users who prefer that property; in short, “experts know best.” We introduce one discriminative and two generative models with which user preferences and item qualitative scores are inferred from user--item interactions. One novelty is that we constrain the space of the item qualitative score by the binary item features so that a score can take a nonzero value only for an item’s tags. We evaluate our models on two artificial and two real-world datasets. With the two artificial datasets, we evaluate the performance of our models under sparse-transaction and noisy-tag settings. We also evaluate our models’ resolution of irrelevant tags using a real-world dataset of movie ratings and observe that our models outperform a baseline model. Finally, tag rankings obtained with our models from the two real-world datasets are compared with those of a baseline ranking model.

The organization of this paper is as follows: In Sect. 2, we review recent related work. In Sect. 3, we discuss the proposed approach and models. In Sect. 4, we present the minimization and inference algorithms for our models. In Sect. 5, we discuss the consistency of the tag score. We present the experimental results in Sect. 6 and finally give a discussion and conclusion.

2 Related Work

We infer both the qualitative tag scores and user preferences from user--item interactions such as ratings, purchases, views, clicks, and bookmarks. Collaborative filtering is a conventional approach for latent feature extraction from user--item interactions: user and item latent features are extracted through low-dimensional matrix factorization of the interaction matrix. Earlier studies focused on data with explicit user ratings [3,4,5,6,7,8,9,10,11,12], and later studies developed approaches for implicit feedback such as clicks, bookmarks, and purchases [13, 14]. Probabilistic approaches for feature extraction have also been proposed within the framework of the topic model [15,16,17]. The topic model was first introduced as latent semantic analysis (LSA) [15] for document clustering, and a probabilistic extension was proposed later [16]. Latent Dirichlet allocation (LDA) [17] is a Bayesian extension of probabilistic LSA in which a document has a topic distribution and the terms of the document are generated from the term distributions of its topics. These models yield interpretable latent topics by giving the word distribution of each topic. Many extensions have been proposed in terms of document relations [18], topic correlation [19], hierarchical topic structure [20, 21], and the incorporation of word meaning [22,23,24,25,26,27]. In collaborative topic modeling, Wang and Blei [28] proposed a model that combines a topic model for item documents with matrix factorization for user--item interactions. One significant difference from our models is that their model uses rich item text, such as articles or reviews, to construct the topic distribution of item features, whereas our approach uses sparse, noisy binary tags.

As mentioned in the introduction, the qualitative score relates to tag relevance. In the context of multi-label classification, label importance was introduced by Geng [1], with which the label distribution is inferred by supervised learning. Later, Shao et al. [2] proposed an unsupervised approach in which a numerical label distribution is inferred under the constraint of given binary labels, assuming similarity between the topology of the input features and that of the label distributions. Tag feature extraction has also been investigated in the context of tag recommendation, in which proper tags are recommended to annotators. Latent tag features were constructed from a tensor decomposition of user--item--tag relations, enabling personalized tag recommendation [29, 30].

3 Model Descriptions

In this section, we introduce our tag scoring models, with which the qualitative tag score is inferred from user--item interactions and binary item tags. We first give the notation and the problem setting, define the qualitative tag score, and finally describe our models. We propose three models: one is based on matrix factorization, and the other two are probabilistic realizations based on the topic model.

3.1 Problem Setting and Notations

We consider user--item interactions composed of a set of users \({\mathcal {U}}=\{u_{1}, u_{2}, \dots , u_{U} \}\) and items \({\mathcal {I}}= \{i_{1}, i_{2}, \dots , i_{I} \}\). Here, U and I represent the numbers of users and items, respectively. We represent user--item interactions in the form of an adjacency matrix. For countable interactions, such as purchases and clicks, we consider \(Y \in {\mathbb {N}}^{U \times I}\), where \(Y_{u,i} = n\) if user \(u \in {\mathcal {U}}\) interacts with item \(i \in {\mathcal {I}}\) n times. For interactions with numerical values such as ratings, the adjacency matrix is represented as \(Y \in {\mathbb {R}}^{U \times I}\), where \(Y_{u,i} = r\) if user \(u \in {\mathcal {U}}\) gives a rating of r to item \(i \in {\mathcal {I}}\). In both cases, \(Y_{u,i} = 0\) implies a missing sample. For simplicity, we consider user purchases of items. We also denote the set of items purchased by user u by \({\mathcal {Y}}_{u} = \{i_{u1}, i_{u2}, \dots \}\) and the set of all user--item interactions by \({\mathcal {Y}}= \{(u, i) \mid i \in {\mathcal {Y}}_{u}\}\). We define a set of tags \({\mathcal {K}}=\{k_{1}, k_{2}, \dots , k_{K} \}\) and express item features by a binary tag matrix \(T \in \{0, 1\}^{K \times I} \). In our setting, we assume that an item can have several tags and that \(T_{k,i} = 1\) if item \(i \in {\mathcal {I}}\) has tag \(k \in {\mathcal {K}}\).
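As a concrete illustration of this notation, the following numpy sketch builds toy Y, T, \({\mathcal {Y}}_{u}\), and \({\mathcal {Y}}\) (the counts are made up for illustration; the variable names are ours):

```python
import numpy as np

U, I, K = 3, 4, 2  # numbers of users, items, and tags (toy sizes)

# Count-valued interaction matrix: Y[u, i] = number of times user u bought item i.
# Y[u, i] = 0 marks a missing (unobserved) pair.
Y = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 0, 0, 2]])

# Binary tag matrix: T[k, i] = 1 iff item i has tag k.
T = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]])

# Per-user purchase sets Y_u and the interaction set cal{Y}.
Y_u = {u: set(np.flatnonzero(Y[u])) for u in range(U)}
interactions = {(u, i) for u in range(U) for i in Y_u[u]}
```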

The problem is to estimate the tag score \(\Phi \in {\mathbb {R}}^{K \times I}\) from the observed Y and T, where \(\Phi _{k,i}\) represents the qualitative score of item i in terms of tag k. Simple examples are given in Tables 1 and 2, where the item binary features T and the qualitative scores \(\Phi \) correspond to Tables 1 and  2, respectively.

3.2 Definition of Qualitative Score and Model Concept

Table 1 Example of binary features
Table 2 Example of probabilistic qualitative scores

To clarify the model concept, we first define the qualitative score for our approach. In a nutshell, if an item has a tag with a higher qualitative score, then that item is more popular or attractive for users who prefer that property. In the following, we explain this concept in more detail and introduce the models derived from this definition.

As mentioned in the introduction, the simplest definition can be obtained from manual scoring. However, since manual scoring is difficult and affected by psychological bias, scoring based on user--item interactions is preferable. One simple qualitative score uses the popularity of items: by comparing item popularity tag by tag, we can construct tag rankings. However, this scoring gives the same score to every tag of a product. For example, a movie with fantasy and sci-fi tags has the same score for these two tags. We instead propose a user-preference-based score in which an item with a higher tag score is more popular or attractive for users who prefer that property. Under this definition, the behavior of the experts of each tag determines that tag’s score; we therefore describe this concept as the “experts know best” assumption. In the previous example, if a movie is more popular among those who like fantasy, a higher score is assigned to the fantasy tag than to the sci-fi tag. In other words, we assume that item purchases can be approximated by a combination of user preferences and tag scores. This definition naturally leads to an extension of matrix factorization, which we call the factorized feature scoring model (FFSM).

We can also define the score as a probability. In this definition, \(\Phi _{k,i}\) represents the probability that item i is chosen in terms of tag k. Table 2 illustrates a simple example with three items that have the tags hot, curry, and healthy; each column is normalized to 1 for the probabilistic definition. The score shows that when one wants something hot, Chili Chicken, Chicken Curry, and Beef Curry are chosen with probabilities of 0.3, 0.3, and 0.4, respectively. Similarly, when one wants curry, Chicken Curry and Beef Curry are chosen with probabilities of 0.4 and 0.6. Since Chili Chicken is not a curry, i.e., does not have the curry tag, its probability is zero. Using this definition, we model purchases by extending the topic model, which we call the probabilistic feature scoring model (PFSM). Note that in this definition, the item selection probability for a given tag is inferred from the selection behavior of users who prefer that tag; it is therefore also based on the “experts know best” assumption.
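The probabilistic scores of this example can be reproduced by masking hypothetical raw item weights with the binary tags and normalizing each tag over items (a toy sketch; tags index rows here, and the raw weights are our own illustrative values chosen to recover the probabilities above):

```python
import numpy as np

items = ["Chili Chicken", "Chicken Curry", "Beef Curry"]
tags = ["hot", "curry", "healthy"]

# Binary tag matrix T[k, i]: Chili Chicken does not carry the curry tag.
T = np.array([[1, 1, 1],    # hot
              [0, 1, 1],    # curry
              [1, 1, 1]])   # healthy

# Hypothetical unnormalized per-tag item weights (illustrative values only).
raw = np.array([[3.0, 3.0, 4.0],
                [5.0, 2.0, 3.0],
                [1.0, 1.0, 2.0]])

# Constrain by T and normalize each tag over items: sum_i Phi[k, i] = 1.
Phi = raw * T
Phi /= Phi.sum(axis=1, keepdims=True)
```

The masked entry makes the “curry” probability of Chili Chicken exactly zero, regardless of its raw weight.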

3.3 Factorized Feature Scoring Model

Based on the first definition of the qualitative score, we develop a matrix factorization model. The basic idea of this model is that the matrix Y can be approximated by the product of low-rank matrices P and Q: \(Y_{ui} \sim \sum _{d=1}^{D} P_{ud} (Q^{T})_{di}\), where we define \(P \in {\mathbb {R}}^{U \times D}\), \(Q \in {\mathbb {R}}^{I \times D}\), and \(D \ll U, I\). The rows of P represent the users’ latent preferences, and the rows of Q represent the items’ latent features. The notation D is the dimension of the latent space. The weighted squared error between Y and \(PQ^T\) is minimized under the regularization of P and Q, which yields the objective function L as

$$\begin{aligned} L = \sum _{u,i} w_{ui} \left( Y_{ui} - \sum _{d} p_{ud} q_{id} \right) ^{2} + \lambda \left( \sum _{u, d} p_{ud}^{2} + \sum _{i, d} q_{id}^2\right) \,, \end{aligned}$$

where \(p_{ud}\) (\({q}_{id}\)) \( \in {\mathbb {R}}\) denotes the entry of P (Q) in the u(i)th row and dth column, and the hyperparameter \(\lambda \) controls the strength of the regularization. The weight \(w_{ui} \in [0, 1] \) is introduced for missing samples and takes \(w_{ui} = 1\) if user u purchases item i (i.e., \(Y_{ui} > 0\)) and a small value otherwise. We simply adopt the uniform weight  [13].

With FFSM, we define the tag-constrained objective function by assuming that the product of the user preference \(\Theta \) and the qualitative tag score \(\Phi \) explains the purchase matrix Y:

$$\begin{aligned} L = \sum _{u,i}&w_{ui} \left( Y_{ui} - \sum _{k}^{K} \Theta _{u, k} \Phi _{k, i} \right) ^{2} \nonumber \\&+ \lambda \sum _{u, k} \Theta _{u,k}^{2} + \lambda \sum _{k, i} Reg(\Phi _{k, i}, T) \,. \end{aligned}$$
(1)

In our approach, we minimize L under the nonnegativity constraint \(\Theta _{u, k}, \Phi _{k, i} \ge 0\). In contrast to the original matrix factorization, we let the dimension of the latent space D be equal to the dimension of the tag space, i.e., \(D = K\), and introduce the tag constraint Reg:

$$\begin{aligned} Reg(\Phi _{k, i}, T) = \left\{ \begin{array}{ll} \Phi _{k, i}^2 &{} (T_{k, i} = 1) \\ c_{\infty } \Phi _{k, i}^2 &{} (T_{k, i} = 0) \end{array} \right. \,. \end{aligned}$$
(2)

Here \(c_{\infty }\) is a large constant that suppresses \(\Phi _{k, i}\) where \(T_{k, i} = 0\). In this paper, we adopt the hard constraint \(c_{\infty } = \infty \), i.e., \(\Phi _{k, i} = 0\) if item i does not have tag k.
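Under the hard constraint, Eq. (1) reduces to a weighted squared error in which \(\Phi \) is simply masked by T. A minimal numpy sketch of evaluating this objective (the function and parameter defaults are ours, not part of the model specification):

```python
import numpy as np

def ffsm_objective(Y, Theta, Phi, T, w0=0.1, lam=0.01):
    """Eq. (1) with the hard constraint c_inf = infinity (Phi masked by T)."""
    Phi = Phi * T                        # hard tag constraint, Eq. (2)
    W = np.where(Y > 0, 1.0, w0)         # uniform weight w0 for missing samples
    resid = Y - Theta @ Phi
    return (W * resid ** 2).sum() + lam * ((Theta ** 2).sum() + (Phi ** 2).sum())
```

With the mask in place, the \(c_{\infty }\) penalty term never enters the fit: entries of \(\Phi \) outside an item’s tags are identically zero.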

3.4 Probabilistic Feature Scoring Model

We develop this model based on the topic model, in which the purchase history is modeled by the product of the probability that a user preference appears at a purchase and the item distribution with respect to the chosen preference:

$$\begin{aligned} P(Y_{u,i} = 1 | \Theta , \Phi ) = \sum _{z} P(Y_{u,i} = 1 | z, \Phi ) P(z | \Theta ) \,. \end{aligned}$$
(3)

Here, \(\Theta \in {\mathbb {R}}^{U \times D}\) and \(\Phi \in {\mathbb {R}}^{D \times I}\), where D is the number of topics. The rows of \(\Theta \) and \(\Phi \) represent the distributions of a user’s latent preferences and an item’s latent topics, respectively. The topic index \(z \in \{1, 2, 3, \dots , D\}\) can be interpreted as the purpose of user u in purchasing item i and is sampled from the multinomial distribution of the user preference \(\Theta \): \(P(z | \Theta ) = \Theta _{u, z}\); item i is then chosen according to the multinomial distribution \(\Phi _{z}\): \(P(Y_{u,i} = 1 | z, \Phi ) = \Phi _{z, i}\).
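Since Eq. (3) marginalizes z, the purchase probabilities are simply the entries of the matrix product \(\Theta \Phi \). A toy check with made-up distributions:

```python
import numpy as np

Theta = np.array([[0.7, 0.3],       # user 0's distribution over 2 topics
                  [0.2, 0.8]])      # user 1's distribution over 2 topics
Phi = np.array([[0.5, 0.5, 0.0],    # topic 0's distribution over 3 items
                [0.1, 0.1, 0.8]])   # topic 1's distribution over 3 items

# P(Y_{u,i} = 1) = sum_z Phi[z, i] * Theta[u, z] = entry (u, i) of Theta @ Phi
P = Theta @ Phi
```

Because each row of \(\Theta \) and each row of \(\Phi \) sums to one, each row of the product is again a probability distribution over items.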

In this model, we define the matrices of \(\Theta \) and \(\Phi \) based on the tag space T. First, we let D be equal to the dimension of the tag space, i.e., \(D = K\). Moreover, we impose a constraint where \(\Phi \) is nonzero if and only if \(T \ne 0\), i.e.,

$$\begin{aligned} \Phi _{k, i} \left\{ \begin{array}{ll} > 0 &{} (T_{k, i} = 1) \\ = 0 &{} (T_{k, i} = 0) \end{array} \right. \,, \end{aligned}$$
(4)

satisfying \(\sum _i \Phi _{k, i} = 1\). With this constraint from the tag space, each component \(\Phi _{k, i}\) can be interpreted as the qualitative score defined in the previous subsection.

The posterior probability \(P(\Theta , \Phi \mid \{Y_{u,i}\}_{(u, i) \in {\mathcal {Y}}})\) is inferred by introducing priors on \(\Theta \) and \(\Phi \). For \(\Theta \), we adopt the conjugate prior of the multinomial distribution:

$$\begin{aligned} P(\Theta ) = \text {Dir}(\Theta | \alpha ) \,, \end{aligned}$$
(5)

where \(\text {Dir}\) is the Dirichlet distribution and \(\alpha \in {\mathbb {R}}^{U \times K}\) is its hyperparameter. Similarly, we introduce the Dirichlet prior for \(\Phi \). In this case, since \(\Phi \) is constrained by T, the sample space of the prior is also constrained. We denote

$$\begin{aligned} P(\Phi _{k}) = \text {Dir}(\Phi _{k} | \beta _{k}, T) , \end{aligned}$$
(6)

and a hyperparameter \(\beta \in {\mathbb {R}}^{K \times I}\) takes \(\beta _{k, i} = 0\) when \(T_{k, i} = 0\). Figure 1 illustrates the graphical model of the generative process.

3.5 Correlated Probabilistic Feature Scoring Model

We extend PFSM by incorporating tag--tag correlations. For user preferences, we can naturally assume that preferences with similar meanings are correlated; for example, one with a higher preference for “humor” tends to have a higher value for “comedy.” Since each dimension k of the topic matrix \(\Theta \) corresponds to a tag name, we can use external knowledge of word meaning for the topic correlation. On the other hand, we conjecture that the correlation derived from word meaning is not always valid for user preferences. Therefore, we use the word correlation matrix only as a prior of this extended model, which we call correlated PFSM (CPFSM), and infer the topic correlation from the data.

Based on the correlated topic model  [19], we introduce a multivariate normal distribution for correlated topics. The generative process of CPFSM can be described as follows:

  • Draw the mean and covariance of the user topics: \(\mu , \Sigma \sim \text {NIW}(\mu _0, \Sigma _0, \lambda , \nu )\). Here, NIW is the normal-inverse-Wishart distribution, the conjugate prior of the multivariate normal distribution. The prior knowledge \(\mu _0\) and \(\Sigma _0\) controls the mean and covariance matrix of the topic parameter, respectively, and \(\lambda \) and \(\nu \) are hyperparameters of the distribution.

  • Draw a parameter for user topic distribution \(\eta \) by a multivariate normal distribution: \(\eta \sim \text {N}(\mu , \Sigma )\). Here \(\text {N}(\mu , \Sigma )\) represents a normal distribution with a mean of \(\mu \) and covariance matrix of \(\Sigma \).

  • Convert \(\eta \) to \(\Theta \) by the softmax function: \( \Theta _{uk} = e^{\eta _{uk}} / \sum _{k} e^{\eta _{uk}} \).

  • Draw the item distribution \(\Phi \) from the constrained multinomial parameter space of Eq. (4).

  • For each purchase with index c of user u, draw user attention \(z_{u,c}\) from the multinomial distribution of \(\Theta \): \( z_{u,c} \sim \text {Mult} (\Theta ) \).

  • Draw the selected item \(i_{u,c}\) from the multinomial distribution of \(\Phi _{z_{u,c}}\): \( i_{u,c} \sim \text {Mult} (\Phi _{z_{u,c}}) \).

Similar to PFSM, we impose a constraint where \(\Phi \) is nonzero if and only if \(T \ne 0\). Figure 2 illustrates the graphical model of the generative process.
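The generative process above can be sketched as a simulation (a simplified illustration: for brevity we fix \(\mu \) and \(\Sigma \) instead of drawing them from the NIW prior, and all sizes and the tag density are toy values of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
U, K, I = 50, 3, 20          # users, tags/topics, items (toy sizes)
n_purchases = 30             # purchases per user

# Binary tag matrix; the guards ensure every item and every tag is used.
T = (rng.random((K, I)) < 0.4).astype(float)
T[:, T.sum(axis=0) == 0] = 1.0
T[T.sum(axis=1) == 0, 0] = 1.0

# Steps 1-3: eta ~ N(mu, Sigma), then softmax to obtain Theta
# (mu and Sigma fixed here rather than drawn from the NIW prior).
mu = np.zeros(K)
Sigma = 0.5 * np.ones((K, K)) + 0.5 * np.eye(K)   # correlated topics
eta = rng.multivariate_normal(mu, Sigma, size=U)
Theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)

# Step 4: constrained item distribution Phi, Eq. (4).
Phi = rng.random((K, I)) * T
Phi /= Phi.sum(axis=1, keepdims=True)

# Steps 5-6: for each purchase, draw a topic z, then an item from Phi[z].
Y = np.zeros((U, I), dtype=int)
for u in range(U):
    for zc in rng.choice(K, size=n_purchases, p=Theta[u]):
        Y[u, rng.choice(I, p=Phi[zc])] += 1
```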

Fig. 1 Graphical model of PFSM

Fig. 2 Graphical model of CPFSM

4 Minimization and Inferences

4.1 Minimization for FFSM

For the minimization of Eq. (1), we develop an efficient algorithm for FFSM based on the alternating least squares approach proposed in [31]. In this algorithm, we carry out element-wise minimizations of \(\Theta _{u, k}\) and \(\Phi _{k, i}\). The update equation for \(\Theta _{u, k}\) is given by

$$\begin{aligned} \Theta _{u, k} = \frac{ \sum _{i} w_{u, i} \Phi _{k, i} Y_{u, i} - \sum _{i} \sum _{k' \ne k} w_{ui} \Phi _{k, i} \Theta _{uk'} \Phi _{k', i} }{ \lambda + \sum _{i} w_{ui} \Phi _{k, i}^2 } \,, \end{aligned}$$
(7)

for any \(1 \le u \le U\) and \(1 \le k \le K\). Similarly, the update equation for \(\Phi _{k, i}\) is given by

$$\begin{aligned} \Phi _{k, i} = \frac{ \sum _{u} w_{u, i} \Theta _{u, k} Y_{u, i} - \sum _{u} \sum _{k' \ne k} w_{u, i} \Theta _{u, k} \Phi _{i, k'} \Theta _{u, k'} }{ \lambda + \sum _{u} w_{u, i} \Theta _{u, k}^2 } \,, \end{aligned}$$
(8)

if \(T_{k, i} = 1\), and otherwise \(\Phi _{k, i} = 0\), for any \(1 \le i \le I\) and \(1 \le k \le K\). To satisfy the nonnegativity constraint on \(\Theta \) and \(\Phi \), we set negative values of \(\Theta \) and \(\Phi \) to 0 at each iteration.

Algorithm 1 illustrates the pseudo-code for this algorithm.
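One sweep of these element-wise updates, Eqs. (7) and (8), can be sketched in numpy as follows (our illustrative reading of the algorithm, not the authors’ code; nonnegativity is enforced by clipping at zero):

```python
import numpy as np

def als_step(Y, Theta, Phi, T, W, lam):
    """One sweep of the element-wise updates of Eqs. (7) and (8)."""
    U, I = Y.shape
    K = Phi.shape[0]
    for u in range(U):
        for k in range(K):
            # sum over k' != k of Theta[u, k'] * Phi[k', :]
            pred_wo_k = Theta[u] @ Phi - Theta[u, k] * Phi[k]
            num = (W[u] * Phi[k] * (Y[u] - pred_wo_k)).sum()
            Theta[u, k] = max(num / (lam + (W[u] * Phi[k] ** 2).sum()), 0.0)
    for k in range(K):
        for i in range(I):
            if T[k, i] == 0:
                Phi[k, i] = 0.0            # hard tag constraint
                continue
            pred_wo_k = Theta @ Phi[:, i] - Theta[:, k] * Phi[k, i]
            num = (W[:, i] * Theta[:, k] * (Y[:, i] - pred_wo_k)).sum()
            Phi[k, i] = max(num / (lam + (W[:, i] * Theta[:, k] ** 2).sum()), 0.0)
    return Theta, Phi
```

Each coordinate update minimizes a convex quadratic in that single entry (with clipping giving the constrained minimum), so a sweep never increases the objective.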

4.2 Inference for PFSM

In PFSM, most of the inference techniques developed for vanilla LDA can be applied with a small modification, i.e., by constraining the item distribution \(\Phi \) according to T. We adopt collapsed Gibbs sampling, in which \(z_{u,c}\) is sampled after marginalizing out \(\Phi \) and \(\Theta \):

$$\begin{aligned} P(&z_{u,c} = k | z^{\backslash u,c}, Y, \alpha , \beta ) \nonumber \\&\propto \left( \frac{\beta _{k, i_{u,c}} + n_{k, i_{u,c}\backslash u,c}}{\sum _{i} \beta _{k, i} + n_{k, i\backslash u,c}} \right) \left( \frac{ \alpha _{u, k} + n_{u, k\backslash u,c}}{\sum _{k'} \alpha _{u, k'} + n_{u, k'\backslash u,c}} \right) \,. \end{aligned}$$
(9)

for \(T_{k, i_{u,c}} = 1\) and 0 otherwise. Here, \(n_{k, i}\) is the number of times topic k is assigned to item i, and \(n_{u, k}\) is the number of times topic k appears in the purchases of user u:

$$\begin{aligned} \begin{array}{ll} n_{k, i} &{}= \sum _{u, c} I(z_{u, c} = k) I(i_{u, c} = i) \,, \\ n_{u, k} &{}= \sum _{c} I(z_{u, c} = k) \,, \end{array} \end{aligned}$$
(10)

and \(\cdot \backslash u, c\) represents the calculation excluding the cth purchased item by user u.

The elements of \(\Phi \) and \(\Theta \) can be written as:

$$\begin{aligned} \Phi _{ki}&\propto \left\{ \begin{array}{ll} \beta _{k, i} + n_{k, i} &{} (T_{k, i} = 1) \\ 0 &{} (T_{k, i} = 0) \end{array} \right. \,, \end{aligned}$$
(11)

and

$$\begin{aligned} \Theta _{uk}&\propto \alpha _{u, k} + n_{u, k} \,. \end{aligned}$$

Algorithm 2 illustrates the pseudo-code for this algorithm.
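A minimal sketch of this constrained collapsed Gibbs sampler (our illustrative implementation with toy data; note that the user-side denominator in Eq. (9) is constant in k and cancels on normalization):

```python
import numpy as np

def gibbs_sweep(purchases, z, n_ki, n_uk, T, alpha, beta, rng):
    """One collapsed Gibbs sweep over all purchase events, Eq. (9)."""
    K = T.shape[0]
    for c, (u, i) in enumerate(purchases):
        n_ki[z[c], i] -= 1                 # exclude the current assignment
        n_uk[u, z[c]] -= 1
        p = ((beta[:, i] + n_ki[:, i]) / (beta.sum(axis=1) + n_ki.sum(axis=1))
             * (alpha[u] + n_uk[u]))
        p *= T[:, i]                       # zero probability for absent tags
        p /= p.sum()
        z[c] = rng.choice(K, p=p)
        n_ki[z[c], i] += 1
        n_uk[u, z[c]] += 1
    return z

# Toy run: 3 tags, 5 items, 4 users, 6 purchases per user.
rng = np.random.default_rng(2)
K, I, U = 3, 5, 4
T = np.ones((K, I)); T[0, 0] = 0.0
alpha = np.full((U, K), 0.1)
beta = 0.5 * T                             # beta[k, i] = 0 where T[k, i] = 0
purchases = [(u, int(i)) for u in range(U) for i in rng.integers(0, I, size=6)]
z = [int(np.flatnonzero(T[:, i])[0]) for _, i in purchases]
n_ki = np.zeros((K, I)); n_uk = np.zeros((U, K))
for c, (u, i) in enumerate(purchases):
    n_ki[z[c], i] += 1; n_uk[u, z[c]] += 1
z = gibbs_sweep(purchases, z, n_ki, n_uk, T, alpha, beta, rng)

# Eq. (11): recover the constrained tag score from the counts.
Phi = (beta + n_ki) * T
Phi /= Phi.sum(axis=1, keepdims=True)
```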

4.3 Inference for CPFSM

For CPFSM, we also infer the model parameters by Gibbs sampling. Although \(P(\eta )\) is no longer conjugate to \(P(z | \eta )\), an efficient sampling algorithm for z and \(\eta \) was proposed by Chen et al. [32]. The sampling probability of \(z_{u,c} = k\) is given by the \(\Phi \)-collapsed posterior:

$$\begin{aligned} P(z_{u,c}&= k | z^{\backslash u,c}, Y, \eta , \beta , T) \nonumber \\&\propto \left( \frac{\beta _{k, i_{u,c}} + n_{k, i_{u,c}\backslash u,c}}{\sum _{i} \beta _{k, i} + n_{k, i\backslash u,c}} \right) \left( \frac{e^{\eta _{u, k}}}{\sum _{k'} e^{\eta _{u, k'}}} \right) \,. \end{aligned}$$
(12)

and we obtain the tag score by Eq. (11). On the other hand, the scale mixture representation of \(p(\eta _{u, k} | \eta ^{\backslash k})\) enables the introduction of an auxiliary variable \(\lambda _{u, k}\) (see the study by Chen et al.  [32] for more detail). With this technique, we obtain the sampling distributions for \(\eta \) and \(\lambda \):

$$\begin{aligned} P(\eta _{u, k} | \eta _{u}^{\backslash k}, z, \Sigma ) \propto N(\gamma _{u, k}, \tau _{u, k}) \,, \end{aligned}$$
(13)

where we define

$$\begin{aligned} \gamma _{u, k}&= \tau _{u, k}^{2} ( \mu _{k, u}/ \Sigma _{k, k} + n_{u, k} - n_{u} / 2 + \lambda _{u, k} \zeta _{u, k} ) , \\ \tau _{u, k}&= (1/\Sigma _{k, k} + \lambda _{u, k})^{-1/2} . \end{aligned}$$

Here, \(n_{u} = \sum _{k} n_{u, k}\), \( \mu _{k, u} = \mu _{k} - \Sigma _{k, k} \Lambda _{k, \backslash k} (\eta _{u, \backslash k} - \mu _{\backslash k}) \), and \(\zeta _{u, k} = \ln (\sum _{l \ne k} e^{\eta _{u, l}})\). The sampling distribution for \(\lambda _{u, k}\) is \(PG(n_{u}, \eta _{u, k} -\zeta _{u, k})\), where PG is the Polya--Gamma distribution.

Since \(n_{u}\) is generally large in our case, we approximate PG by a normal distribution with the mean and standard deviation of \(PG(n_{u}, \eta _{u, k} -\zeta _{u, k})\).

For \(\mu \) and \(\Sigma \), due to the conjugacy between NIW and multivariate normal distributions, posterior distribution can be parameterized by NIW: \(P(\mu , \Sigma | \eta ) \propto P(\eta | \mu , \Sigma ) P(\mu , \Sigma ) = NIW ({\hat{\mu }}, {\hat{\Sigma }}, {\hat{\lambda }}, {\hat{\nu }}) \), where we define \({\hat{\mu }} = (\lambda \mu _0 + U {\tilde{\eta }}) / (\lambda + U)\), \({\hat{\Sigma }} = \Sigma _0 + \sum _u (\eta _u - {\tilde{\eta }}) (\eta _u - {\tilde{\eta }})^{T} + \frac{\lambda U}{\lambda + U} ({\tilde{\eta }} - \mu _0) ({\tilde{\eta }} - \mu _0)^{T}\), \({\hat{\lambda }} = \lambda + U\), and \({\hat{\nu }} = \nu + U\). Here, \({\tilde{\eta }}\) is the mean of \(\eta \): \({\tilde{\eta }} \equiv 1 /U \sum _u \eta _u\). The MAP inference gives

$$\begin{aligned} \begin{array}{ll} \Sigma _{MAP} &{}= {\hat{\Sigma }} / ({\hat{\nu }} + K + 1)\,, \\ \mu _{MAP} &{}= {\hat{\mu }}\,. \end{array} \end{aligned}$$
(14)

Algorithm 3 illustrates the pseudo-code for this algorithm.
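The conjugate update for \(\mu \) and \(\Sigma \) and the MAP estimates of Eq. (14) can be sketched directly (our illustrative implementation of this one step; the function name is ours):

```python
import numpy as np

def niw_map(eta, mu0, Sigma0, lam, nu):
    """NIW posterior parameters and the MAP estimates of Eq. (14)."""
    U, K = eta.shape
    eta_bar = eta.mean(axis=0)                       # tilde{eta}
    mu_hat = (lam * mu0 + U * eta_bar) / (lam + U)
    scatter = (eta - eta_bar).T @ (eta - eta_bar)    # sum_u of outer products
    Sigma_hat = (Sigma0 + scatter
                 + (lam * U / (lam + U)) * np.outer(eta_bar - mu0, eta_bar - mu0))
    nu_hat = nu + U
    return mu_hat, Sigma_hat / (nu_hat + K + 1)      # mu_MAP, Sigma_MAP
```

As a sanity check, when all \(\eta _u\) equal \(\mu _0\), the scatter terms vanish and \(\Sigma _{MAP}\) reduces to \(\Sigma _0 / ({\hat{\nu }} + K + 1)\).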

5 Consistency of Tag Score

We now discuss the consistency of the rankings. In our models, we approximate Y by nonnegative \(\Phi \) and \(\Theta \), which can be interpreted as one realization of nonnegative matrix factorization (NMF). In general, NMF does not always provide the same solution for \(\Phi \) and \(\Theta \); as a result, the obtained tag score may differ from trial to trial. This inconsistency stems from two sources: non-uniqueness and local minima.

Non-uniqueness of nonnegative matrix factorization is a well-known problem. Even after removing trivial degrees of freedom, such as permutation of the components or scaling, the solution of NMF is not generally unique. The uniqueness relates to the form of the decomposition, and necessary and sufficient conditions for uniqueness have been investigated  [33,34,35,36]. In our model, thanks to the binary tag constraints, the decomposition tends to offer a unique solution: the scaling arbitrariness is removed by the condition \(\sum _{i} \Phi _{k, i} = 1\), and one can easily see that permutation is not allowed unless there exists a pair k and \(k' \ne k\) satisfying \(T_{k, i} = T_{k', i}\) for all \(i \in {\mathcal {I}}\). Moreover, a recent study observed that if T and the transactions Y are sparse, the decomposition tends to offer a unique solution  [36].

Another source of inconsistency comes from the fact that the global minimum search is NP-hard  [37] and there exist local minima in which a different list of tag scores would be suggested. One remedy for this is to run the algorithm several times with different random seeds and observe the behavior of the list. In this paper, we simply discard tags with unstable rankings and leave further development for future work.

6 Experiments

Fig. 3 Performance regarding sparsity

Fig. 4 Model performance regarding tag contamination

Fig. 5 Model performances regarding relevance score

6.1 Model Performance Evaluation

We first evaluated our models’ performance by using two artificial datasets.

6.1.1 Dataset

We generated two artificial datasets. For the first dataset, we assumed that user--item interactions were generated in a discriminative model fashion, i.e., via the product of user preferences and item tag scores. We randomly generated \(\Phi \) and \(\Theta \) in a range of 0 to 1 and normalized them along the tag direction and the user direction, respectively. The user--item interaction matrix Y was constructed from the product of \(\Phi \) and \(\Theta \). By imposing thresholds on \(\Phi \) and Y, we binarized them; the threshold values were determined by the given number of tags and user purchases, respectively. Below, we call this dataset D1. For the second dataset, we assumed that the user--item interactions were generated via the generative process described in Sect. 3. We first generated an item binary tag matrix T with a fixed number of tags per item (we adopted 5 and 15), then generated \(\Theta \) and \(\Phi \) by using Eqs. (5) and (6), respectively, under the constraint of T, setting \(\alpha ,\, \beta = 0.5\), and finally generated Y by using Eq. (3). Below, we call this dataset D2. We fixed the user, item, and tag spaces to \(U= 500\), \(I = 500\), and \( K = 50\) for both datasets. We generated each dataset ten times and computed the average scores and standard deviations.
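The D1 generation can be sketched as follows under our reading of the thresholding step (keeping the top-n entries realizes "thresholds determined by the given number of tags and purchases"; the sizes match the paper, while the seed and the per-item/per-user counts are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
U, I, K = 500, 500, 50
n_tags, n_purchases = 5, 20        # illustrative; the experiments vary these

# D1: discriminative-style generation.
Phi = rng.random((K, I))
Theta = rng.random((U, K))
Phi /= Phi.sum(axis=0, keepdims=True)      # normalize along the tag direction
Theta /= Theta.sum(axis=1, keepdims=True)  # normalize along the user direction
Y_cont = Theta @ Phi

# Binarize: keep the top-n_tags tags per item and top-n_purchases items per user.
T = np.zeros((K, I), dtype=int)
np.put_along_axis(T, np.argsort(-Phi, axis=0)[:n_tags], 1, axis=0)
Y = np.zeros((U, I), dtype=int)
np.put_along_axis(Y, np.argsort(-Y_cont, axis=1)[:, :n_purchases], 1, axis=1)
```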

6.1.2 Evaluation Metrics

We evaluated the models’ performance by introducing two standard ranking metrics:

  • HitRate (HR) When evaluating the qualitative score, the ranking of items is often more important than the numerical score. To evaluate qualitative ranking performance, we introduce HR. The HR for the top N items (HR@N) is an intuitive ranking metric that compares the sets of the top N items of the prediction and the ground truth. The HR provides the ratio of the intersection of the two item sets:

    $$\begin{aligned} \text {HR}@N = \frac{1}{K} \sum _{k} \frac{1}{N} | \{\text {top } N \text { of } \Phi _{k} \} \cap \{\text {top } N \text { of } {\hat{\Phi }}_{k} \} | \,. \end{aligned}$$
    (15)
  • Normalized discounted cumulative gain (nDCG) The second metric considers both ranking and score accuracy. Intuitively, score accuracy for higher-ranked items is more important than that for lower-ranked items. DCG is a popular weighted ranking metric that imposes larger weights on higher-ranked items. DCG@N for tag k is defined by

    $$\begin{aligned} \text {DCG}_{k}@N = \sum _{n=1}^{N} \frac{2^{rel (k, i^{k}_{n})} - 1}{\log _2 (n+ 1)}\,. \end{aligned}$$
    (16)

    We assume that a model predicts the top N items of tag k: {\(i^{k}_{1}, i^{k}_{2}, \dots , i^{k}_{N}\)}, where rel(k, i) represents the relevance of item i for tag k. In our problem setting, the relevance is \(rel(k, i) = \Phi _{k,i}\). The nDCG is defined by \(\text {nDCG}_{k}@N = \text {DCG}_{k}@N / \text {IDCG}_{k}@N\), where IDCG represents the ideal DCG, in which the top N item indices \(i^{k}_{1}, i^{k}_{2}, \dots , i^{k}_{N}\) are computed from the ground truth \(\Phi _{k,i}\). As the metric for our experiments, we used the averaged nDCG: \(\text {nDCG}@N = \frac{1}{K}\sum _{k} \text {nDCG}_{k}@N\).
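Both metrics can be computed directly from a ground-truth score matrix \(\Phi \) and a prediction \({\hat{\Phi }}\). The following is an illustrative NumPy sketch under the assumption of a \(K \times I\) matrix layout; the function names are ours.

```python
import numpy as np

def hit_rate(phi_true, phi_pred, n):
    """HR@N averaged over tags: overlap ratio of the top-N item sets (Eq. 15)."""
    k_tags = phi_true.shape[0]
    total = 0.0
    for k in range(k_tags):
        top_true = set(np.argsort(-phi_true[k])[:n])
        top_pred = set(np.argsort(-phi_pred[k])[:n])
        total += len(top_true & top_pred) / n
    return total / k_tags

def ndcg(phi_true, phi_pred, n):
    """nDCG@N averaged over tags, with rel(k, i) = phi_true[k, i] (Eq. 16)."""
    def dcg(rel_in_rank_order):
        # Position n (1-indexed) is discounted by log2(n + 1).
        discounts = np.log2(np.arange(2, len(rel_in_rank_order) + 2))
        return ((2.0 ** rel_in_rank_order - 1.0) / discounts).sum()
    scores = []
    for k in range(phi_true.shape[0]):
        pred_order = np.argsort(-phi_pred[k])[:n]
        ideal_order = np.argsort(-phi_true[k])[:n]
        idcg = dcg(phi_true[k][ideal_order])
        scores.append(dcg(phi_true[k][pred_order]) / idcg if idcg > 0 else 0.0)
    return float(np.mean(scores))
```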

6.1.3 Experimental Setting

The first experiment was for observing the relationship between accuracy and the sparsity of a dataset. The transaction data were generated by changing the number of interactions per user in a range from 5 to 100. Performance was measured using the metrics introduced above. In the second experiment, we evaluated the models' performance under contaminated tag features. We generated \(\Phi \) from the original T and added noise tags, which were sampled with a probability proportional to the tag frequency. Using the contaminated tag features, we fitted the transaction data. We changed the number of contaminations per item from 1 to 30 and obtained the metric scores. In both experiments, we generated a validation dataset for each number of interactions/contaminations and tuned the model hyperparameters by Bayesian search  [38, 39]. The hyperparameters were determined from the results of 50 trials. We set the search ranges \(w \in [0.001, 1]\) and \(\lambda \in [0.001, 100]\) for FFSM and \(\alpha ,\,\beta \in [0.0001, 10]\) for PFSM.
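The tuning loop can be sketched as follows, with a simple random log-uniform sampler standing in for the Bayesian search of [38, 39]; the objective is a placeholder for evaluating a validation metric, and all names are ours.

```python
import math
import random

def random_search(objective, bounds, num_trials=50, seed=0):
    """Simplified stand-in for Bayesian hyperparameter search:
    log-uniform sampling over the stated ranges, keeping the best trial.
    `objective` maps a dict of hyperparameters to a validation score."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(num_trials):
        params = {name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
                  for name, (lo, hi) in bounds.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Search ranges for FFSM as stated in the text.
ffsm_bounds = {"w": (0.001, 1.0), "lam": (0.001, 100.0)}
```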

6.1.4 Baseline Model

We adopted a random prediction model and a popularity-based scoring model as baselines. For each tag k, we define the item set \({\mathcal {I}}_{k}\) of items having tag k. We computed the qualitative score for item i in \({\mathcal {I}}_{k}\) by accumulating user interactions with that item. The popularity score can be written as

$$\begin{aligned} S^{\text {pop}}_{ik} = \frac{\sum _{u} Y_{u,i} T_{ik}}{\sum _{u, i} Y_{u,i} T_{ik}} \,. \end{aligned}$$
(17)
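In matrix form, Eq. (17) can be sketched as follows, assuming a binary user--item matrix Y of shape (users, items) and a binary item--tag matrix T of shape (items, tags); the layout and function name are illustrative assumptions.

```python
import numpy as np

def popularity_score(y, t):
    """Popularity baseline of Eq. (17): for each tag k, an item's
    interaction count is normalized by the total count over all
    items carrying tag k.  Items without the tag get score zero."""
    counts = y.sum(axis=0)                     # interactions per item
    masked = counts[:, None] * t               # zero out items lacking tag k
    totals = masked.sum(axis=0, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        s = np.where(totals > 0, masked / totals, 0.0)
    return s                                   # shape: (items, tags)
```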

6.1.5 Results

Figures 3 and 4 show the results. In both experiments, we observed that our models significantly outperformed the baseline models over a wide range of the number of interactions/contaminations. The performance decreased as the number of tags increased, which can be seen by comparing the scores of \(\#{\text{Tags}} = 5\) and \(\#{\text{Tags}}=15\) as well as from the results of the second experiment. This implies that the tag score is more difficult to resolve in a dense tag situation. Comparing the performances of FFSM and PFSM, we found that FFSM gave competitive performance even on the generative dataset (D2).

6.2 Performance Evaluation: Application to Tag Relevance

In this experiment, we evaluated our models' performance using a real-world dataset. Since our feature scoring is a new problem setting, few open datasets are available that contain both user behavior and item qualitative scores. We therefore instead evaluated our models' performance on relevance scoring, under the assumption that tags with low qualitative scores are likely to be irrelevant.

6.2.1 Dataset

We adopted the MovieLens dataset. The dataset contains transaction data of user ratings of movies, consisting of 610 users, 9724 items, and 100,836 ratings. We extracted samples rated above 3.5 as positive samples. For the relevance scores of movies, we used the tag genome  [41], in which the relevance scores of 1128 tags were collected via questionnaires for a small sample of movies, and a prediction model was constructed to complete the relevance scores for the remaining movies. We aimed to demonstrate how precisely our models can reconstruct the relevance score without supervised data, using only user--item interactions and contaminated binary tag information.

6.2.2 Experimental Setting

We evaluated the performance of FFSM, PFSM, and predictions combined with the tag frequency; the last predictions are obtained as the product of the tag frequency and the tag scores of FFSM and PFSM. We constructed a binary tag matrix by imposing a threshold so that each movie kept its top ten relevant tags, which we considered the relevant tags. We selected items that appeared at least ten times in the transaction data. To evaluate our models' resolution, we added artificial irrelevant tags to the original binary features. The irrelevant tags were sampled with a probability proportional to the tag frequency. We changed the number of contaminated tags and evaluated model resolution using HR and nDCG. The metric scores were computed for each item, not for each tag.
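The contamination step can be sketched as follows. This is an illustrative reading: sampling without replacement and the assumption that every tag occurs at least once in T are our choices, and the function name is ours.

```python
import numpy as np

def add_noise_tags(t, num_noise, seed=0):
    """Add `num_noise` contaminated tags per item, sampled in proportion
    to the overall tag frequency.  t: (items, tags) binary matrix;
    returns a contaminated copy."""
    rng = np.random.default_rng(seed)
    freq = t.sum(axis=0).astype(float)     # how often each tag occurs
    t_noisy = t.copy()
    for i in range(t.shape[0]):
        # Candidate tags are those the item does not yet have.
        candidates = np.flatnonzero(~t_noisy[i].astype(bool))
        if candidates.size == 0:
            continue
        p = freq[candidates]
        p = p / p.sum() if p.sum() > 0 else None  # None -> uniform
        n = min(num_noise, candidates.size)
        chosen = rng.choice(candidates, size=n, replace=False, p=p)
        t_noisy[i, chosen] = 1
    return t_noisy
```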

6.2.3 Baseline Model

We adopted the tag-frequency-based relevance score as the baseline model. Based on the intuition that a less frequent tag is likely less relevant, we define the score of tag k for item i as \(S^{\text {freq}}_{ik} = \sum _{i'} T_{i', k}\).
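A minimal sketch of this baseline follows. Since \(S^{\text {freq}}_{ik}\) does not depend on i, broadcasting the per-tag count over the items that carry the tag is our interpretation:

```python
import numpy as np

def frequency_score(t):
    """Tag-frequency baseline: the score of tag k for any item that has
    it is the number of items carrying tag k.  t: (items, tags) binary."""
    return t * t.sum(axis=0, keepdims=True)
```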

6.2.4 Results

Figure 5 shows the results. Both FFSM and PFSM achieved higher scores than the baseline in the region of low contamination, while the tag-frequency-weighted prediction of FFSM significantly outperformed the other models above a tag contamination of 50 percent. Since our model was not optimized for the irrelevant-tag removal task, other methods that focus on that task may produce better results. We note, however, that combining our method with such approaches may yield better accuracy.

6.3 Exploration for the Qualitative Scores

Finally, we show the item ranking for each tag obtained with our models using two real-world datasets.

6.3.1 Dataset

We adopted the MovieLens dataset and the Goodbooks-10k dataset  [42], which provides ratings from 53,424 users on 10,000 books. The Goodbooks dataset also provides the number of user annotations for each tag, which we used as the item features.

6.3.2 Experimental Setting

We implemented CPFSM and compared the results with those from baseline ranking models. For the baseline ranking, we adopted the popularity-based ranking introduced in Eq. (17) for MovieLens and the tag count for Goodbooks. For the binary item features, we used the genre tags for MovieLens and the tag counts for Goodbooks. In the Goodbooks dataset, we made the set of tags shown in Fig. 6 by merging tags with orthographic variants or similar meanings and filtered for books having at least one of these tags. We treated a series of books as the same title. We filtered the transaction data to contain books rated above 3.5 more than ten times and users with more than 100 interactions. We constructed the prior correlation matrix \(\Sigma _0\) from the cosine similarity of FastText word embeddings  [43]. To make \(\Sigma \) resemble the word relations, we scaled the word similarity matrix S as \(\Sigma _0 = \lambda (\nu - K - 1) S\). We fixed \(\mu _0 = 0\), \(\nu = U + K\), and \(\lambda = U\).
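The prior construction can be sketched as follows; the random embeddings in the test stand in for the FastText vectors, and the function name is ours.

```python
import numpy as np

def prior_scale_matrix(embeddings, num_users):
    """Build the prior scale matrix Sigma_0 = lambda * (nu - K - 1) * S
    from cosine similarities of tag word embeddings, with nu = U + K and
    lambda = U as in the text.  embeddings: (K, d) array of tag vectors
    (stand-ins for FastText embeddings)."""
    k = embeddings.shape[0]
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    s = unit @ unit.T                  # cosine similarity matrix S
    nu = num_users + k
    lam = num_users
    return lam * (nu - k - 1) * s, nu, lam
```

With this scaling, the diagonal of \(\Sigma _0\) equals \(\lambda (\nu - K - 1)\), since the cosine self-similarity of each tag vector is 1.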

6.3.3 Results

Table 3 Tag rankings for MovieLens dataset
Table 4 Tag rankings for Goodbooks dataset
Fig. 6 Topic correlation graph for the Goodbooks dataset. Gray lines represent edges with strong word correlations, orange lines represent strong posterior correlations, and thick red lines represent edges where both word and posterior correlations are high

We show some of the tag rankings in Tables 3 and 4. With the baseline, titles with high popularity were ranked highly for many different tags. For example, the Star Wars series was ranked at the top for both the sci-fi and adventure tags in MovieLens. Owing to the decomposition of user preferences, CPFSM provided different rankings for these tags: Interstellar, Inception, and Arrival for sci-fi and The Lord of the Rings series for adventure. Similarly, while Harry Potter was ranked highest for the adventure, fantasy, and children tags by the tag-count ranking in the Goodbooks dataset, CPFSM provided more diverse rankings, containing books that enthusiasts of each tag may prefer.

We also examined the keyword correlations obtained from the correlated LDA. Figure 6 shows a word relation graph of the Goodbooks dataset obtained by imposing a threshold on the correlation matrix. The difference between the original prior and the posterior correlation can be seen. For example, while the words “funny” and “paranormal” are similar in the word embedding space, this edge weakens after the fit. On the other hand, “children” and “juvenile” are not strongly similar in terms of word vectors, while the inference shows that they represent a similar notion of preference.

7 Conclusion

We have proposed an approach to estimate the qualitative score from the binary features of products. Based on a natural assumption that an item with a better property is more popular among users who prefer that property, we have also proposed one discriminative and two generative models with which user preferences and item qualitative scores are inferred from user--item interactions. In these models, the space of the item qualitative score is constrained by the binary item features so that the score of each item and tag can only be nonzero when the item has the corresponding tag.

We have evaluated our models' performance by using two artificial and two real-world datasets. In the first experiment, the performance of our models under sparse transaction and noisy tag settings was demonstrated using the two artificial datasets. We have also evaluated our models' resolution for irrelevant tags using the MovieLens dataset and observed that they outperform the baseline model. Finally, we have discussed the tag rankings and tag correlations obtained from the two real-world datasets.