Collaborative filtering with q‑divergence‑based fuzzy clustering for spherical data

Although recommendation systems are among the most powerful tools for helping people choose items, higher recommendation accuracy is required to satisfy users' needs. Motivated by this requirement, this study proposes a novel collaborative filtering (CF) algorithm; CF is the underlying technology of a recommendation system, and it filters items for a target user based on the reactions of similar users. Cluster analysis helps detect similar users by grouping a set of users such that users in the same group are more similar to each other than to those in other groups. However, users are treated as spherical data in representative CF algorithms such as GroupLens, whereas they were treated as categorical multivariate data in the clustering phase of a previous study. This study closes this logic gap by proposing a CF method that uses q-divergence-based fuzzy clustering for spherical data, so that both the clustering phase and the GroupLens algorithm consistently treat users as spherical data. Experiments were conducted on six real datasets (BookCrossing, Epinions, Jester, LibimSeTi, MovieLens, and SUSHI) to compare the proposed method with two conventional methods: GroupLens and the method using q-divergence-based fuzzy clustering for categorical multivariate data. Performance was measured by the area under the receiver operating characteristic curve. The results indicate that the proposed algorithm outperforms the others in terms of recommendation accuracy.


Introduction
Currently, a considerable amount of information exists on digital platforms; thus, it is extremely difficult to select the information that is truly relevant to each user. Recommender systems are among the most powerful tools for helping people choose products, activities, and friends from many options. Although recommendation systems such as that of Amazon.com have become ubiquitous, their recommendation accuracy is not sufficient to satisfy the growing needs of users. This study is motivated by the need for higher accuracy in recommendation systems. Among the many techniques combined in recommender systems, collaborative filtering (CF) is the most fundamental (Paul et al. 1994; Sarwar et al. 2001); it filters items (products, activities, or friends) that a user may like based on the preferences of similar users. The most representative CF method is GroupLens (Herlocker et al. 1999), which is simple and time efficient. However, the similarity of "similar users" is heuristically defined. An adequate definition of similarity can help CF suggest more appropriate items to users. We consider that users implicitly belong to latent groups, where users in the same group have similar preferences. If we can determine such groups, we can determine the users similar to a target user, and CF can then suggest items to the target user based on the preferences of those similar users.
Clustering is a technique for detecting latent groups. Many clustering methods have been proposed and applied according to the type of given data. Honda (2016) suggested applying fuzzy clustering for categorical multivariate data induced by multinomial mixture models (FCCMM), which is based on a cluster-wise bag-of-words concept. Kondo and Kanzawa (2018) modified the FCCMM algorithm into q-divergence-based fuzzy clustering for categorical multivariate data induced by multinomial mixture models (QFCCMM). The q-divergence was adopted because it is not only a generalization of the standard Kullback-Leibler divergence used in FCCMM but also the divergence discussed in Tsallis statistics, with which predictions and consequences in a wide spectrum of complex systems have been confirmed (Tsallis 2009). Using the q-divergence instead of the Kullback-Leibler divergence in the clustering task offered the potential to capture clusters more adequately, and QFCCMM indeed achieved higher clustering accuracy than FCCMM (Kondo and Kanzawa 2018). Furthermore, in a previous study, we proposed applying QFCCMM as a preparatory step of GroupLens for CF tasks and showed that the QFCCMM-based CF algorithm outperforms not only the GroupLens algorithm but also the FCCMM-based CF algorithm.
A clustering method should be chosen according to the given data type. FCCMM (Honda et al. 2015) and QFCCMM (Kondo and Kanzawa 2018) were originally proposed for categorical multivariate data, such as document data. When FCCMM or QFCCMM is applied to the CF task, the vector of item ratings given by a user (rating vector) is treated as categorical multivariate data. On the other hand, GroupLens does not treat the rating vector as categorical multivariate data. Pearson's coefficient used in GroupLens focuses on the directions of the users' rating vectors instead of their magnitudes. Since the users' rating vectors are treated as having uniform magnitude, they lie on the unit hypersphere whose dimension is the number of items. In other words, GroupLens treats the rating vector as spherical data. Therefore, there is a logic gap: users are considered as categorical multivariate data in the QFCCMM-based CF algorithm, whereas they are considered as spherical data in the GroupLens algorithm. A clustering method for spherical data has the potential to close this gap. In a previous study, Higashi et al. proposed q-divergence-based fuzzy clustering for spherical data, referred to as QFCS (Higashi et al. 2019), and demonstrated on several document datasets that it achieved higher clustering accuracy. Although QFCS is worth applying not only to clustering documents but also to clustering rating vectors for CF tasks, such an application has not been reported in the literature.
In this study, we propose a CF algorithm that uses the QFCS clustering algorithm. First, every unevaluated element in the given rating matrix is tentatively set to the lowest value among all evaluated values, and all values are then normalized such that all users' rating vectors lie on the unit hypersphere. Second, the QFCS algorithm segments the users' rating vectors into clusters. Third, the GroupLens algorithm is applied to each cluster of users' rating vectors. Finally, an item is recommended if the corresponding estimated rating value is higher than the predefined cut-off value. Through numerical experiments using six real datasets, the results of the proposed method are compared with those of two competing methods (GroupLens and the QFCCMM-based algorithm). The experimental results indicate that the proposed algorithm performs better than both in terms of recommendation accuracy.
The remainder of this paper is organized as follows: Sect. 2 introduces a representative CF algorithm, GroupLens; a clustering-based CF algorithm from our previous work, the QFCCMM-based CF algorithm; and a fuzzy clustering algorithm for spherical data, the QFCS algorithm. Section 3 presents the proposed CF algorithm. Section 4 presents numerical experiments, and the conclusion is presented in Sect. 5.

Conventional collaborative filtering method: GroupLens
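The most frequently used CF algorithms are based on the concept of "neighborhood" (Herlocker et al. 1999), wherein the neighbors of a target user are selected based on the preferences of the target user, and the latent preferences of the target user are then estimated from the preferences of those neighbors. Let $N$ and $M$ be the numbers of users and items, respectively. Let $x_{k,\ell} \ge 0$ ($k \in \{1, \dots, N\}$, $\ell \in \{1, \dots, M\}$) be the rating value that user #$k$ gave to item #$\ell$. The matrix whose $(k,\ell)$-th element is $x_{k,\ell}$ is denoted by $X$. Since users do not always evaluate all items, some elements of $X$ are missing, and the goal of CF is to estimate these missing values. Let $y_{k,\ell} \in \{0, 1\}$ be the indicator of whether user #$k$ evaluated item #$\ell$, defined as

$$y_{k,\ell} = \begin{cases} 1 & (\text{user } \#k \text{ evaluated item } \#\ell), \\ 0 & (\text{user } \#k \text{ has not evaluated item } \#\ell), \end{cases} \tag{1}$$

and let $Y$ denote the matrix whose $(k,\ell)$-th element is $y_{k,\ell}$. Let $\mathrm{item}(k)$ be the set of items that user #$k$ evaluated. Let $\mathrm{sim}(k,k')$ be the similarity measure between the target user #$k$ and a neighboring user #$k'$. It is defined by Pearson's correlation coefficient over the rating values of the items that both users #$k$ and #$k'$ have evaluated:

$$\mathrm{sim}(k,k') = \frac{\sum_{\ell \in \mathrm{item}(k) \cap \mathrm{item}(k')} \left(x_{k,\ell} - \bar{x}^{k'}_{k,\cdot}\right)\left(x_{k',\ell} - \bar{x}^{k}_{k',\cdot}\right)}{\sqrt{\sum_{\ell \in \mathrm{item}(k) \cap \mathrm{item}(k')} \left(x_{k,\ell} - \bar{x}^{k'}_{k,\cdot}\right)^2}\,\sqrt{\sum_{\ell \in \mathrm{item}(k) \cap \mathrm{item}(k')} \left(x_{k',\ell} - \bar{x}^{k}_{k',\cdot}\right)^2}}, \tag{2}$$

where $\bar{x}^{k'}_{k,\cdot}$ is the mean rating value of user #$k$ over the items that both users #$k$ and #$k'$ evaluated, described as

$$\bar{x}^{k'}_{k,\cdot} = \frac{\sum_{\ell \in \mathrm{item}(k) \cap \mathrm{item}(k')} x_{k,\ell}}{\left|\mathrm{item}(k) \cap \mathrm{item}(k')\right|}.$$

If $\mathrm{item}(k) \cap \mathrm{item}(k')$ is empty, $\mathrm{sim}(k,k')$ is set to zero. Let $\hat{x}_{k,\ell}$ be the missing value for item #$\ell$, which the target user #$k$ has not evaluated, and let $\bar{x}_{k,\cdot}$ be the mean rating value of user #$k$ over the items that user #$k$ evaluated:

$$\bar{x}_{k,\cdot} = \frac{\sum_{\ell \in \mathrm{item}(k)} x_{k,\ell}}{\left|\mathrm{item}(k)\right|}.$$

The GroupLens method (Herlocker et al. 1999) estimates the unknown rating value $\hat{x}_{k,\ell}$ of the target user #$k$ such that the deviation between $\hat{x}_{k,\ell}$ and $\bar{x}_{k,\cdot}$ is the mean, weighted by Pearson's correlation coefficient, of the deviations between $x_{k',\ell}$ and $\bar{x}_{k',\cdot}$, where #$k'$ ranges over the users with a non-negative correlation with the target user #$k$. The estimated rating value is then

$$\hat{x}_{k,\ell} = \bar{x}_{k,\cdot} + \frac{\sum_{k' \in \mathrm{user}(\ell),\ \mathrm{sim}(k,k') \ge 0} \mathrm{sim}(k,k')\left(x_{k',\ell} - \bar{x}_{k',\cdot}\right)}{\sum_{k' \in \mathrm{user}(\ell),\ \mathrm{sim}(k,k') \ge 0} \mathrm{sim}(k,k')}, \tag{4}$$

where $\mathrm{user}(\ell)$ is the set of users who evaluated item #$\ell$. If there is no user #$k'$ satisfying both $k' \in \mathrm{user}(\ell)$ and $\mathrm{sim}(k,k') \ge 0$ for the target user #$k$, $\hat{x}_{k,\ell}$ in Eq. (4) is just $\bar{x}_{k,\cdot}$.
The GroupLens algorithm is summarized as follows.
[GroupLens]
Step 1. Obtain the similarities among users according to their preferences using Eq. (2).
Step 2. Estimate the missing values using Eq. (4).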
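The two GroupLens steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pearson_sim` and `grouplens_predict` are hypothetical, missing ratings are encoded as NaN by assumption, and a neighbor whose co-rated values are constant is assigned similarity zero (a case the text does not address). The toy matrix holds the three users of cluster #1 from the paper's Table 6 example.

```python
import numpy as np

def pearson_sim(X, k, k2):
    """Pearson correlation over the items both users rated (NaN = missing)."""
    common = ~np.isnan(X[k]) & ~np.isnan(X[k2])
    if not common.any():
        return 0.0  # no co-rated items: similarity is set to zero
    a = X[k, common] - X[k, common].mean()
    b = X[k2, common] - X[k2, common].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Assumption: zero variance on the co-rated items gives similarity 0.
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def grouplens_predict(X, k, item):
    """Estimate the rating of user k for `item` from non-negatively correlated users."""
    mean_k = np.nanmean(X[k])
    num = den = 0.0
    for k2 in range(X.shape[0]):
        if np.isnan(X[k2, item]):
            continue  # only users who rated the item can contribute
        s = pearson_sim(X, k, k2)
        if s >= 0:
            num += s * (X[k2, item] - np.nanmean(X[k2]))
            den += s
    # Fall back to the user's own mean when no usable neighbor exists.
    return mean_k if den == 0 else mean_k + num / den

# Users #1, #3, #5 of the Table 6 example; user #1 has not rated item #4.
X = np.array([[1, 1, 5, np.nan],
              [2, 2, 4, 4],
              [1, 1, 5, 5]], dtype=float)
pred = grouplens_predict(X, 0, 3)
```

On this toy cluster both neighbors correlate perfectly with user #1, so the estimate is 7/3 + (1 + 2)/2 ≈ 3.83, which agrees with the predicted value reported for Table 6.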

QFCCMM-based CF (Kondo and Kanzawa 2019)
In the GroupLens method, the users #$k'$ ($k' \in \{1, \dots, N\}$) similar to the target user #$k$ are heuristically defined as those satisfying $\mathrm{sim}(k,k') \ge 0$ in Eq. (4). Note that there is no theoretical basis for this definition, and there exist many ways to define the users similar to the target user. We focus on clustering users based on their preferences. Kondo and Kanzawa proposed the QFCCMM algorithm (Kondo and Kanzawa 2018), as follows. Let $X = \{x_k\}_{k=1}^{N}$ with $x_k = (x_{k,1}, \dots, x_{k,M})$ be a categorical multivariate dataset, where $x_{k,\ell}$ represents the co-occurrence relation between the $k$-th user and the $\ell$-th item. The membership of $x_k$ to the $i$-th cluster is denoted by $u_{i,k}$ ($i \in \{1, \dots, C\}$, $k \in \{1, \dots, N\}$), and the set of $u_{i,k}$ is denoted by $U$, which obeys the constraint

$$\sum_{i=1}^{C} u_{i,k} = 1. \tag{5}$$

The typicality of the $\ell$-th item for the $i$-th cluster is denoted by $w_{i,\ell}$ ($i \in \{1, \dots, C\}$, $\ell \in \{1, \dots, M\}$); the set of $w_{i,\ell}$ is denoted by $w$, which obeys the constraint

$$\sum_{\ell=1}^{M} w_{i,\ell} = 1. \tag{6}$$

The variable controlling the $i$-th cluster size is denoted by $\alpha_i$, the $i$-th element of the vector $\alpha$, which obeys the constraint

$$\sum_{i=1}^{C} \alpha_i = 1. \tag{7}$$

The QFCCMM algorithm is obtained by solving the optimization problem subject to Eqs. (5), (6), and (7), where $(q, \lambda, t)$ are the fuzzification parameters satisfying $q > 1$, $\lambda > 0$, and $t > 0$. This method is named "q-divergence-based fuzzy clustering for categorical multivariate data" because the second term of the objective function is the q-divergence. The algorithm is presented below (Kondo and Kanzawa 2018).
Step 1. Set the fuzzification parameters $q > 1$, $\lambda > 0$, and $t > 0$, and the number of clusters $C$. Initialize the typicalities $w$ and the variables controlling the cluster sizes $\alpha$.
Step 6. Check the limiting criterion for $(U, w, \alpha)$. If the criterion is not satisfied, go to Step 2.
The cluster index $i \in \{1, \dots, C\}$ for user #$k$, $f(x_k)$, is determined by the maximal membership, $f(x_k) = \arg\max_{1 \le i \le C} u_{i,k}$. Furthermore, Kondo and Kanzawa proposed using the above QFCCMM algorithm for CF tasks as follows (Kondo and Kanzawa 2019):
Step 1. Define a cut-off value, $\tilde{x}$.
Step 2. Replace each missing value with the lowest value among all the rating values.
Step 4. Calculate $\hat{x}_{k,\ell}$ for the target user #$k$; if no similar user is available for the target user #$k$, set $\hat{x}_{k,\ell} = \bar{x}_{k,\cdot}$.
Step 5. Recommend to the target user #$k$ all items with $\hat{x}_{k,\ell} \ge \tilde{x}$ and $y_{k,\ell} = 0$. ◻
Numerical experiments showed that this algorithm is better than the GroupLens algorithm in terms of recommendation accuracy (Kondo and Kanzawa 2019).

Fuzzy clustering for spherical data based on q-divergence (Higashi et al. 2019)
Higashi et al. (2019) proposed a fuzzy clustering method for spherical data based on q-divergence (QFCS), defined as an optimization problem subject to the constraints in Eqs. (5) and (7) and the unit-norm constraint on the cluster centers,

$$\|v_i\|_2 = 1 \quad \text{for all } i \in \{1, \dots, C\}, \tag{15}$$

where $v_i$ is the $i$-th cluster center and $(q, \lambda)$ are the fuzzification parameters satisfying $q > 1$ and $\lambda > 0$. This method is named "q-divergence-based fuzzy clustering for spherical data" because the second term of the objective function is the q-divergence. Both QFCCMM and QFCS are based on q-divergence and differ in the target data type: as the names indicate, QFCCMM is for categorical multivariate data, and QFCS is for spherical data. The QFCS algorithm is described as follows (Higashi et al. 2019).
Step 1. Fix $q > 1$ and $\lambda > 0$. Set the initial cluster centers $v$ and the initial variables controlling cluster sizes $\alpha$.
Step 5. Check the limiting criterion for $(U, v, \alpha)$. If the criterion is not satisfied, go to Step 2.
Higashi et al. (2019) showed through numerical experiments on 16 real document datasets that QFCS outperformed the conventional methods in terms of clustering accuracy.
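The QFCS update rules are not reproduced in this section, so the following sketch deliberately swaps in a simpler technique, hard spherical k-means, only to illustrate what clustering on the unit hypersphere means: data and cluster centers are kept at unit Euclidean norm, and assignment uses the inner product (cosine similarity). It has no memberships, no cluster-size variables, and no q-divergence term, and the function name `spherical_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def spherical_kmeans(X, centers, n_iter=20):
    """Hard spherical k-means (a stand-in, NOT QFCS).

    Assumes non-negative, non-zero rows so that no norm vanishes.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)          # data on the unit hypersphere
    V = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    for _ in range(n_iter):
        labels = np.argmax(X @ V.T, axis=1)                   # nearest center by cosine similarity
        for i in range(V.shape[0]):
            members = X[labels == i]
            if len(members) > 0:
                s = members.sum(axis=0)
                V[i] = s / np.linalg.norm(s)                  # re-project so that ||v_i||_2 = 1
    return labels, V

# Hypothetical rating vectors: two users who favor items 3-4, two who favor items 1-2.
R = np.array([[1, 1, 5, 5],
              [2, 2, 4, 4],
              [5, 5, 1, 1],
              [4, 5, 1, 2]], dtype=float)
labels, V = spherical_kmeans(R, centers=R[[0, 2]])
```

Because only directions matter after normalization, users #1 and #2 of this toy matrix fall into the same cluster even though their raw rating magnitudes differ.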

Proposed method
In a previous work, the neighborhood for the target users was defined using the QFCCMM clustering algorithm.
QFCCMM (Kondo and Kanzawa 2018) was originally proposed for categorical multivariate data, such as document data. When QFCCMM is applied to the CF task, the users' rating vectors are treated as categorical multivariate data. On the other hand, GroupLens does not treat users' rating vectors as categorical multivariate data. For Pearson's coefficient used in GroupLens, given in Eq. (2), all rating vectors have uniform magnitude, and they lie on the unit hypersphere whose dimension is the number of items. In other words, GroupLens treats users' rating vectors as spherical data. Thus, we propose adopting QFCS instead of QFCCMM to segment users' rating vectors, and we apply GroupLens to the user segment to which the target user belongs. Incorporating Algorithm 2.3, we propose the following algorithm for estimating the missing values:
Step 1. Define a cut-off value, $\tilde{x}$.
Step 2. Replace each missing value with the lowest value among all rating values.
Step 3. Normalize the rating values of each user so that every user's rating vector lies on the unit hypersphere.
Step 4. Process Algorithm 2.3 for $X$.
Step 5. Apply the GroupLens algorithm within each cluster to estimate the missing values $\hat{x}_{k,\ell}$.
Step 6. Recommend to the target user #$k$ all items with $\hat{x}_{k,\ell} \ge \tilde{x}$ and $y_{k,\ell} = 0$. ◻
The flow of Algorithm 3 is described using Tables 1-6. Table 1 shows an initial rating matrix for five users and four items, where user #1 has not yet evaluated item #4, which is denoted by "N/A." Applying Step 2 of Algorithm 3 to Table 1, we obtain the rating matrix shown in Table 2: $x_{1,4}$, denoted by "N/A", is replaced with $\min_{1 \le k \le 5,\ 1 \le \ell \le 4,\ (k,\ell) \ne (1,4)} x_{k,\ell} = 1$. Applying Step 3 of Algorithm 3 to Table 2, we obtain the rating matrix shown in Table 3: the rating values are normalized for each user, which prepares the data for clustering for spherical data. Applying Step 4 of Algorithm 3 to Table 3, we obtain the rating matrix shown in Table 4, where user #1 is placed in cluster #1. Immediately before Step 5 of Algorithm 3 is applied to cluster #1 in Table 3, the value $x_{1,4}$ is restored to "N/A", to be predicted, as shown in Table 5. Applying Step 5 of Algorithm 3 to cluster #1 in Table 5, the restored "N/A" is replaced with the predicted rating value, as shown in Table 6. If the estimated value is higher than a given cut-off value $\tilde{x}$, the corresponding item is recommended to the target user.
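The preprocessing in the walkthrough above (filling the unevaluated entries, then normalizing each user) can be sketched as follows. This is a minimal illustration under stated assumptions: missing ratings are encoded as NaN, the norm is Euclidean, every user has at least one non-zero rating, and the function name `fill_and_normalize` is hypothetical. Rows #2 and #4 of the toy matrix are invented for illustration; Tables 1-3 themselves are not reproduced here.

```python
import numpy as np

def fill_and_normalize(X):
    """Fill missing ratings with the lowest observed rating, then put each
    user's rating vector on the unit hypersphere (Euclidean norm 1)."""
    filled = np.where(np.isnan(X), np.nanmin(X), X)
    norms = np.linalg.norm(filled, axis=1, keepdims=True)
    return filled, filled / norms

# Five users, four items; user #1 has not rated item #4.
X = np.array([[1, 1, 5, np.nan],
              [5, 5, 1, 1],      # hypothetical row
              [2, 2, 4, 4],
              [4, 5, 1, 2],      # hypothetical row
              [1, 1, 5, 5]], dtype=float)
filled, unit = fill_and_normalize(X)
```

Here the missing entry is tentatively set to 1, the lowest observed rating, and every row of `unit` has length one, which is the form expected by a clustering method for spherical data.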

Table 6 Example of the rating matrix after Step 5 of Algorithm 3: N = 3 and M = 4. $x_{1,4}$ is replaced with the predicted value, $x_{1,4} \simeq 3.83$. If the predicted value is higher than a predefined cut-off value $\tilde{x}$, the corresponding item is recommended to the corresponding user.

User    Item #1    Item #2    Item #3    Item #4
#1      1          1          5          ≃ 3.83
#3      2          2          4          4
#5      1          1          5          5

from 1 to 10, with 10 being the best score. Thus, each profile was evaluated by at least 230 users, and each user evaluated at least 230 profiles. In our experiment, only 400,955 ratings from 866 users for 1156 profiles were used. The "MovieLens" dataset was compiled through the "MovieLens" website (Harper and Konstan 2015). This dataset contains the ratings of users for kinds of movies. In "MovieLens", 6040 users recorded 1,000,000 ratings for 3900 movie titles, but we used 277,546 ratings from 905 users for 684 movies in our experiment. Therefore, each movie was evaluated by more than 240 people, and each user rated over 200 movies. Further, the ratings were scaled from 1 to 5, with 5 being the best score. The "SUSHI" dataset (Kamishima and Akaho 2009) was compiled by Toshihiro Kamishima and contains the ratings of users for kinds of sushi. In "SUSHI", 5000 users recorded 50,000 ratings for 100 kinds of sushi. Further, the ratings were scaled from 1 to 5, with 5 being the best score.

Experimental setting
Algorithm 2.1 did not contain parameter settings. In Algorithm 2.2, the cluster numbers and fuzzification parameters were set as $C \in \{2, 3, \dots, 20\}$, $q \in \{1.0001, 1.0004, 1.0007, 1.001, 1.01, 1.1\}$, $\lambda \in \{10^{0}, \dots, 10^{6}\}$, and $t \in \{10^{-6}, \dots, 10^{-2}\}$. In Step 1 of Algorithm 2.2, all the variables controlling cluster sizes were initialized with the reciprocal of the cluster number, and the item typicality values were initialized at random. Among 10 initial settings, the clustering result with the maximal objective function value was selected for Step 3 in Algorithm 2.2. In Algorithm 3, the cluster number and fuzzification parameters were set the same as in Algorithm 2.2 except for $t$, which was not needed. In Step 1 of Algorithm 2.3, all the variables controlling cluster sizes were initialized with the reciprocal of the cluster number, and the cluster center values were initialized at random. Among 10 initial settings, the clustering result with the minimal objective function value was selected for Step 3 in Algorithm 3.
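To give a sense of the size of this parameter search, the grids can be enumerated as below. This is a sketch under one explicit assumption: the ranges written with an ellipsis are taken to advance in powers of ten, which the text does not state. The variable names, including `lambdas` for the second fuzzification parameter, are hypothetical.

```python
from itertools import product

clusters = range(2, 21)                               # C = 2, ..., 20
qs = [1.0001, 1.0004, 1.0007, 1.001, 1.01, 1.1]
lambdas = [10.0 ** e for e in range(0, 7)]            # assumption: decade steps 10^0 .. 10^6
ts = [10.0 ** e for e in range(-6, -1)]               # assumption: decade steps 10^-6 .. 10^-2

grid_categorical = list(product(clusters, qs, lambdas, ts))   # settings with t
grid_spherical = list(product(clusters, qs, lambdas))         # settings without t
```

Under this assumption the spherical variant has one fewer parameter to tune, so its grid is five times smaller than the categorical one.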
The experiment was performed as follows. First, 10,000 rating values in the "BookCrossing" dataset, 20,000 rating values in each of the "Epinions", "Jester", "LibimSeTi", and "MovieLens" datasets, and 10,000 rating values in the "SUSHI" dataset were randomly selected from the originally evaluated values and treated as missing. This is because the originally evaluated values were needed as ground truth for evaluating the recommendation accuracy of the algorithms. Note that the originally missing values were not used. After these true rating values were hidden from the original datasets, Algorithms 2.1, 2.2, and 3 predicted the hidden rating values. The predicted and true rating values were then used to calculate an evaluation measure of recommendation accuracy, which is described in the next subsection. These experiments were executed for five different random selections of the hidden values.
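The hold-out step described above can be sketched as follows. It is a minimal illustration, not the authors' code: missing ratings are encoded as NaN by assumption, and the function name `hide_ratings` and its `seed` parameter are hypothetical.

```python
import numpy as np

def hide_ratings(X, n_hidden, seed=0):
    """Randomly hide n_hidden observed ratings.

    Returns the masked matrix and the (row, col) indices of the hidden
    entries, whose true values in X serve as ground truth for scoring.
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))   # originally missing entries are never selected
    picked = rng.choice(len(observed), size=n_hidden, replace=False)
    hidden = observed[picked]
    X_train = X.copy()
    X_train[hidden[:, 0], hidden[:, 1]] = np.nan
    return X_train, hidden
```

Running the function with five different seeds would correspond to the five settings of selecting missing values mentioned in the text.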

Evaluation measure
We applied the three algorithms (Algorithms 2.1, 2.2, and 3) to these six real datasets, and then compared the obtained recommendation accuracy using the area underneath the receiver operating characteristic (ROC) curve (AUROC) (Swets 1979;Hanley and McNeil 1982), defined as follows.
All algorithms recommend an item if the corresponding estimated rating value is higher than the predefined cut-off value $\tilde{x}$. If the true rating value is higher than $\tilde{x}$, the item should be recommended. Here, the following four numbers are considered:
• True positive (TP) is the number of items that the algorithm recommended and that should be recommended.
• True negative (TN) is the number of items that the algorithm did not recommend and that should not be recommended.
• False positive (FP) is the number of items that the algorithm recommended but that should not be recommended.
• False negative (FN) is the number of items that the algorithm did not recommend but that should be recommended.
The true positive rate (TPR) is the proportion of TP among the items that should be recommended, TP/(TP + FN). The false positive rate (FPR) is the proportion of FP among the items that should not be recommended, FP/(FP + TN). TPR and FPR, as well as TP, TN, FP, and FN, change according to the cut-off $\tilde{x}$. The ROC curve is drawn by connecting several pairs of FPR and TPR obtained from different cut-off values $\tilde{x}$, and AUROC is the area under the ROC curve. The higher the AUROC value, the more accurate the result of the CF algorithm. In this experiment, the AUROC was calculated using discrete cut-off values from 0.1 to the maximal rating value in increments of 0.1.
Tables 7-12 show the highest AUROC value for each method and the parameter values at which it was achieved. Table 13 shows their summary, where the highest AUROC value among the three methods is underlined. Table 13 indicates that all algorithms produced the same AUROC values for two datasets: Epinions and SUSHI; Algorithm 2.2 and Algorithm 3 produced the same AUROC values for one dataset: MovieLens, which are higher than those obtained by Algorithm 2.1; and Algorithm 3 produced higher AUROC values than the others for the Epinions, Jester, and LibimSeTi datasets. Table 13 shows that the AUROC value obtained from Algorithm 3 is higher than or the same as those obtained from the other methods for all datasets. Therefore, the proposed algorithm is better than the others in terms of recommendation accuracy. The better recommendation accuracy of the proposed method is attributed to the fact that clustering for spherical data allows segmenting users more accurately than clustering for categorical multivariate data.
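The evaluation measure defined above can be sketched as follows, using the standard rates TPR = TP/(TP + FN) and FPR = FP/(FP + TN). Several details are assumptions rather than the authors' procedure: the comparison with the cut-off uses ≥ (as in the recommendation step of the algorithms), cut-offs at which a rate is undefined are skipped, the curve is anchored at (0, 0) and (1, 1), and the area is computed with the trapezoidal rule. The function name `auroc` is hypothetical.

```python
import numpy as np

def auroc(true, est, max_rating, step=0.1):
    """Area under the ROC curve traced by sweeping the cut-off value."""
    true, est = np.asarray(true, float), np.asarray(est, float)
    points = [(0.0, 0.0), (1.0, 1.0)]          # assumption: anchor both corners
    # Rounding avoids cut-offs such as 3.0000000000000004.
    for c in np.round(np.arange(step, max_rating + step / 2, step), 10):
        should = true >= c                     # item should be recommended
        rec = est >= c                         # item is recommended
        tp = np.sum(rec & should)
        fn = np.sum(~rec & should)
        fp = np.sum(rec & ~should)
        tn = np.sum(~rec & ~should)
        if tp + fn == 0 or fp + tn == 0:
            continue                           # a rate is undefined at this cut-off
        points.append((fp / (fp + tn), tp / (tp + fn)))
    pts = np.array(sorted(points))
    fpr, tpr = pts[:, 0], pts[:, 1]
    # Trapezoidal rule over the (FPR, TPR) pairs sorted by FPR.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```

With this sketch, predictions identical to the true ratings give an area of 1, and predictions that reverse a two-item ranking give an area of 0.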

Conclusion
In this study, we proposed a CF algorithm based on q-divergence-based fuzzy clustering for spherical data. The experiment was conducted on six datasets using three different algorithms. The results indicate that the proposed algorithm outperforms the conventional methods in terms of recommendation accuracy, which is attributed to the fact that clustering for spherical data enables a more accurate segmentation of users than clustering for categorical multivariate data. The results thus indicate that users' rating vectors should be treated as spherical data rather than categorical multivariate data to improve recommendation accuracy.
There is a major limitation in this study. The proposed algorithm must be applied with a predefined cluster number and two fuzzification parameter values. The experiment was conducted over several cluster numbers and fuzzification parameter values, and the best AUROC value was compared with those of the conventional methods. This means that the proposed method achieves high recommendation accuracy provided that the predefined cluster number and fuzzification parameters are set adequately. If they are not set adequately, the recommendation accuracy would degrade and could possibly be worse than that of the conventional methods.
To overcome this limitation, future research aims to select an appropriate cluster number and fuzzification parameter values for the proposed method; for example, adopting cluster validity indices (Dunn 1974;Gath and Geva 1989;Xie and Beni 1991;Wang and Zhang 2007) and conducting cross validation.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.