1 Introduction

The amount of information available on digital platforms has grown so large that it is difficult for users to select the information that is truly relevant to them. Recommender systems are a powerful tool for helping people choose products, activities, and friends from many candidate options. Although recommender systems such as those deployed by Amazon.com have become ubiquitous, their recommendation accuracy is not yet sufficient to satisfy users' growing needs. This study is motivated by the need for more accurate recommender systems.

Among the many techniques combined in recommender systems, collaborative filtering (CF) is the most fundamental (Paul et al. 1994; Sarwar et al. 2001): it filters items (products, activities, or friends) that a user may like based on the preferences of similar users. The most representative CF method is GroupLens (Herlocker et al. 1999), which is simple and time efficient. However, its notion of "similar users" is heuristically defined, and a more adequate definition of similarity can help CF suggest more appropriate items. We assume that users implicitly belong to latent groups and that users in the same group have similar preferences. If we can detect such groups, we can identify the users similar to a target user, and CF can then suggest items to the target user based on the preferences of those similar users.

Clustering is one technique for detecting such latent groups, and many clustering methods have been proposed for different types of data. Honda (2016) proposed fuzzy clustering for categorical multivariate data induced by multinomial mixture models (FCCMM), which is based on a cluster-wise bag-of-words concept. Kondo and Kanzawa (2018) modified the FCCMM algorithm into q-divergence-based fuzzy clustering for categorical multivariate data induced by multinomial mixture models (QFCCMM). The q-divergence was adopted because it is not only a generalization of the standard Kullback-Leibler divergence used in FCCMM but also the divergence discussed in Tsallis statistics, whose predictions and consequences have been confirmed in a wide spectrum of complex systems (Tsallis 2009). Using the q-divergence instead of the Kullback-Leibler divergence offered the potential to capture clusters more adequately, and QFCCMM indeed achieved higher clustering accuracy than FCCMM (Kondo and Kanzawa 2018). Furthermore, in a previous study (Kondo and Kanzawa 2019), we proposed applying QFCCMM as a preparatory step of GroupLens for CF tasks and showed that the QFCCMM-based CF algorithm outperforms not only the GroupLens algorithm but also the FCCMM-based CF algorithm.

A clustering method should be chosen according to the type of the given data. FCCMM (Honda et al. 2015) and QFCCMM (Kondo and Kanzawa 2018) were originally proposed for categorical multivariate data, such as document data. When applying FCCMM or QFCCMM to the CF task, the vector of ratings of items given by a user (the rating vector) is treated as categorical multivariate data. GroupLens, on the other hand, does not treat the rating vector as categorical multivariate data: Pearson's correlation coefficient used in GroupLens depends on the directions of the users' rating vectors rather than on their magnitudes. Because the rating vectors are thereby reduced to uniform magnitude, they lie on the unit hypersphere whose dimension is the number of items; in other words, GroupLens treats the rating vector as spherical data. There is thus a logical gap: users are treated as categorical multivariate data in the QFCCMM-based CF algorithm, whereas they are treated as spherical data in the GroupLens algorithm. A clustering method designed for spherical data has the potential to close this gap. In a previous study, Higashi et al. proposed q-divergence-based fuzzy clustering for spherical data, referred to as QFCS (Higashi et al. 2019), and demonstrated that it achieved higher clustering accuracy than conventional methods on several document datasets. Although QFCS is worth applying not only to clustering documents but also to clustering rating vectors for CF tasks, it has not been applied to CF in the literature.

In this study, we propose a CF algorithm built on the QFCS clustering algorithm. First, every unevaluated element of the given rating matrix is tentatively set to the lowest value among all evaluated values. All values are then normalized such that every user's rating vector lies on the unit hypersphere. Second, the QFCS algorithm segments the users' rating vectors into clusters. Third, the GroupLens algorithm is applied within each cluster. Finally, an item is recommended if its estimated rating value is higher than a predefined cut-off value. Through numerical experiments on six real datasets, the proposed method is compared with two competing methods (GroupLens and the QFCCMM-based algorithm). The experimental results indicate that the proposed algorithm outperforms both in terms of recommendation accuracy.

The remainder of this paper is organized as follows: Sect. 2 introduces a representative CF algorithm, GroupLens; the clustering-based CF algorithm from our previous work, the QFCCMM-based CF algorithm; and the QFCS fuzzy clustering algorithm for spherical data. Section 3 presents the proposed CF algorithm. Section 4 presents numerical experiments, and Sect. 5 concludes the paper.

2 Preliminaries

2.1 Conventional collaborative filtering method: GroupLens

The most frequently used CF algorithms are based on the concept of "neighborhood" (Herlocker et al. 1999): a target user's neighbors are selected according to how closely their preferences match the target user's, and the target user's latent preferences are then estimated from those neighbors' preferences.

Let N and M be the numbers of users and items, respectively. Let \(x_{k,\ell }(\ge 0)\) (\(k\in \{1,\dots ,N\}\), \(\ell \in \{1,\dots ,M\}\)) be the rating value that user \(\#k\) gave to item \(\#\ell \). The matrix whose \((k,\ell )\)-th element is \(x_{k,\ell }\) is denoted by X. Since users do not evaluate all items, some elements of X are missing, and the goal of CF is to estimate these missing values. Let \(y_{k,\ell }\in \{0,1\}\) be the indicator of whether user \(\#k\) evaluated item \(\#\ell \), defined as

$$\begin{aligned} y_{k,\ell }={\left\{ \begin{array}{ll} 1&{}(\text {The user}\,\#k\,\text {evaluated the item}\,\#\ell ),\\ 0&{}(\text {The user}\,\#k\,\text {has not evaluated the item}\,\#\ell ). \end{array}\right. } \end{aligned}$$
(1)

Y denotes the matrix whose \((k,\ell )\)-th element is \(y_{k,\ell }\). Let \(\varOmega _{\text {item}}(k)\) be the set of items that user \(\#k\) evaluated. Let \(\text {sim}{(k,k')}\) be the similarity between the target user \(\#k\) and a neighboring user \(\#k'\). It is defined by Pearson's correlation coefficient over the rating values of items that both users \(\#k\) and \(\#k'\) evaluated, as described by

$$\begin{aligned}&\text {sim}(k,k')=\nonumber \\&\quad \frac{\displaystyle {\sum _{\ell \in \varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')} \left( x_{k,\ell }-\overline{x_{k,\cdot }^{k'}}\right) \left( x_{k',\ell }-\overline{x_{k',\cdot }^{k}}\right) }}{\sqrt{\displaystyle {\sum _{\ell \in \varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')} \left( x_{k,\ell }-\overline{x_{k,\cdot }^{k'}}\right) ^2}} \sqrt{\displaystyle {\sum _{\ell \in \varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')} \left( x_{k',\ell }-\overline{x_{k',\cdot }^{k}}\right) ^2}}}, \end{aligned}$$
(2)

where \(\overline{x_{k,\cdot }^{k'}}\) is the mean rating value of the user \(\#k\) for items that both users \(\#k\) and \(\#k'\) evaluated, described as

$$\begin{aligned} \overline{x_{k,\cdot }^{k'}}=\frac{\displaystyle { \sum _{\ell \in \varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')} x_{k,\ell }}}{|\varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')|}. \end{aligned}$$
(3)

If \(\varOmega _{\text {item}}(k)\cap \varOmega _{\text {item}}(k')\) is empty, \(\text {sim}(k,k')\) is set to zero. Let \(\hat{x}_{k,\ell }\) denote the estimate of the missing rating for an item \(\#\ell \) that the target user \(\#k\) has not evaluated, and let \(\bar{x}_{k,\cdot }\) be the mean rating value of user \(\#k\) over the items that user \(\#k\) evaluated:

$$\begin{aligned} \bar{x}_{k,\cdot }=\frac{ \sum _{\ell \in \varOmega _{\text {item}}(k)}x_{k,\ell } }{ |\varOmega _{\text {item}}(k)| }. \end{aligned}$$

The GroupLens method (Herlocker et al. 1999) estimates the unknown rating value \(\hat{x}_{k,\ell }\) of the target user \(\#k\) such that the deviation between \(\hat{x}_{k,\ell }\) and \(\bar{x}_{k,\cdot }\) is the correlation-weighted mean of the deviations between \(x_{k',\ell }\) and \(\bar{x}_{k',\cdot }\), where \(\#k'\) ranges over the users with a nonnegative correlation with the target user \(\#k\). The estimated rating value \(\hat{x}_{k,\ell }\) of the target user \(\#k\) is thus described as

$$\begin{aligned} \hat{x}_{k,\ell }&=\bar{x}_{k,\cdot }+ \frac{ \sum _{\begin{array}{c} k'\in \varOmega _{\text {user}}(\ell )\\ \text {sim}(k,k')\ge 0 \end{array}} \text {sim}(k,k')(x_{k',\ell }-\bar{x}_{k',\cdot })}{\sum _{\begin{array}{c} k'\in \varOmega _{\text {user}}(\ell )\\ \text {sim}(k,k')\ge 0 \end{array}} \text {sim}(k,k')}, \end{aligned}$$
(4)

where \(\varOmega _{\text {user}}(\ell )\) is the set of users who evaluated the item \(\#\ell \). If there is no user \(\#k'\) satisfying both \(k'\in \varOmega _{\text {user}}(\ell )\) and \(\text {sim}(k,k')\ge 0\) for the target user \(\#k\), \(\hat{x}_{k,\ell }\) in Eq. (4) is just \(\bar{x}_{k,\cdot }\).

The GroupLens algorithm [GroupLens] is summarized as follows:

Step 1. Obtain the similarities among users according to their preferences as Eq. (2).

Step 2. Estimate the missing values \(\hat{x}_{k,\ell }\) (\(k\in \{1,\dots ,N\}\), \(\ell \in \{1,\dots ,M\}\)) if \(y_{k,\ell }=0\), as Eq. (4). \(\square \)
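
For concreteness, the following is a minimal Python sketch of this procedure using NumPy. The function name `grouplens_predict` and the dense-array representation of X and Y are illustrative choices of ours, not part of the original method; the sketch assumes every user has rated at least one item.

```python
import numpy as np

def grouplens_predict(X, Y):
    """Sketch of GroupLens: X is the N-by-M rating matrix and Y the 0/1
    indicator matrix of Eq. (1). Entries of X where Y == 0 are ignored;
    they are returned filled with the estimates of Eq. (4)."""
    N, M = X.shape
    user_mean = np.array([X[k, Y[k] == 1].mean() for k in range(N)])
    sim = np.zeros((N, N))
    for k in range(N):
        for k2 in range(N):
            co = (Y[k] == 1) & (Y[k2] == 1)   # items co-rated by #k and #k2
            if not co.any():
                continue                       # empty intersection: sim stays 0
            a = X[k, co] - X[k, co].mean()     # deviations from the Eq. (3) means
            b = X[k2, co] - X[k2, co].mean()
            denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
            if denom > 0.0:
                sim[k, k2] = (a * b).sum() / denom   # Pearson coefficient, Eq. (2)
    X_hat = X.astype(float).copy()
    for k in range(N):
        for l in range(M):
            if Y[k, l] == 1:
                continue
            # neighbors: raters of item #l with nonnegative similarity
            nb = [k2 for k2 in range(N) if Y[k2, l] == 1 and sim[k, k2] >= 0.0]
            w = np.array([sim[k, k2] for k2 in nb])
            if len(nb) == 0 or w.sum() == 0.0:
                X_hat[k, l] = user_mean[k]     # fallback noted after Eq. (4)
            else:
                dev = np.array([X[k2, l] - user_mean[k2] for k2 in nb])
                X_hat[k, l] = user_mean[k] + (w * dev).sum() / w.sum()   # Eq. (4)
    return X_hat
```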

2.2 QFCCMM-based CF (Kondo and Kanzawa 2019)

In the GroupLens method, the users \(\#k'\) (\(k'\in \{1,\dots ,N\}\)) similar to the target user \(\#k\) are heuristically defined as those satisfying \(\text {sim}(k,k')\ge 0\) in Eq. (4). Note that there is no theoretical basis for this definition, and there are many other ways to define the users similar to the target user. We focus on clustering users based on their preferences. Kondo and Kanzawa proposed the QFCCMM algorithm (Kondo and Kanzawa 2018), as follows. Let \(X=\{x_k\in \mathbb {R}^M\mid k\in \{1,\ldots ,N\}, x_{k,\ell }\ge 0, \ell \in \{1,\dots ,M\}\}\) be a categorical multivariate dataset, where \(x_{k,\ell }\) represents the co-occurrence relation between the k-th user and the \(\ell \)-th item. The membership of \(x_k\) to the i-th cluster is denoted by \(u_{i,k}\) \((i\in \{1,\ldots ,C\}, k\in \{1,\ldots ,N\})\), and the set of \(u_{i,k}\) is denoted by U. U obeys the constraint

$$\begin{aligned} \sum _{i=1}^C u_{i,k}=1,\quad (k\in \{1,\dots ,N\}). \end{aligned}$$
(5)

The typicality of the \(\ell \)-th item for the i-th cluster is denoted by \(w_{i,\ell }\) \((i\in \{1,\dots ,C\}, \ell \in \{1,\dots ,M\})\); the set of \(w_{i,\ell }\) is denoted by w, which obeys the constraint

$$\begin{aligned} \sum _{\ell =1}^Mw_{i,\ell }=1\text { and }w_{i,\ell }\in [0,1], (i\in \{1,\dots ,C\}). \end{aligned}$$
(6)

The variable controlling the size of the i-th cluster is denoted by \(\pi _i\), the i-th element of the vector \(\pi \), which obeys the following constraint:

$$\begin{aligned} \sum _{i=1}^C \pi _i=1. \end{aligned}$$
(7)

The QFCCMM algorithm is obtained by solving the optimization problem

$$\begin{aligned}&{\mathop {\hbox {maximize}}\limits _{U,w,\pi }} \sum _{i=1}^C\sum _{k=1}^N\sum _{\ell =1}^M(\pi _i)^{1-q}(u_{i,k})^q \frac{1}{t}\left( \left( w_{i,\ell }\right) ^t-1\right) x_{k,\ell }\nonumber \\&\quad -\frac{\lambda ^{-1}}{q-1}\left( \sum _{i=1}^C\sum _{k=1}^N(\pi _i)^{1-q}(u_{i,k})^q-1\right) \end{aligned}$$
(8)

subject to Eqs. (5), (6), and (7), where \((q,\lambda ,t)\) are the fuzzification parameters satisfying \(q>1\), \(\lambda >0\), and \(t>0\). This method is named “q-divergence-based fuzzy clustering for categorical multivariate data” because the second term of the objective function is the q-divergence. The algorithm is presented below (Kondo and Kanzawa 2018).

Step 1. Set the fuzzification parameters \(q>1\), \(\lambda >0\), and \(t>0\), and the number of clusters C. Initialize the typicalities w and the variables controlling cluster size \(\pi \).

Step 2. Calculate s as

$$\begin{aligned} s_{i,k}=\frac{1}{t}\sum _{\ell =1}^M\left( \left( w_{i,\ell }\right) ^t-1\right) x_{k,\ell } \end{aligned}$$
(9)

for all \(i\in \{1,\dots ,C\}\) and \(k\in \{1,\dots ,N\}\).

Step 3. Calculate U as

$$\begin{aligned} u_{i,k}=\frac{\pi _i(1+\lambda (1-q)s_{i,k})^{1/(1-q)} }{ \sum _{j=1}^C\pi _j(1+\lambda (1-q)s_{j,k})^{1/(1-q)} } \end{aligned}$$
(10)

for all \(i\in \{1,\dots ,C\}\) and \(k\in \{1,\dots ,N\}\).

Step 4. Calculate w as

$$\begin{aligned} w_{i,\ell }=\frac{ \left( \sum _{k=1}^N(u_{i,k})^q x_{k,\ell }\right) ^{1/(1-t)} }{ \sum _{r=1}^M\left( \sum _{k=1}^N(u_{i,k})^q x_{k,r}\right) ^{1/(1-t)} } \end{aligned}$$
(11)

for all \(i\in \{1,\dots ,C\}\) and \(\ell \in \{1,\dots ,M\}\).

Step 5. Calculate \(\pi \) as

$$\begin{aligned} \pi _i=\frac{ \left( \sum _{k=1}^N(u_{i,k})^q(1+\lambda (1-q)s_{i,k})\right) ^{1/q} }{ \sum _{j=1}^C \left( \sum _{k=1}^N(u_{j,k})^q(1+\lambda (1-q)s_{j,k})\right) ^{1/q} } \end{aligned}$$
(12)

for all \(i\in \{1,\dots ,C\}\).

Step 6. Check the termination criterion for \((U, w, \pi )\). If the criterion is not satisfied, go to Step 2.

The cluster index \(i\in \{1,\dots ,C\}\) assigned to user \(\#k\), denoted by \(f(x_k)\), is determined by

$$\begin{aligned} f(x_k)=\text {arg} \max _{1\le {}j\le {}C}\{u_{j,k}\}. \end{aligned}$$
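
The update steps above translate almost line for line into NumPy. The following is a minimal sketch under illustrative assumptions of ours (the function name, random initialization of the typicalities w, and a simple convergence test on U); it is not the authors' reference implementation.

```python
import numpy as np

def qfccmm(X, C, q, lam, t, n_iter=100, tol=1e-9, seed=None):
    """Sketch of the QFCCMM iteration, Eqs. (9)-(12). X is the
    nonnegative N-by-M data matrix; returns (U, w, pi)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    w = rng.random((C, M))
    w /= w.sum(axis=1, keepdims=True)        # typicalities satisfying Eq. (6)
    pi = np.full(C, 1.0 / C)                 # cluster sizes satisfying Eq. (7)
    U = np.zeros((C, N))
    for _ in range(n_iter):
        U_old = U.copy()
        s = ((w ** t - 1.0) / t) @ X.T       # Eq. (9), shape (C, N)
        g = pi[:, None] * (1.0 + lam * (1.0 - q) * s) ** (1.0 / (1.0 - q))
        U = g / g.sum(axis=0, keepdims=True)         # Eq. (10)
        a = ((U ** q) @ X) ** (1.0 / (1.0 - t))
        w = a / a.sum(axis=1, keepdims=True)          # Eq. (11)
        b = ((U ** q) * (1.0 + lam * (1.0 - q) * s)).sum(axis=1) ** (1.0 / q)
        pi = b / b.sum()                              # Eq. (12)
        if np.abs(U - U_old).max() < tol:             # termination criterion
            break
    return U, w, pi
```

The cluster assignment \(f(x_k)\) then follows as `labels = U.argmax(axis=0)`.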

Furthermore, Kondo and Kanzawa proposed using the above QFCCMM algorithm for CF tasks as follows (Kondo and Kanzawa 2019):

Step 1. Define a cut-off value, \(\check{x}\).

Step 2. Replace each missing value with the lowest value among all rating values.

Step 3. Process Algorithm 2.2.

Step 4. Calculate \(\hat{x}\) using

$$\begin{aligned} \hat{x}_{k,\ell }&=\bar{x}_{k,\cdot }+ \frac{ \sum _{\begin{array}{c} k'\in \varOmega _{\text {user}}(\ell )\\ f(x_{k'})\equiv f(x_k) \end{array}} \text {sim}(k,k')(x_{k',\ell }-\bar{x}_{k',\cdot }) }{ \sum _{\begin{array}{c} k'\in \varOmega _{\text {user}}(\ell )\\ f(x_{k'})\equiv f(x_k) \end{array}} \text {sim}(k,k')} \end{aligned}$$
(13)

for all \(k\in \{1,\dots ,N\}\) and \(\ell \in \{1,\dots ,M\}\) with \(y_{k,\ell }=0\). If there is no user \(\#k'\) satisfying both \(k'\in \varOmega _{\text {user}}(\ell )\) and \(f(x_{k'})\equiv f(x_k)\) for the target user \(\#k\), set \(\hat{x}_{k,\ell }=\bar{x}_{k,\cdot }\).

Step 5. Recommend all items to the target user \(\#k\) with \(\hat{x}_{k,\ell }\ge \check{x}\) and \(y_{k,\ell }=0\). \(\square \)

It was shown through numerical experiments that this algorithm outperforms the GroupLens algorithm in terms of recommendation accuracy (Kondo and Kanzawa 2019).

2.3 Fuzzy clustering for spherical data based on q-divergence (Higashi et al. 2019)

Higashi et al. (2019) proposed a fuzzy clustering method for spherical data based on the q-divergence (QFCS), defined by the optimization problem

$$\begin{aligned}&{\mathop {\hbox {minimize}}\limits _{U,w,\pi }} \sum _{i=1}^C\sum _{k=1}^N(\pi _i)^{1-q}(u_{i,k})^q\left( 1-x_k^{\mathsf {T}}v_i\right) \nonumber \\&\quad +\frac{\lambda ^{-1}}{q-1}\left( \sum _{i=1}^C\sum _{k=1}^N(\pi _i)^{1-q}(u_{i,k})^q-1\right) \end{aligned}$$
(14)

which is subject to the constraints in Eqs. (5), (7), and

$$\begin{aligned} \Vert v_i\Vert _2=1\text { for all}\,i\in \{1,\dots ,C\}, \end{aligned}$$
(15)

where \(x_k\) lies on the \((M-1)\)-dimensional unit sphere, and \((q,\lambda )\) are the fuzzification parameters satisfying \(q>1\) and \(\lambda >0\). This method is named "q-divergence-based fuzzy clustering for spherical data" because the second term of the objective function is the q-divergence. Both the QFCCMM and QFCS methods are based on the q-divergence; the difference between them is the target data type: QFCCMM is, as its name indicates, for categorical multivariate data, whereas QFCS is for spherical data. The QFCS algorithm is described as follows (Higashi et al. 2019).

Step 1. Fix \(q>1\) and \(\lambda >0\), and set the number of clusters C. Initialize the cluster centers v and the variables controlling cluster size \(\pi \).

Step 2. Update U as

$$\begin{aligned} u_{i,k}=&\frac{\pi _i\left( 1-\lambda (1-q)\left( 1-x_k^{\mathsf {T}}v_i\right) \right) ^{1/(1-q)}}{\sum _{j=1}^C\pi _{j}\left( 1-\lambda (1-q)\left( 1-x_k^{\mathsf {T}}v_j\right) \right) ^{1/(1-q)}} \end{aligned}$$
(16)

for all \(i\in \{1,\dots ,C\}\) and \(k\in \{1,\dots ,N\}\).

Step 3. Update \(\pi \) as

$$\begin{aligned} \pi _i=&\frac{ \left( \sum _{k=1}^N(u_{i,k})^q\left( 1-(1-q)\lambda {}\left( 1-x_k^{\mathsf {T}}v_i\right) \right) \right) ^{1/q} }{ \sum _{j=1}^C\left( \sum _{k=1}^N(u_{j,k})^q\left( 1-(1-q)\lambda {}\left( 1-x_k^{\mathsf {T}}v_{j}\right) \right) \right) ^{1/q} } \end{aligned}$$
(17)

for all \(i\in \{1,\dots ,C\}\).

Step 4. Calculate \(v_i\) as

$$\begin{aligned} v_i=&\frac{\sum _{k=1}^N (u_{i,k})^q x_k}{\left\| \sum _{k=1}^N (u_{i,k})^q x_k\right\| _2} \end{aligned}$$
(18)

for all \(i\in \{1,\dots ,C\}\).

Step 5. Check the termination criterion for \((U, v, \pi )\). If the criterion is not satisfied, go to Step 2.
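
As with QFCCMM, the QFCS iteration admits a compact NumPy sketch. The function name, the initialization of the centers from randomly chosen data points, and the convergence test on U are illustrative assumptions of ours.

```python
import numpy as np

def qfcs(Xs, C, q, lam, n_iter=100, tol=1e-9, seed=None):
    """Sketch of the QFCS iteration, Eqs. (16)-(18). Rows of Xs must
    have unit Euclidean norm; returns (U, v, pi)."""
    rng = np.random.default_rng(seed)
    N, M = Xs.shape
    v = Xs[rng.choice(N, C, replace=False)]    # initial centers on the sphere
    pi = np.full(C, 1.0 / C)
    U = np.zeros((C, N))
    for _ in range(n_iter):
        U_old = U.copy()
        d = 1.0 - v @ Xs.T                     # dissimilarity 1 - x_k^T v_i, (C, N)
        g = pi[:, None] * (1.0 - lam * (1.0 - q) * d) ** (1.0 / (1.0 - q))
        U = g / g.sum(axis=0, keepdims=True)   # Eq. (16)
        b = ((U ** q) * (1.0 - lam * (1.0 - q) * d)).sum(axis=1) ** (1.0 / q)
        pi = b / b.sum()                       # Eq. (17)
        num = (U ** q) @ Xs                    # Eq. (18): weighted mean of data,
        v = num / np.linalg.norm(num, axis=1, keepdims=True)  # projected to sphere
        if np.abs(U - U_old).max() < tol:      # termination criterion
            break
    return U, v, pi
```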

Through numerical experiments on 16 real document datasets, Higashi et al. (2019) showed that QFCS outperformed conventional methods in terms of clustering accuracy.

3 Proposed method

In a previous work (Kondo and Kanzawa 2019), the neighborhood for the target users was defined using the QFCCMM clustering algorithm.

QFCCMM (Kondo and Kanzawa 2018) was originally proposed for categorical multivariate data, such as document data; when QFCCMM is applied to the CF task, the users' rating vectors are treated as categorical multivariate data. GroupLens, on the other hand, does not treat the rating vectors as categorical multivariate data. Pearson's correlation coefficient used in GroupLens, given in Eq. (2), depends only on the directions of the rating vectors, not on their magnitudes; the vectors are thereby reduced to uniform magnitude and lie on the unit hypersphere whose dimension is the number of items. In other words, GroupLens treats the users' rating vectors as spherical data.

Thus, we propose adopting QFCS instead of QFCCMM to segment the users' rating vectors, and we apply GroupLens within the user segment to which the target user belongs. Incorporating Algorithm 2.3, we propose the following algorithm for estimating the missing values:

Step 1. Define a cut-off value, \(\check{x}\).

Step 2. Replace each missing value with the lowest value among all rating values.

Step 3. Normalize \(\{x_{k,\ell }\}_{\ell =1}^M\) (\(k\in \{1,\dots ,N\}\)) into \(\{\tilde{x}_{k,\ell }\}_{\ell =1}^M\) (\(k\in \{1,\dots ,N\}\)), as

$$\begin{aligned} \tilde{x}_{k,\ell }=\frac{x_{k,\ell }}{\sqrt{\displaystyle {\sum _{\ell '=1}^M\left( x_{k,\ell '}\right) ^2}}}. \end{aligned}$$
(19)

Step 4. Process Algorithm 2.3 for \(\tilde{x}\).

Step 5. Estimate the missing values \(\hat{x}_{k,\ell }\) (\(k\in \{1,\dots ,N\}\), \(\ell \in \{1,\dots ,M\}\)) if \(y_{k,\ell }=0\), as Eq. (13).

Step 6. Recommend all items to the target user \(\#k\) with \(\hat{x}_{k,\ell }\ge \check{x}\) and \(y_{k,\ell }=0\). \(\square \)
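
Putting the pieces together, the whole pipeline might be sketched as follows, reusing the `grouplens_predict` and `qfcs` sketches given earlier (both illustrative, not reference code). Note one detail: `grouplens_predict` keeps GroupLens's nonnegative-similarity condition inside each cluster, whereas Eq. (13) weights all cluster members who rated the item, so the sketch combines both restrictions.

```python
import numpy as np

def qfcs_cf_recommend(X, Y, C, q, lam, cutoff):
    """Sketch of Algorithm 3. X: N-by-M ratings (entries with Y == 0
    are ignored), Y: 0/1 indicators. Assumes nonnegative ratings with
    a positive minimum. Returns the recommended (user, item) pairs."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    filled[Y == 0] = X[Y == 1].min()     # Step 2: tentative lowest rating
    # Step 3, Eq. (19): put every user's rating vector on the unit sphere
    Xs = filled / np.linalg.norm(filled, axis=1, keepdims=True)
    U, _, _ = qfcs(Xs, C, q, lam)        # Step 4: segment the users
    labels = U.argmax(axis=0)            # cluster index f(x_k)
    X_hat = X.copy()
    for i in range(C):
        idx = np.where(labels == i)[0]   # users in cluster #i
        # Step 5, Eq. (13): GroupLens restricted to cluster #i. Since
        # grouplens_predict ignores entries with Y == 0, the tentative
        # fill is effectively restored to "N/A" here before prediction.
        X_hat[idx] = grouplens_predict(X[idx], Y[idx])
    # Step 6: recommend unevaluated items whose estimate reaches the cut-off
    return [(k, l) for k in range(X.shape[0]) for l in range(X.shape[1])
            if Y[k, l] == 0 and X_hat[k, l] >= cutoff]
```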

The flow of Algorithm 3 is illustrated in Tables 1, 2, 3, 4, 5 and 6. Table 1 shows an initial rating matrix for five users and four items, where user #1 has not yet evaluated item #4, denoted by "N/A." Applying Step 2 of Algorithm 3 to Table 1 yields the rating matrix shown in Table 2: \(x_{1,4}\), denoted by "N/A", is replaced with \(\displaystyle {\min _{\begin{array}{c} 1\le {}k\le {}5\\ 1\le {}\ell {}\le {}4\\ (k,\ell )\not \in \{(1,4)\} \end{array}}x_{k,\ell }}=1\). Applying Step 3 of Algorithm 3 to Table 2 yields the rating matrix shown in Table 3: the rating values are normalized for each user, in preparation for clustering for spherical data. Applying Step 4 of Algorithm 3 to Table 3 yields the result shown in Table 4, where user #1 is placed in cluster #1. Immediately before Step 5 of Algorithm 3 is applied to cluster #1 in Table 4, the value \(x_{1,4}\) is restored to "N/A", to be predicted, as shown in Table 5. Applying Step 5 of Algorithm 3 to cluster #1 in Table 5 replaces the restored "N/A" with the predicted rating value, as shown in Table 6. If the estimated value is higher than the given cut-off value \(\check{x}\), the corresponding item is recommended to the target user.

Table 1 Example of initial rating matrix: \(N=5\), \(M=4\), and \(\{x_{k,\ell }\}_{(k,\ell )=(1,1)}^{(5,4)}\) are actual rating values from the users, and \(x_{1,4}\) needs to be predicted
Table 2 Example of rating matrix after Step 2 of Algorithm 3: \(N=5\) and \(M=4\)
Table 3 Example of rating matrix after Step 3 of Algorithm 3: \(N=5\) and \(M=4\)
Table 4 Example of rating matrix after Step 4 of Algorithm 3: \(N=5\), \(M=4\), and \(C=2\)
Table 5 Example of the rating matrix immediately before Step 5 of Algorithm 3: \(N=3\) and \(M=4\)
Table 6 Example of the rating matrix after Step 5 of Algorithm 3: \(N=3\) and \(M=4\)

4 Numerical experiments

Numerical experiments were conducted to compare the CF accuracy of three algorithms, Algorithm 2.1 (GroupLens), Algorithm 2.2 (QFCCMM-based CF), and Algorithm 3 (proposed), using six real datasets: "BookCrossing" (Ziegler et al. 2005), "Epinions" (Massa et al. 2008), "Jester" (Goldberg et al. 2001), "LibimSeTi" (Brozovsky and Petricek 2007), "MovieLens" (Harper and Konstan 2015), and "SUSHI" (Kamishima and Akaho 2009).

4.1 Datasets

The "BookCrossing" dataset was compiled by Cai-Nicolas Ziegler in a four-week crawl of the BookCrossing community, with the kind permission of Ron Hornbaker, CTO of Humankind Systems. It contains 1,149,780 ratings of 271,379 books by 278,858 users (Ziegler et al. 2005). Only 35,179 ratings from 1091 users for 2248 books were used in this experiment; in this subset, each book was evaluated by more than 8 users, and each user rated more than 15 books. The ratings range from 1 to 10, with 10 being the best score.

The "Epinions" dataset (Massa et al. 2008) was collected by Paolo Massa in a five-week crawl of the Epinions.com website, and it contains users' ratings of products such as software, music, and television shows. In "Epinions", 49,290 users recorded 664,824 ratings of 139,738 products; we used 42,808 ratings from 1022 users for 835 products in our experiment. The ratings range from 1 to 5, with 5 being the best score.

The "Jester" dataset (Goldberg et al. 2001) was collected by Ken Goldberg from the Jester Online Joke website, and it contains users' ratings of jokes. In "Jester", 59,132 users recorded around 1.7 million ratings of 150 jokes; we used 373,338 ratings from 2916 users for 140 jokes in our experiment. The ratings range from − 10 to 10, with 10 being the best score.

The "LibimSeTi" profile dataset (Brozovsky and Petricek 2007) was released by Vaclav Petricek of eHarmony.com. It includes 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as of April 4, 2006. The ratings range from 1 to 10, with 10 being the best score. In our experiment, only 400,955 ratings from 866 users for 1156 profiles were used; in this subset, each profile was evaluated by at least 230 users, and each user evaluated at least 230 profiles.

The "MovieLens" dataset was compiled through the "MovieLens" website (Harper and Konstan 2015) and contains users' ratings of movies. In "MovieLens", 6040 users recorded 1,000,000 ratings of 3900 movie titles; we used 277,546 ratings from 905 users for 684 movies in our experiment, in which each movie was evaluated by more than 240 users and each user rated more than 200 movies. The ratings range from 1 to 5, with 5 being the best score.

The "SUSHI" dataset (Kamishima and Akaho 2009) was compiled by Toshihiro Kamishima and contains users' ratings of kinds of sushi. In "SUSHI", 5000 users recorded 50,000 ratings of 100 kinds of sushi. The ratings range from 1 to 5, with 5 being the best score.

4.2 Experimental setting

Algorithm 2.1 has no parameters to set. In Algorithm 2.2, the cluster numbers and fuzzification parameters were set as \(C\in \{2,3,\dots ,20\}\), \(q\in \{1.0001,1.0004,1.0007,1.001,1.01,1.1\}\), \(\lambda \in \{10^0,\dots ,10^6\}\), and \(t\in \{10^{-6},\dots ,10^{-2}\}\). In Step 1 of Algorithm 2.2, all the variables controlling cluster sizes were initialized to the reciprocal of the cluster number, and the item typicality values were initialized at random. Among the 10 initial settings, the clustering result with the maximal objective function value was selected for Step 3 of Algorithm 2.2. In Algorithm 3, the cluster number and fuzzification parameters were set the same as in Algorithm 2.2, except for t, which is not needed. In Step 1 of Algorithm 2.3, all the variables controlling cluster sizes were initialized to the reciprocal of the cluster number, and the cluster centers were initialized at random. Among the 10 initial settings, the clustering result with the minimal objective function value was selected for Step 4 of Algorithm 3.

The experiment was performed as follows. First, 10,000 rating values in the "BookCrossing" dataset, 20,000 in the "Epinions" dataset, 20,000 in the "Jester" dataset, 20,000 in the "LibimSeTi" dataset, 20,000 in the "MovieLens" dataset, and 10,000 in the "SUSHI" dataset were randomly selected from the originally evaluated values and treated as missing, so that the hidden true values could be used for evaluating the recommendation accuracy of the algorithms; the originally missing values were not used. After these true rating values were hidden from the original datasets, Algorithms 2.1, 2.2, and 3 predicted them. The predicted and true rating values were then used for calculating an evaluation measure of recommendation accuracy, described in the next subsection. These experiments were executed for five different random selections of missing values.
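
The hiding procedure might be sketched as follows (an illustrative function of ours, not the code used in the experiments):

```python
import numpy as np

def hide_ratings(Y, n_hidden, seed=0):
    """Randomly pick n_hidden evaluated entries (Y == 1) to serve as the
    hidden test set; returns their indices and a training indicator
    matrix in which they are marked as missing."""
    rng = np.random.default_rng(seed)
    rated = np.argwhere(Y == 1)                 # all evaluated (k, l) pairs
    test = rated[rng.choice(len(rated), size=n_hidden, replace=False)]
    Y_train = Y.copy()
    Y_train[test[:, 0], test[:, 1]] = 0         # hide the true ratings
    return test, Y_train
```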

4.3 Evaluation measure

We applied the three algorithms (Algorithms 2.1, 2.2, and 3) to the six real datasets and compared the resulting recommendation accuracy using the area under the receiver operating characteristic (ROC) curve (AUROC) (Swets 1979; Hanley and McNeil 1982), defined as follows.

All algorithms recommend an item if the corresponding estimated rating value is higher than the predefined cut-off value \(\check{x}\). If the true rating value is higher than \(\check{x}\), the item should be recommended. Here, the following four numbers are considered:

  • True positive (TP) is the number of items that the algorithm recommended and that should be recommended.

  • True negative (TN) is the number of items that the algorithm did not recommend and that should not be recommended.

  • False positive (FP) is the number of items that the algorithm recommended but that should not be recommended.

  • False negative (FN) is the number of items that the algorithm did not recommend but that should be recommended.

The true positive rate (TPR) is the proportion of TP among TP and FN, that is, \(\text {TPR}=\text {TP}/(\text {TP}+\text {FN})\). The false positive rate (FPR) is the proportion of FP among FP and TN, that is, \(\text {FPR}=\text {FP}/(\text {FP}+\text {TN})\). TP, TN, FP, and FN, and hence TPR and FPR, change according to the cut-off \(\check{x}\). The ROC curve is drawn by connecting the (FPR, TPR) pairs obtained from different cut-off values \(\check{x}\), and the AUROC is the area under the ROC curve. The higher the AUROC value, the more accurate the CF algorithm. In this experiment, the AUROC was calculated using discrete cut-off values from 0.1 to the maximal rating value in increments of 0.1.
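
The AUROC computation just described might be sketched as follows (an illustrative function of ours; `x_true` and `x_pred` denote the hidden true ratings and their predictions for the test entries):

```python
import numpy as np

def auroc(x_true, x_pred, cutoffs):
    """Sweep the cut-off, collect (FPR, TPR) pairs, and integrate the
    resulting ROC curve with the trapezoidal rule."""
    pts = []
    for c in cutoffs:
        pos = x_true >= c                  # items that should be recommended
        rec = x_pred >= c                  # items the algorithm recommends
        tp = np.sum(rec & pos)
        fn = np.sum(~rec & pos)
        fp = np.sum(rec & ~pos)
        tn = np.sum(~rec & ~pos)
        tpr = tp / (tp + fn) if tp + fn > 0 else 0.0   # TP / (TP + FN)
        fpr = fp / (fp + tn) if fp + tn > 0 else 0.0   # FP / (FP + TN)
        pts.append((fpr, tpr))
    pts = sorted(pts + [(0.0, 0.0), (1.0, 1.0)])       # anchor the curve ends
    fprs, tprs = zip(*pts)
    return np.trapz(tprs, fprs)            # area under the ROC curve
```

For the setting above, `cutoffs = np.arange(0.1, max_rating + 0.05, 0.1)` reproduces the grid from 0.1 to the maximal rating value in increments of 0.1.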

4.4 Results and discussion

Tables 7, 8, 9, 10, 11 and 12 show the highest AUROC value for each method and the parameter values at which it was achieved. Table 13 summarizes these results, with the highest AUROC value among the three methods underlined.

Table 13 indicates that all algorithms produced the same AUROC values for two datasets, BookCrossing and SUSHI; that Algorithms 2.2 and 3 produced the same AUROC values for one dataset, MovieLens, higher than that obtained by Algorithm 2.1; and that Algorithm 3 produced higher AUROC values than the other algorithms for the Epinions, Jester, and LibimSeTi datasets.

Table 13 shows that the AUROC value obtained by Algorithm 3 is higher than or equal to those obtained by the other methods for all datasets. Therefore, the proposed algorithm is better than the others in terms of recommendation accuracy. The better recommendation accuracy of the proposed method is attributed to the fact that clustering for spherical data segments users more accurately than clustering for categorical multivariate data.

Table 7 Highest AUROC value for each method and the corresponding parameter values for the "BookCrossing" dataset
Table 8 Highest AUROC value for each method and the corresponding parameter values for the "Epinions" dataset
Table 9 Highest AUROC value for each method and the corresponding parameter values for the "Jester" dataset
Table 10 Highest AUROC value for each method and the corresponding parameter values for the "LibimSeTi" dataset
Table 11 Highest AUROC value for each method and the corresponding parameter values for the "MovieLens" dataset
Table 12 Highest AUROC value for each method and the corresponding parameter values for the "SUSHI" dataset
Table 13 Summary of the highest AUROC values for all real datasets

5 Conclusion

In this study, we proposed a CF algorithm based on q-divergence-based fuzzy clustering for spherical data. Experiments were conducted on six real datasets using three algorithms. The results indicate that the proposed algorithm outperforms the conventional methods in terms of recommendation accuracy, which is attributed to the fact that clustering for spherical data segments users more accurately than clustering for categorical multivariate data. The results thus indicate that users' rating vectors should be treated as spherical data rather than as categorical multivariate data to achieve better recommendation accuracy.

This study has a major limitation: the proposed algorithm must be applied with a predefined cluster number and two fuzzification parameter values. The experiment was conducted over several cluster numbers and fuzzification parameter values, and the best AUROC value was compared with those of the conventional methods. This means that the proposed method achieves high recommendation accuracy provided that the cluster number and fuzzification parameters are set adequately; if they are not, the recommendation accuracy would degrade and could be worse than that of the conventional methods.

To overcome this limitation, future research will aim to select an appropriate cluster number and fuzzification parameter values for the proposed method, for example, by adopting cluster validity indices (Dunn 1974; Gath and Geva 1989; Xie and Beni 1991; Wang and Zhang 2007) or by conducting cross-validation.