1 Introduction

With the advent of the big data era, the scale of information has expanded significantly. Especially in e-commerce, users must devote more time and energy to finding valuable information, which makes the information overload problem [26] prominent. The recommendation system [23] is the main information filtering technology and has been widely used in online business fields such as video, music, e-commerce, advertising, and news. Recommendation algorithms are typically divided into three categories: collaborative filtering, content-based filtering, and hybrids of the two. Collaborative filtering (CF) [17, 35] analyzes user preferences based on explicit data and helps users obtain the content they are interested in quickly and accurately, thereby meeting personalized needs. In this research, the item-based CF algorithm is chosen for further improvement because it is simple and efficient. Its recommendation process mainly involves two phases. The first phase searches for the nearest neighbors of the target item according to similarity values. The second phase predicts user preferences based on the rating information of those nearest neighbors. Therefore, the design of the similarity and prediction formulas has a crucial impact on recommendation results. However, as the number of users and items increases, the cold start and data sparsity problems become severe, and both the quality of the selected neighbors and the accuracy of predictions are seriously affected.

For the similarity computation stage, several representative similarity measures have been designed, including Sigmoid and Pearson Correlation Coefficient (SPCC) [16], Adjusted Cosine Similarity (ACOS) [25], Jaccard and Mean Squared Differences (JMSD) [4], New Heuristic Similarity Model (NHSM) [15], and Our developed Similarity (OS) [13]. They rely only on co-rated cases and can return recommendation results quickly. However, in extremely sparse data, the accuracy of their similarity results is seriously affected. To address the limitation of co-rated cases, some state-of-the-art similarity measures, such as Bhattacharyya Coefficient in CF (BCF) [21] and KL-based Similarity Measure (KLCF) [28], have been presented based on the probability distribution of ratings. They utilize all rating information to generate more accurate similarity results, which alleviates the problems of sparse data and cold start. However, these two measures add the corresponding item similarity when calculating user similarity, which involves massive Cartesian-product calculations and lowers computational efficiency.

In conventional item-based prediction methods, if a given user has rated the nearest neighbors of the target item, this information is aggregated to predict unknown ratings. Otherwise, these potentially valuable contributions are filtered out, which may in turn aggravate the problem of sparse data. To address this problem, some advanced prediction methods, such as a Recursive Prediction Algorithm (RPA) [36] and an Iterative Rating Prediction algorithm (IRP) [39], have been presented. They enable more ratings to participate in the prediction process and thus improve the accuracy of prediction. However, their parameters have to be tuned to find the optimal values.

Based on the above analysis, cold start [6, 31] and sparsity issues [20, 38] seriously affect both the quality of the selected neighbors and the accuracy of predictions. To alleviate these problems, we propose a new item-based CF model in this paper, which consists of an improved similarity measure and an improved prediction method. Our contributions are summarized as follows:

  • The KL divergence based on vague sets is proposed from the perspective of user preference probability to measure item similarity. It relaxes the constraint of co-rated cases and is flexible in sparse data. Meanwhile, only item similarity needs to be calculated, which yields high computational efficiency.

  • The weight factor \(\alpha\) is defined to emphasize the number of ratings in the proposed similarity measure. It can further improve the accuracy of similarity results in extremely sparse data.

  • An item-based prediction method with a new neighbor selection strategy is presented. It allows more neighbors to participate in the prediction process, which effectively enhances the accuracy of prediction. Moreover, it does not involve parameter selection and is easy to implement.

The rest of this study is organized as follows. Related work on existing item-based similarity measures and prediction methods is briefly introduced in Sect. 2. Then, a new item-based CF algorithm is proposed in detail in Sect. 3. Experimental results on benchmark data sets are described and discussed in Sect. 4. Finally, the conclusion and future work are given in Sect. 5.

2 Related Work

In this section, we briefly review the similarity and prediction work related to the methods of this study in Sects. 2.1 and 2.2, respectively.

2.1 Item-Based Similarity Measures

In general, traditional similarity measures are divided into three categories, namely numerical, structural, and hybrid similarity. Numerical similarity calculates the similarity between users or items according to rating values. Cosine Similarity (COS) [7] takes the historical ratings on each item as a vector and expresses the similarity as the cosine of the angle between two vectors. The formula is defined in Eq. (1). Pearson Correlation Coefficient (PCC) [9] measures the linear relationship between items over their co-rated users, as shown in Eq. (2). Besides, Mean Squared Difference (MSD) [24] is computed from the average squared deviation between the ratings of any two items. The formula is expressed in Eq. (3):

$$\begin{aligned}&{{\mathrm{sim}(i,j)^\mathrm{COS}=\frac{\sum \limits _{u\in U_i\cap U_j}r_{ui}\cdot r_{uj}}{\sqrt{\sum \limits _{u\in U_i\cap U_j}r_{ui}^{2}}\cdot \sqrt{\sum \limits _{u\in U_i\cap U_j}r_{uj}^{2}}}}} \end{aligned}$$
(1)
$$\begin{aligned}&{{\mathrm{sim}(i,j)^\mathrm{PCC}=\frac{\sum \limits _{u\in U_i\cap U_j}(r_{ui}-\bar{r_{i}})\cdot (r_{uj}-\bar{r_{j}})}{\sqrt{\sum \limits _{u\in U_i\cap U_j}(r_{ui}-\bar{r_{i}})^{2}}\cdot \sqrt{\sum \limits _{u\in U_i\cap U_j}(r_{uj}-\bar{r_{j}})^{2}}}}} \end{aligned}$$
(2)
$$\begin{aligned}&{{\mathrm{sim}(i,j)^{\mathrm{MSD}}=1-\frac{\sum \limits _{u\in U_i\cap U_j}(r_{ui}-r_{uj})^{2}}{\left| U_{i}\cap U_{j} \right| } }} \end{aligned}$$
(3)

where \(\bar{r_{i}}\) is the average rating of item i, \(r_{ui}\) is the rating that user u gives to item i, and \(U_{i}\) represents the set of users who have rated item i.
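To make Eqs. (1)–(3) concrete, the following is a minimal sketch that evaluates the three numerical measures over the co-rated users of two items, pairing each user's ratings \(r_{ui}\) and \(r_{uj}\). The toy ratings, the function names, and the optional `scale` argument (which divides rating differences by the rating range so that MSD stays in [0, 1]) are illustrative assumptions and are not taken from the cited works; with `scale=1.0` the function follows Eq. (3) literally.

```python
import math

# Hypothetical toy data: ratings[item][user] = rating on a 1-5 scale.
ratings = {
    "i": {"u1": 5, "u2": 3, "u3": 4},
    "j": {"u1": 4, "u2": 2, "u4": 5},
}

def co_rated(i, j):
    """Users who rated both items, i.e. the intersection of U_i and U_j."""
    return sorted(set(ratings[i]) & set(ratings[j]))

def item_mean(i):
    """Average rating of item i over all users who rated it."""
    return sum(ratings[i].values()) / len(ratings[i])

def sim_cos(i, j):
    """Eq. (1): cosine of the two rating vectors restricted to co-rated users."""
    users = co_rated(i, j)
    num = sum(ratings[i][u] * ratings[j][u] for u in users)
    den = (math.sqrt(sum(ratings[i][u] ** 2 for u in users))
           * math.sqrt(sum(ratings[j][u] ** 2 for u in users)))
    return num / den if den else 0.0

def sim_pcc(i, j):
    """Eq. (2): Pearson correlation over co-rated users, centred on item means."""
    users = co_rated(i, j)
    mi, mj = item_mean(i), item_mean(j)
    num = sum((ratings[i][u] - mi) * (ratings[j][u] - mj) for u in users)
    den = (math.sqrt(sum((ratings[i][u] - mi) ** 2 for u in users))
           * math.sqrt(sum((ratings[j][u] - mj) ** 2 for u in users)))
    return num / den if den else 0.0

def sim_msd(i, j, scale=1.0):
    """Eq. (3); `scale` is an assumed normalisation of rating differences."""
    users = co_rated(i, j)
    if not users:
        return 0.0  # assumed fallback: no co-rated users, no similarity
    msd = sum(((ratings[i][u] - ratings[j][u]) / scale) ** 2 for u in users)
    return 1.0 - msd / len(users)
```

All three functions return 0.0 when there are no co-rated users (or a zero denominator), which is a convention of this sketch; it also illustrates why these measures break down on sparse data.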

Unlike numerical similarity, structural similarity focuses on whether rating values exist or not. As one of the classic structural similarities, Jaccard [18] only emphasizes the proportion of co-rated cases. The formula is shown in Eq. (4):

$$\begin{aligned} {{\mathrm{sim}(i,j)^\mathrm{Jaccard}=\frac{\left| U_{i}\cap U_{j} \right| }{\left| U_{i}\cup U_{j} \right| } }} \end{aligned}$$
(4)

Hybrid similarity considers both the numerical and the structural information of user ratings. It can make up for the shortcomings of the first two types of similarity and effectively enhance the accuracy of similarity calculation. JMSD combines Jaccard and MSD, and SPCC integrates the Sigmoid function with PCC; both achieve better prediction quality. Their expressions are given in Eqs. (5) and (6), respectively:

$$\begin{aligned}&{{\mathrm{sim}(i,j)^{\mathrm{JMSD}}=\mathrm{sim}(i,j)^\mathrm{Jaccard}\cdot \mathrm{sim}(i,j)^{\mathrm{MSD}} }} \end{aligned}$$
(5)
$$\begin{aligned}&{{\mathrm{sim}(i,j)^\mathrm{SPCC}=\mathrm{sim}(i,j)^\mathrm{PCC}\cdot \frac{1}{1+\exp \left( -\frac{\left| U_{i}\cap U_{j} \right| }{2}\right) }}} \end{aligned}$$
(6)
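Eqs. (4)–(6) can be sketched as follows. The function names are illustrative assumptions; the MSD and PCC values are passed in as plain numbers so that the block stays self-contained.

```python
import math

def sim_jaccard(users_i, users_j):
    """Eq. (4): share of co-rated users among all users who rated either item."""
    users_i, users_j = set(users_i), set(users_j)
    union = users_i | users_j
    return len(users_i & users_j) / len(union) if union else 0.0

def sim_jmsd(users_i, users_j, msd_similarity):
    """Eq. (5): Jaccard multiplied by an MSD similarity computed elsewhere."""
    return sim_jaccard(users_i, users_j) * msd_similarity

def sim_spcc(users_i, users_j, pcc_similarity):
    """Eq. (6): PCC damped by a sigmoid of half the number of co-rated users."""
    n_common = len(set(users_i) & set(users_j))
    return pcc_similarity / (1.0 + math.exp(-n_common / 2.0))

# Hypothetical example: two items sharing two of four distinct raters.
U_i = {"u1", "u2", "u3"}
U_j = {"u1", "u2", "u4"}
jaccard = sim_jaccard(U_i, U_j)    # 2 co-rated users out of 4
jmsd = sim_jmsd(U_i, U_j, 0.9375)  # assumed MSD similarity
spcc = sim_spcc(U_i, U_j, 0.8)     # assumed PCC value
```

The sigmoid factor approaches 1 as the number of co-rated users grows, so SPCC trusts PCC more when it is supported by more common ratings.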

However, these typical similarity measures all rely heavily on co-rated users. With the rapid expansion of users and items, the data set becomes extremely sparse, so the similarity results become unreasonable or cannot be computed at all.

At present, some similarity measures based on heuristic strategies have been proposed. Ahn et al. [1] presented a similarity measure named PIP using domain-specific meaning, which consists of the product of three components (the proximity, the impact, and the popularity). To simplify and normalize the PIP model, Liu et al. [15] presented NHSM, combining the proportion of co-rated cases and user rating preferences. These measures can better distinguish similarity values between users or items even in sparse data. However, they are only effective when users have co-rated cases. To address this constraint, from the perspective of probability distribution, a linear similarity BCF [21] based on the Bhattacharyya coefficient and a hybrid similarity KLCF [28] based on KL divergence were presented. These two schemes can apply all ratings given by a pair of users to calculate similarity values. In addition, Wang et al. [30] introduced \(\alpha\)-divergence to calculate item similarity, and further added the influence of the proportion of co-rated cases. To achieve simple and efficient recommendation, Bag et al. [3] proposed a new structural similarity named RJaccard, which not only considers co-rated cases, but also emphasizes the importance of the number of non-co-rated cases. Similarly, Gazdar et al. [13] proposed a new similarity measure OS, which utilizes the proportion of non-common ratings and the relative absolute rating difference. Wang et al. [29] proposed a general framework that combines similarity measures to improve prediction accuracy with less memory and time. To sum up, many similarity measures have been presented from different perspectives to enhance the accuracy of similarity results. However, there is still a need for further research.

2.2 Item-Based Prediction Methods

The two classic prediction methods, Weighted Average (WA) [5, 14] and Mean Centering (MC) [32], have been widely applied to generate predictions. WA takes into account the similarity values between items and the ratings given to the nearest neighbors. MC further adds the average ratings of items to adjust the prediction results. In both methods, the number of neighbors has a significant influence on the prediction results. Their formulas are expressed as follows:

$$\begin{aligned}&p_{ui}=\frac{\sum \limits _{j\in N(i)}\mathrm{sim}(i,j)\cdot r_{uj}}{\sum \limits _{j\in N(i)}\left| \mathrm{sim}(i,j) \right| } \end{aligned}$$
(7)
$$\begin{aligned}&p_{ui}=\bar{r_{i}}+\frac{\sum \limits _{j\in N(i)}\mathrm{sim}(i,j)\cdot (r_{uj}-\bar{r_{j}})}{\sum \limits _{j\in N(i)}\left| \mathrm{sim}(i,j) \right| } \end{aligned}$$
(8)

where \(p_{ui}\) is the predicted rating obtained from the prediction formula. The similarity value between i and j is \(\mathrm{sim}(i,j)\).
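Eqs. (7) and (8) can be sketched as follows. The tuple layout of the neighbor lists and the choice to return `None` when no neighbor contributes are assumptions of this sketch, not part of the cited methods.

```python
def predict_wa(neighbors):
    """Eq. (7). `neighbors` is a list of (sim_ij, r_uj) pairs for the items j
    in N(i) that user u has rated."""
    den = sum(abs(s) for s, _ in neighbors)
    if den == 0:
        return None  # assumed convention: no usable neighbor, no prediction
    return sum(s * r for s, r in neighbors) / den

def predict_mc(mean_i, neighbors):
    """Eq. (8). `neighbors` is a list of (sim_ij, r_uj, mean_j) triples."""
    den = sum(abs(s) for s, _, _ in neighbors)
    if den == 0:
        return None
    return mean_i + sum(s * (r - m) for s, r, m in neighbors) / den

# Hypothetical example with two neighbors of the target item.
p_wa = predict_wa([(0.5, 4), (0.5, 2)])
p_mc = predict_mc(3.0, [(0.8, 5, 4.0), (0.2, 3, 3.0)])
```

Both functions only see neighbors that the user has actually rated, which is the restriction that the methods discussed next try to relax.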

In addition to the fundamental prediction methods, many researchers have put forward improvement strategies from different perspectives. Singh et al. [27] presented a predictive approach called Z-Score, which accounts for rating differences by converting the ratings to z-scores and calculating the weighted average of the z-scores. A new User Rating Prediction (URP) [19] algorithm was proposed to predict ratings for items, which assumes that similar users may be interested in similar items. Alhijawi et al. [2] introduced a new adaptable prediction approach (INH-BP). It customizes predictors for each active user to suit the user environment, thereby improving the accuracy of prediction. However, in the case of sparse data, these conventional prediction methods use only those neighbor items that have already been rated by the given user, leading to poor prediction performance. To ease this problem, RPA [36] has been proposed, which enables a larger proportion of nearest neighbors to be included in the prediction process. Recently, based on RPA, IRP [39] has been further proposed. It exploits ratings given by direct and indirect neighbors to iteratively update the predicted rating matrix. However, it requires a proper iteration parameter to reach a stable state, which leads to a long running time. In summary, further work on item-based prediction methods is still worthwhile to improve the accuracy of predictions in sparse data.

3 The Proposed Model

In this section, we construct a CF model (ISP) that fuses a new item similarity measure with a new prediction method. First, the KL divergence based on vague sets is introduced to measure item similarity. Then, the impact of rating quantity is further considered in our similarity measure. In addition, we propose an improved prediction method with a new neighbor selection strategy.

3.1 The Item Similarity Measure Based on Vague Sets

3.1.1 Definitions of Fuzzy Sets and Vague Sets on Recommendation Systems

When people make judgments or decisions on vague and complex objective issues, they are often uncertain. The theory of fuzzy sets [33] was first put forward to describe the information of uncertain decision-making. Likewise, in recommendation systems, a user's decision-making on items is often vague and has no sharp class boundaries. Therefore, based on this association between fuzzy sets and recommendation systems, we first define a fuzzy set A, which represents the overall preference of all users who have rated the given item i.

Definition 1

(Fuzzy sets) Let X denote the universe of discourse and represent the set of all users who have rated item i. Then, the fuzzy set A in X is defined as \(A=\left\{ \left( x,\mu _{\scriptscriptstyle A}\left( x \right) \right) :x\in X \right\}\).

For \(x\in X\), the membership grade \(\mu _{\scriptscriptstyle A}\left( x \right)\) is judged by the defined conditions as follows:

$$\begin{aligned} \mu _{\scriptscriptstyle A}\left( x \right) =\left\{ \begin{matrix} 1,\quad if\ r_{xi}> \bar{r_{x}} &{} \\ 0,\quad if\ r_{xi}\le \bar{r_{x}} &{} \end{matrix}\right. \end{aligned}$$
(9)

where \(r_{xi}\) is the rating that user x gives to item i, and \(\bar{r_{x}}\) is the average rating given by user x. According to the relationship between \(r_{xi}\) and \(\bar{r_{x}}\), user preferences are divided into two categories. Specifically, \(r_{xi}> \bar{r_{x}}\) means that user x likes item i to a certain extent. Otherwise, user x is taken to dislike item i.

However, in real recommendation systems, each user is affected by the surrounding environment, so users who rate the same item affect each other. In other words, although supporters and opponents hold different preferences (like or dislike) for the same item, they influence each other in the process of propagation, which produces a certain degree of hesitation in user preferences. The specific process is as follows. On the one hand, the preference of supporters affects opponents, which we represent as \(\mu _{\scriptscriptstyle A}\left( x \right) \cdot \left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right]\). On the other hand, the preference of opponents also affects supporters, which we denote as \(\left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] \cdot \mu _{\scriptscriptstyle A}\left( x \right)\). This mutual influence means that users who rate the same item are affected when making decisions and show a certain degree of hesitation. The degree of hesitation is the sum of the above two values, \(2\cdot \mu _{\scriptscriptstyle A}\left( x \right) \cdot \left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right]\).

To stress the effect of hesitation in decision-making, the defined fuzzy set A is extended to a vague set according to the concept of influence propagation [12] among users described above. As an extension and improvement of fuzzy set theory, vague sets [34] have the merit that they capture the actions of like, dislike, and hesitation together.

Definition 2

(Vague sets) The vague set \(\tilde{A}\) in X is given by \(\tilde{A}=\left\{ \left( x,t_{\tilde{A}}\left( x \right) ,1-f_{\tilde{A}}\left( x \right) \right) :x\in X \right\}\), with the condition \(t_{\tilde{A}}\left( x \right) +f_{\tilde{A}}\left( x \right) +\pi _{\tilde{A}}\left( x \right) =1\), where \(t_{\tilde{A}}\left( x \right)\) is the degree of membership, \(f_{\tilde{A}}\left( x \right)\) is the degree of non-membership, and \(\pi _{\tilde{A}}\left( x \right)\) represents the degree of hesitation.

According to the degree of hesitation caused by influence propagation analyzed above and the condition \(t_{\tilde{A}}\left( x \right) +f_{\tilde{A}}\left( x \right) +\pi _{\tilde{A}}\left( x \right) =1\) on vague set \(\tilde{A}\), we can infer that the degree of membership \(t_{\tilde{A}}\left( x \right)\) is obtained from the degree of membership \(\mu _{\scriptscriptstyle A}\left( x \right)\) on the defined fuzzy set by removing the preference influence of opponents on supporters. The formula is expressed as \(t_{\tilde{A}}\left( x \right) =\mu _{\scriptscriptstyle A}\left( x \right) - \left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] \cdot \mu _{\scriptscriptstyle A}\left( x \right) =\mu _{\scriptscriptstyle A}^{2}\left( x \right)\). Similarly, the degree of non-membership \(f_{\tilde{A}}\left( x \right)\) is obtained from the degree of non-membership \(1-\mu _{\scriptscriptstyle A}\left( x \right)\) on the defined fuzzy set by removing the preference influence of supporters on opponents. That is, \(f_{\tilde{A}}\left( x \right) =\left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] -\mu _{\scriptscriptstyle A}\left( x \right) \cdot \left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] =\left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] ^{2}\). In addition, it should be noted from the properties of vague sets that the interval centre contains important information. The interval centre is the midpoint of the interval \(\left[ t_{\tilde{A}}\left( x \right) ,1-f_{\tilde{A}}\left( x \right) \right]\), that is, \(\varphi _{\tilde{A}}\left( x \right) =\frac{t_{\tilde{A}}\left( x \right) +\left[ 1-f_{\tilde{A}}\left( x \right) \right] }{2}\).
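As a quick consistency check (a worked step added for illustration, using only the expressions for membership, non-membership, and hesitation given above), the three degrees sum to one for any membership grade, as the vague-set condition requires:

$$\begin{aligned} t_{\tilde{A}}\left( x \right) +f_{\tilde{A}}\left( x \right) +\pi _{\tilde{A}}\left( x \right)&=\mu _{\scriptscriptstyle A}^{2}\left( x \right) +\left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] ^{2}+2\cdot \mu _{\scriptscriptstyle A}\left( x \right) \cdot \left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] \\&=\left( \mu _{\scriptscriptstyle A}\left( x \right) +\left[ 1-\mu _{\scriptscriptstyle A}\left( x \right) \right] \right) ^{2}=1 \end{aligned}$$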

3.1.2 KL Divergence Between Vague Sets

Having defined fuzzy sets and vague sets on recommendation systems, we first compute the user preference probabilities on the fuzzy set and the vague set, respectively. Then, KL divergence is introduced to estimate the difference between vague sets.

According to the membership grade \(\mu _{\scriptscriptstyle A}\left( x \right)\) of each user on fuzzy set A, we first calculate \(p_{\scriptscriptstyle A,l}\) and \(p_{\scriptscriptstyle A,d}\), which indicate the probability that these users have optimistic attitude on item i and the probability that these users have pessimistic attitude on item i, respectively:

$$\begin{aligned} \left\{ \begin{array}{l} p_{\scriptscriptstyle A,l}=\frac{\sum \limits _{x\in X}\mu _{\scriptscriptstyle A}\left( x \right) }{N\left( X \right) }\\ p_{\scriptscriptstyle A,d}=1-p_{\scriptscriptstyle A,l}\\ \end{array}\right. \end{aligned}$$
(10)

where \(N\left( X \right)\) represents the number of users in the universe X.

Based on the user preference probabilities on fuzzy set A, we obtain \(p_{\tilde{A},{l}'}\), \(p_{\tilde{A},{d}'}\), \(p_{\tilde{A},h}\) and \(p_{\tilde{A},m}\), which denote the probability that these users support item i, the probability that these users oppose item i, the probability of hesitation, and the probability of the interval centre on vague set \(\tilde{A}\), respectively:

$$\begin{aligned} \left\{ \begin{array}{l} p_{\tilde{\scriptscriptstyle A},{l}'}=p_{\scriptscriptstyle A,l}-p_{\scriptscriptstyle A,l}\cdot p_{\scriptscriptstyle A,d}\\ p_{\tilde{\scriptscriptstyle A},{d}'}=p_{\scriptscriptstyle A,d}-p_{\scriptscriptstyle A,l}\cdot p_{\scriptscriptstyle A,d}\\ p_{\tilde{\scriptscriptstyle A},h}=2\cdot p_{\scriptscriptstyle A,l}\cdot p_{\scriptscriptstyle A,d}\\ p_{\tilde{A},m}=\frac{p_{\tilde{A},{l}'}+\left[ 1- p_{\tilde{A},{d}'}\right] }{2} \end{array}\right. \end{aligned}$$
(11)
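Eqs. (9)–(11) can be sketched as follows. The function names, the dictionary layout, and the toy ratings are illustrative assumptions.

```python
def fuzzy_probs(ratings_on_item, user_means):
    """Eqs. (9)-(10). `ratings_on_item` maps each user x in X to r_xi and
    `user_means` maps x to that user's average rating."""
    # Eq. (9): membership is 1 when the rating exceeds the user's own mean.
    likes = sum(1 for x, r in ratings_on_item.items() if r > user_means[x])
    p_l = likes / len(ratings_on_item)  # Eq. (10), N(X) = number of raters
    return p_l, 1.0 - p_l

def vague_probs(p_l, p_d):
    """Eq. (11): support, opposition, hesitation, and interval centre."""
    support = p_l - p_l * p_d
    oppose = p_d - p_l * p_d
    hesitation = 2.0 * p_l * p_d
    centre = (support + (1.0 - oppose)) / 2.0
    return {"l'": support, "d'": oppose, "h": hesitation, "m": centre}

# Hypothetical item rated by four users, two of them above their own mean.
item_ratings = {"u1": 5, "u2": 2, "u3": 4, "u4": 3}
means = {"u1": 4.0, "u2": 3.0, "u3": 3.5, "u4": 3.0}
p_l, p_d = fuzzy_probs(item_ratings, means)
vague = vague_probs(p_l, p_d)
```

In this example the support, opposition, and hesitation probabilities sum to one, mirroring the vague-set condition at the level of the whole item.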

We aim to construct a well-suited similarity measure from the perspective of user preference probability to calculate the similarity value between vague sets, that is, the similarity value between the corresponding items. Divergence is an important concept in probability and statistics, and divergence theory [10, 22] is a recognized tool for measuring differences between pieces of information. Therefore, the KL divergence is utilized to evaluate the distance between any two vague sets (\(\tilde{A}\) and \(\tilde{B}\)). The KL distance is expressed in Eq. (12). It removes the constraint of co-rated cases and can calculate the similarity between any two items by making full use of the rating information. Therefore, it adapts well to sparse data:

$$\begin{aligned} \mathrm{KL}(\tilde{A}\Vert \tilde{B}) =\sum \limits _{{\mathrm{pre}}'\in \left\{ {l}',{d}',h,m \right\} }p_{\tilde{\scriptscriptstyle A},{\mathrm{pre}}'}\cdot \log _2\frac{p_{\tilde{\scriptscriptstyle A},{\mathrm{pre}}'}}{p_{\tilde{\scriptscriptstyle B},{\mathrm{pre}}'}} \end{aligned}$$
(12)

where the set of user preferences and interval centre on vague set is \(\left\{ {l}',{d}',h,m \right\}\), and \({\mathrm{pre}}'\) represents any element in this set.

However, it is worth noting that when \(p_{\tilde{\scriptscriptstyle A}, \mathrm{pre}'}=0\) or \(p_{\tilde{\scriptscriptstyle B},\mathrm{pre}'}=0\), Eq. (12) is not well defined. To remedy this, we smooth the user preference probability \({p_{{\mathrm{pre}'}}}\) to obtain a new user preference probability \(\hat{p}_{\mathrm{pre}'}\), as shown in Eq. (13):

$$\begin{aligned} \hat{p}_{\mathrm{pre}'}=\frac{\delta +p_{\mathrm{pre}'}}{1+\delta \cdot \left| K \right| } \end{aligned}$$
(13)

where \(\delta\) is the smoothing factor with \(0< \delta < 1\), and \(\left| K \right|\) represents the size of the set to which \(\mathrm{pre}'\) belongs. The error between \(p_{\mathrm{pre}'}\) and \(\hat{p}_{\mathrm{pre}'}\) after smoothing is derived as follows:

$$\begin{aligned} \mathrm{error}&=\left| \hat{p}_{\mathrm{pre}'}-p_{\mathrm{pre}'} \right| \nonumber \\&=\left| \frac{\delta +p_{\mathrm{pre}'}}{1+\delta \cdot \left| K \right| }-p_{\mathrm{pre}'} \right| \nonumber \\&=\left| \frac{\delta -p_{\mathrm{pre}'}\cdot \delta \cdot \left| K \right| }{1+\delta \cdot \left| K \right| } \right| \nonumber \\&=\left| \frac{1-p_{\mathrm{pre}'}\cdot \left| K \right| }{1/\delta +\left| K \right| } \right| \end{aligned}$$
(14)

According to the above derivation, if \(\delta \rightarrow 0\), then \(\mathrm{error} \rightarrow 0\). Thus, \(\hat{p}_{\mathrm{pre}'}\) is a reasonable substitute for \(p_{\mathrm{pre}'}\) when \(\delta\) is set to a small value. Here, we set \(\delta =0.000009\).

Finally, KL distance after the process of smoothing can be rewritten as follows:

$$\begin{aligned} \mathrm{KL}(\tilde{A}\Vert \tilde{B}) =\sum \limits _{\mathrm{pre}'\in \left\{ l',d',h,m \right\} }\hat{p}_{\tilde{\scriptscriptstyle A},\mathrm{pre}'}\cdot \log _2\frac{\hat{p}_{\tilde{\scriptscriptstyle A},\mathrm{pre}'}}{\hat{p}_{\tilde{\scriptscriptstyle B},\mathrm{pre}'}} \end{aligned}$$
(15)
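The smoothing step of Eq. (13) and the smoothed KL distance of Eq. (15) can be sketched as follows. The dictionary layout and function names are illustrative assumptions; \(\delta\) takes the value stated above, and \(\left| K \right|\) is taken to be the number of entries in the preference set, here four.

```python
import math

DELTA = 0.000009  # smoothing factor stated in the text

def smooth(p, delta=DELTA):
    """Eq. (13): add delta to every entry and rescale by 1 + delta * |K|."""
    k = len(p)
    return {key: (delta + value) / (1.0 + delta * k) for key, value in p.items()}

def kl_distance(p_a, p_b, delta=DELTA):
    """Eq. (15): KL distance between two smoothed preference vectors."""
    a, b = smooth(p_a, delta), smooth(p_b, delta)
    return sum(a[key] * math.log2(a[key] / b[key]) for key in a)

# Hypothetical preference vectors in the layout of Eq. (11).
# Item A is liked by every rater, so two of its entries are exactly zero.
p_a = {"l'": 1.0, "d'": 0.0, "h": 0.0, "m": 1.0}
p_b = {"l'": 0.25, "d'": 0.25, "h": 0.5, "m": 0.5}
d_ab = kl_distance(p_a, p_b)
d_ba = kl_distance(p_b, p_a)
max_shift = max(abs(smooth(p_b)[key] - p_b[key]) for key in p_b)
```

Without smoothing, `d_ba` would divide by zero; with it, both directions are finite, and each probability moves by less than \(\delta\) in this example.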

3.1.3 The Impact of Rating Quantity

In the KL divergence based on vague sets described above, only probability values, that is, counts reduced to fractions, are involved in the calculation, so the information about rating quantity is partly lost. Especially in the case of extremely sparse data, this can easily cause large errors in similarity results. In general, the more users have rated a given item, the more reliable the probability information we can obtain. To make similarity results more credible, the effect of the number of ratings needs to be considered when calculating KL divergence. Thus, we define a weight factor \(\alpha\) for vague set \(\tilde{A}\) as the ratio of the size of the universe of vague set \(\tilde{A}\) to the sum of the sizes of the universes of vague sets \(\tilde{A}\) and \(\tilde{B}\). Correspondingly, the weight of vague set \(\tilde{B}\) is \(1-\alpha\):

$$\begin{aligned} \alpha =\frac{N(\tilde{A})}{N(\tilde{A})+N(\tilde{B})} \end{aligned}$$
(16)

where \(\alpha\) varies dynamically with the size of the universe of each vague set. It reflects the difference in the number of ratings between vague sets.

Therefore, the distance formula after incorporating the number of ratings is expressed in Eq. (17), which avoids the loss of rating quantity information caused by reducing counts to fractions in the calculation of KL divergence and improves the accuracy of similarity values in sparse data:

$$\begin{aligned} \mathrm{KL}(\tilde{A}\Vert \tilde{B}) =\sum \limits _{\mathrm{pre}'\in \left\{ l',d',h,m \right\} }\alpha \cdot \hat{p}_{\tilde{\scriptscriptstyle A},\mathrm{pre}'}\cdot \log _2\frac{\alpha \cdot \hat{p}_{\tilde{\scriptscriptstyle A},\mathrm{pre}'}}{\left( 1-\alpha \right) \cdot \hat{p}_{\tilde{\scriptscriptstyle B},\mathrm{pre}'}} \end{aligned}$$
(17)

On the basis of the dual concept of the distance and similarity measure, we measure the similarity of items i and j by calculating KL distance between corresponding vague sets \(\tilde{A}\) and \(\tilde{B}\). The formula is indicated as below:

$$\begin{aligned} \mathrm{sim}\left( i,j \right) =\frac{1}{1+\mathrm{KL}(\tilde{A}\Vert \tilde{B})} \end{aligned}$$
(18)
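Eqs. (16)–(18) can be sketched as follows. The function names and inputs are illustrative assumptions; the preference vectors are assumed to be already smoothed (strictly positive), and `n_a`, `n_b` stand for the numbers of users who rated the two items.

```python
import math

def weighted_kl(p_a, p_b, n_a, n_b):
    """Eqs. (16)-(17): KL-style distance with the rating-count weight alpha.
    p_a and p_b are assumed to be smoothed, strictly positive vectors."""
    alpha = n_a / (n_a + n_b)  # Eq. (16)
    return sum(
        alpha * p_a[key] * math.log2(alpha * p_a[key] / ((1.0 - alpha) * p_b[key]))
        for key in p_a
    )

def item_similarity(p_a, p_b, n_a, n_b):
    """Eq. (18): map the distance to a similarity value."""
    return 1.0 / (1.0 + weighted_kl(p_a, p_b, n_a, n_b))

# Hypothetical example: identical preference vectors, different rating counts.
p = {"l'": 0.25, "d'": 0.25, "h": 0.5, "m": 0.5}
sim_equal_counts = item_similarity(p, p, 20, 20)
sim_unequal_counts = item_similarity(p, p, 30, 10)
```

With equal rating counts the two identical vectors give a similarity of 1, whereas a 30-to-10 imbalance lowers it to roughly 0.36, which shows how \(\alpha\) lets the number of ratings influence the result even when the preference probabilities coincide.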

3.2 Item-Based Prediction Method with the New Neighbor Selection Strategy

In traditional item-based prediction methods, the prediction formulas strictly require that neighbors be available. To generate a prediction, \(r_{uj}\) needs to be explicitly given, that is, the given user must have rated the nearest neighbors (items). Due to sparse data, a considerable number of neighbors do not meet this constraint. Even though these neighbors are very similar to the target item, their information is not used for prediction. This phenomenon decreases the accuracy of prediction.

Based on the above analysis, we assume that these neighbors may still make a valuable contribution to the prediction process even though the given user has not explicitly rated them. Traditional prediction methods ignore this information, but our algorithm uses it to deal with sparse data more effectively. Inspired by this assumption, we propose an item-based prediction method with a new neighbor selection strategy. The detailed implementation process is as follows. We divide the items in the nearest neighbor set (N) of the target item i into two subsets (N1 and N2) according to whether the given user u has rated them. First, items that u has rated are placed in subset N1; for these items the rating \(r_{uj}\) exists and participates directly in the prediction process. Second, items that u has not rated are placed in subset N2; they cannot be directly involved in the predictive calculation. For this case, the approximate substitution principle is introduced so that these neighbors participate in the prediction process indirectly. For any item j in subset N2, we search the set of users who have rated item j for the user v who is most similar to the given user u. Then, the missing rating \(r_{uj}\) is replaced with the rating \(r_{vj}\) as an estimate. However, \(r_{vj}\) is not as accurate as an explicitly given value \(r_{uj}\). Hence, \(\mathrm{sim}\left( u,v \right)\) is added as a weight factor to adjust for the error caused by this approximate substitution.

Formally, our proposed prediction method is defined in Eq. (19), which improves the ability of prediction in sparse data by integrating more rating information of neighbors:

$$\begin{aligned} p_{ui}=\bar{r}_{i}+\frac{\sum \nolimits _{j\in N1}\mathrm{sim}\left( i,j \right) \cdot \left( r_{uj}-\bar{r}_{j} \right) +\sum \nolimits _{j\in N2,{{v\in N\left( u \right) }}}\mathrm{sim}\left( u,v \right) \cdot \mathrm{sim}\left( i,j \right) \cdot \left( r_{vj}-\bar{r}_{j} \right) }{\sum \nolimits _{j\in N1}\mathrm{sim}\left( i,j \right) +\sum \nolimits _{j\in N2,{{v\in N\left( u \right) }}}\mathrm{sim}\left( u,v \right) \cdot \mathrm{sim}\left( i,j \right) } \end{aligned}$$
(19)

where \(N\left( u \right)\) represents the nearest neighbors of user u, and user v is the user in \(N\left( u \right)\) most similar to user u. \(\mathrm{sim}\left( u,v \right)\) is calculated on the same principle as \(\mathrm{sim}\left( i,j \right)\) in our proposed similarity measure, so the detailed implementation is omitted.
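Eq. (19) can be sketched as follows. The tuple layouts are illustrative assumptions: each N1 entry carries the user's own rating, and each N2 entry carries the rating of the substitute user v together with \(\mathrm{sim}\left( u,v \right)\). Falling back to the item mean when the denominator is zero is also an assumption of this sketch.

```python
def predict_isp(mean_i, n1, n2):
    """Eq. (19).
    n1: list of (sim_ij, r_uj, mean_j) for neighbors that user u has rated.
    n2: list of (sim_ij, sim_uv, r_vj, mean_j) for neighbors that u has not
        rated, where v is the substitute user chosen for item j.
    """
    num = sum(s_ij * (r_uj - m_j) for s_ij, r_uj, m_j in n1)
    num += sum(s_uv * s_ij * (r_vj - m_j) for s_ij, s_uv, r_vj, m_j in n2)
    den = sum(s_ij for s_ij, _, _ in n1)
    den += sum(s_uv * s_ij for s_ij, s_uv, _, _ in n2)
    if den == 0:
        return mean_i  # assumed fallback when no neighbor carries any weight
    return mean_i + num / den

# Hypothetical example: one rated neighbor and one substituted neighbor.
rated = [(0.8, 5, 4.0)]
substituted = [(0.5, 0.4, 2, 3.0)]
p_both = predict_isp(3.0, rated, substituted)
p_rated_only = predict_isp(3.0, rated, [])
```

With an empty N2 the function reduces to the mean-centering form, and with an empty N1 only the substituted ratings contribute.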

Two extreme cases of our proposed prediction method are worth pointing out. If a given user has rated every item among the nearest neighbors of item i, the prediction formula is equivalent to the traditional prediction formula MC. Conversely, if a given user has not rated any of the nearest neighbors of the target item, our proposed prediction formula reduces to the following:

$$\begin{aligned} p_{ui}=\bar{r}_{i}+\frac{\sum \nolimits _{j\in N\left( i \right) ,{{v\in N\left( u \right) }}}\mathrm{sim}\left( u,v \right) \cdot \mathrm{sim}\left( i,j \right) \cdot \left( r_{vj}-\bar{r}_{j} \right) }{\sum \nolimits _{j\in N\left( i \right) ,{{v\in N\left( u \right) }}}\mathrm{sim}\left( u,v \right) \cdot \mathrm{sim}\left( i,j \right) } \end{aligned}$$
(20)

Finally, the item-based CF model (ISP) in this paper is composed of the item similarity measure based on vague sets and prediction method with the new neighbor selection strategy. The detailed implementation process is presented in Algorithm 1.

figure a

4 Experiment

4.1 Experimental Setup

All experiments are carried out on the AI Studio development platform, with a quad-core processor, 32 GB of primary memory, Python 3.7, and PaddlePaddle 2.0.2.

4.2 Data Preparation

For our experiments, two commonly used benchmark data sets from MovieLens, namely MovieLens100K (ML-100K) and MovieLens1M (ML-1M), are employed. Their properties, including the number of users, number of items, number of ratings, and sparsity, are briefly described in Table 1. The experimental results of each algorithm are verified by fivefold cross-validation. Specifically, in each round we divide each data set into two parts: 80% of the rating data forms the training set, and the rest is regarded as the testing set.
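The fivefold protocol can be sketched as follows. The shuffling, the fixed seed, and the `(user, item, rating)` tuple layout are assumptions of this sketch, since the exact way the folds were drawn is not specified.

```python
import random

def five_fold_splits(ratings, seed=0):
    """Yield five (train, test) pairs; each test fold holds about 20% of the
    ratings and the remaining 80% form the training set."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to folds
    folds = [shuffled[k::5] for k in range(5)]
    for k in range(5):
        test = folds[k]
        train = [r for m, fold in enumerate(folds) if m != k for r in fold]
        yield train, test

# Hypothetical toy data: ten (user, item, rating) triples.
toy = [("u%d" % n, "i%d" % (n % 3), 1 + n % 5) for n in range(10)]
splits = list(five_fold_splits(toy))
```

Every rating appears in exactly one testing fold, so averaging a metric over the five rounds uses each rating once for evaluation.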

Table 1 Data sets properties

4.3 Evaluation Indicators

We evaluate the prediction accuracy [11, 29] based on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The two metrics compute the error from the difference between predicted values and actual values, and lower errors mean better prediction accuracy. The formulas of MAE and RMSE are shown in Eqs. (21) and (22), respectively:

$$\begin{aligned}&\mathrm{MAE}=\frac{1}{n}\sum \limits _{i=1}^{n}\left| r_{ui}-p_{ui} \right| \end{aligned}$$
(21)
$$\begin{aligned}&\quad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum \limits _{i=1}^{n}\left( r_{ui}-p_{ui} \right) ^{2}} \end{aligned}$$
(22)

where \(r_{ui}\) is the actual rating in the testing set, and n represents the number of prediction results.
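Eqs. (21) and (22) can be sketched as follows, assuming the actual and predicted ratings are given as two aligned lists (the function names and toy values are illustrative).

```python
import math

def mae(actual, predicted):
    """Eq. (21): mean absolute error over n predictions."""
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Eq. (22): root mean square error over n predictions."""
    return math.sqrt(
        sum((r - p) ** 2 for r, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical example with three test ratings.
actual = [4, 3, 5]
predicted = [3.5, 3.0, 4.0]
example_mae = mae(actual, predicted)
example_rmse = rmse(actual, predicted)
```

RMSE squares each error before averaging, so it penalises the single large miss in this example more heavily than MAE does.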

We assess the recommendation accuracy [3, 8] using the F1-value. As a comprehensive indicator, it integrates Precision and Recall. A higher F1-value indicates better recommendation quality. The expressions of Precision, Recall, and F1-value are as follows, respectively:

$$\begin{aligned}&\mathrm{Precision}=\frac{n\left( I_{pr}\cap I_{ar}\right) }{n\left( I_{pr}\right) } \end{aligned}$$
(23)
$$\begin{aligned}&\mathrm{Recall}=\frac{n\left( I_{pr}\cap I_{ar}\right) }{n\left( I_{ar}\right) } \end{aligned}$$
(24)
$$\begin{aligned}&\mathrm{F1\text{-}value}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \end{aligned}$$
(25)

where \(I_{ar}\) and \(I_{pr}\) are the real recommendation list and the predicted recommendation list, respectively, and \(n\left( \cdot \right)\) represents the length of the specified list. The recommendation rule adopted is that an item enters a list only if its rating is larger than the average rating given by the corresponding user.
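The sketch below illustrates Eqs. (23) to (25) together with the recommendation rule. It rests on several assumptions that the text does not spell out: the lists are pooled over all (user, item) pairs in the testing set, the user averages are supplied by the caller, and the zero-denominator cases return 0. The function name `f1_value` and the sample data are hypothetical.

```python
def f1_value(actual, predicted, user_mean):
    """Return (precision, recall, f1) following Eqs. (23)-(25).

    actual, predicted: dict mapping (user, item) -> rating.
    user_mean: dict mapping user -> that user's average rating.
    Pooling over all users and returning 0.0 for empty lists are assumptions.
    """
    # Real list I_ar: actual rating exceeds the user's average rating.
    i_ar = {key for key, r in actual.items() if r > user_mean[key[0]]}
    # Predicted list I_pr: predicted rating exceeds the user's average rating.
    i_pr = {key for key, p in predicted.items() if p > user_mean[key[0]]}
    hits = len(i_ar & i_pr)
    precision = hits / len(i_pr) if i_pr else 0.0
    recall = hits / len(i_ar) if i_ar else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# One hypothetical user whose average rating is 3.0.
user_mean = {"u1": 3.0}
actual = {("u1", "a"): 5, ("u1", "b"): 4, ("u1", "c"): 2, ("u1", "d"): 4}
predicted = {("u1", "a"): 4.2, ("u1", "b"): 2.8, ("u1", "c"): 3.5, ("u1", "d"): 3.6}

precision, recall, f1 = f1_value(actual, predicted, user_mean)
```

In this example the real list is {a, b, d} and the predicted list is {a, c, d}, so two items are shared and Precision, Recall and F1-value all equal 2/3.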

In addition to the above two types of evaluation indicators, we also report the rate of successful predictions (RSP) and the rate of perfect predictions (RPP) [37] to comprehensively measure the prediction ability of the algorithms. RSP emphasizes the ability of a CF algorithm to make valid predictions, regardless of whether these results are accurate. RPP is the rate at which the actual ratings in the testing set are predicted correctly, and thus stresses the ability to make correct predictions. Their formulas are defined as follows:

$$\begin{aligned} \hbox {RSP}= & {} \frac{N_{s}}{N_{t}} \end{aligned}$$
(26)
$$\begin{aligned} {{\hbox {RPP}}}= & {} \frac{N_{p}}{N_{t}} \end{aligned}$$
(27)

where \(N_{s}\) and \(N_{p}\) represent the numbers of successful predictions and correct predictions, respectively, and \(N_{t}\) is the number of ratings that need to be predicted.
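Eqs. (26) and (27) can be illustrated with the sketch below. Two points are assumptions made here rather than definitions taken from the text or from [37]: a failed prediction is marked with `None`, and a prediction counts as correct when it rounds to the actual integer rating. The function name `rsp_rpp` and the sample values are hypothetical.

```python
def rsp_rpp(actual, predicted):
    """Return (RSP, RPP) following Eqs. (26) and (27).

    actual: list of true integer ratings in the testing set.
    predicted: list of predictions; None marks a rating that the
    algorithm could not predict (an assumed convention).
    """
    n_t = len(actual)  # ratings that need to be predicted
    # Successful predictions: any prediction that was produced at all.
    n_s = sum(1 for p in predicted if p is not None)
    # Correct predictions: assumed to mean "rounds to the actual rating".
    n_p = sum(1 for r, p in zip(actual, predicted)
              if p is not None and round(p) == r)
    return n_s / n_t, n_p / n_t

# Five hypothetical testing ratings; the second one could not be predicted.
actual = [4, 3, 5, 2, 1]
predicted = [4.2, None, 3.9, 2.4, 1.3]

rsp, rpp = rsp_rpp(actual, predicted)  # N_s = 4, N_p = 3, N_t = 5
```

Here four of the five ratings receive a prediction, so RSP is 0.8, and three of them round to the actual rating, so RPP is 0.6. The example shows why the two rates are reported separately: the prediction 3.9 for an actual rating of 5 is successful but not correct.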

4.4 Experimental Results and Analysis

In this section, the experiments are divided into two parts to validate the performance of our proposed model ISP. The first part tests, step by step, the effectiveness of the similarity measure and the prediction method designed in the ISP algorithm. The second part compares and discusses the performance of ISP and other state-of-the-art algorithms on each indicator. All experiments are carried out on both the ML-100K and ML-1M data sets. The number of nearest neighbors, K, ranges from 10 to 100 in steps of 10.

Fig. 1 Performance results in ML-100K

4.4.1 The Effectiveness Test of the Proposed Model

Following the steps of the proposed model ISP, we first construct three algorithms, namely case 1, case 2 and case 3. Then, the effectiveness of each component of the ISP algorithm is analyzed step by step in experiments. The details of these three cases are as follows.

Fig. 2 Performance results in ML-1M

Case 1: according to the definition of a vague set, we first calculate and smooth the user preference probabilities, and then use these probability values in the KL divergence to measure the difference between vague sets, as shown in formula (15). Based on formula (18), we obtain the corresponding item similarity. The widely used MC is selected as the prediction method.

Case 2: building on the item similarity in case 1, the weight factor \(\alpha\) is defined to further consider the impact of rating quantity. It is incorporated into the KL divergence to obtain a new distance formula, as shown in (17). Then, we obtain the corresponding item similarity through formula (18). MC is still selected as the prediction method.

Case 3: the similarity measure in case 2 is still adopted to compute item similarity. The prediction method MC used above is replaced by the proposed prediction formula (19).

Here, MC is selected as the baseline against which our proposed prediction method is compared. The main reason is that MC performs more efficiently than other traditional prediction methods, as shown by the comparative study in [27]. Note that the case 3 algorithm is our proposed model ISP.

First, the case 1, case 2 and case 3 algorithms are all run on the ML-100K data set. The results of prediction accuracy are given in Fig. 1a, b, respectively. The prediction errors (MAE and RMSE) of all algorithms decline gradually as K increases. Compared with case 1, the case 2 algorithm achieves an approximate reduction of 1–5% in MAE and 3–8% in RMSE. The main reason is that the case 2 algorithm considers the rating quantity to avoid the information loss caused by fraction reduction, which helps to enhance the prediction performance. Compared with case 2, the MAE and RMSE of the case 3 algorithm decrease by nearly 2% across the different K values, and the advantage is more prominent when the number of neighbors is relatively small. When K exceeds 40, the prediction errors of case 3 gradually become steady, with an MAE of about 0.745 and an RMSE of about 0.958. The primary reason for the improvement is that the traditional prediction method MC in the case 2 algorithm is replaced by our proposed prediction method in the case 3 algorithm. When generating a prediction, the rating information of the nearest neighbors participates in adjusting the final prediction result, which further improves the accuracy of prediction.

Figure 1c shows the recommendation accuracy of the case 1, case 2 and case 3 algorithms. As the number of neighbors increases, the F1-value of each algorithm increases continuously. It is worth pointing out that the F1-value of case 1 varies over a noticeably wide range, while those of case 2 and case 3 are relatively stable as the number of neighbors varies. Compared with the case 1 algorithm, the F1-value of case 2 increases by 1.87% on average. Likewise, the F1-value of case 3 increases by an average of 4.11% compared with the case 2 algorithm. Therefore, we can infer that the case 3 algorithm has the best recommendation accuracy across the different K values, which supports the effectiveness of considering the number of ratings in the proposed similarity measure and of the proposed prediction method.

The prediction ability of the case 1, case 2 and case 3 algorithms is depicted in Fig. 1d, e, respectively. As we can see from Fig. 1d, across the different numbers of nearest neighbors, the RSP value of case 2 is about 5.71% higher on average than that of the case 1 algorithm. Meanwhile, the RSP value of case 3 shows a slight rise of about 1.61% compared with the case 2 algorithm. Moreover, as the number of neighbors increases, the gap between case 2 and case 3 becomes smaller. The case 3 algorithm has the highest RSP value, about 0.998, which remains unchanged. Figure 1e illustrates that case 2 obtains a better RPP value than the case 1 algorithm, and case 3 obtains a better RPP value than the case 2 algorithm. The main reason is that the similarity measure is not limited to co-rated cases and can calculate similarity results between any items. Meanwhile, our proposed prediction method does not strictly require that a given user has rated the nearest neighbors of the target item. Therefore, there are few restrictions in calculating predictive values.

In the same way, these three algorithms are executed on the ML-1M data set, and Fig. 2 depicts the results for each indicator. Similar results are obtained: the case 2 algorithm clearly outperforms case 1, and the case 3 algorithm performs better than case 2. Therefore, according to the analysis on the above two data sets with different degrees of sparsity, the case 3 algorithm has the best performance, which also supports the effectiveness of each component in our proposed model.

4.4.2 Comparative Analysis

In this section, to further validate the performance of our proposed model ISP, other state-of-the-art CF algorithms, including SPCC [16], JMSD [4], NHSM [15], BCF [21], KLCF [28], \(\alpha\)-CF [30], RJMSD [3] and OS [13], are implemented on the MovieLens data sets. We mainly discuss the results of ISP and these comparison algorithms for different numbers of nearest neighbors.

MAE and RMSE

Figure 3 portrays the results of prediction accuracy on the ML-100K data set. As K increases, the prediction errors of all algorithms decrease gradually. The ISP algorithm maintains the lowest MAE and RMSE values across the different K values. When the number of neighbors is less than 50, our proposed model achieves an average reduction of 2.6% in MAE and 3.2% in RMSE compared with the closest competitor (NHSM). The principal reason is that the ISP algorithm integrates more ratings when calculating item similarity, which makes the nearest neighbors more reliable. The prediction stage can also make full use of the rating information of the nearest neighbors to further improve the accuracy of the predictive values. Likewise, the results of prediction accuracy on the ML-1M data set are portrayed in Fig. 4. For every number of neighbors, the ISP algorithm attains the lowest prediction errors among all algorithms. When the number of neighbors is 100, MAE and RMSE reach their best values, which are about 0.712 and 0.916, respectively.

Fig. 3 MAE and RMSE values of all algorithms in ML-100K

Fig. 4 MAE and RMSE values of all algorithms in ML-1M

F1-value

Figures 5 and 6 illustrate the results of recommendation accuracy on the ML-100K and ML-1M data sets, respectively. As shown in Fig. 5, as the number of neighbors increases, the F1-value of each algorithm grows continuously. For every number of neighbors, the ISP algorithm maintains the best F1-value. When K is 100, the corresponding best F1-value is close to 0.702. It is worth noting that the F1-values of the BCF and KLCF algorithms are significantly higher than those of the other comparison algorithms. Furthermore, there is only a gap of about 1% between these two algorithms and ISP. According to the above analysis, we can conclude that a similarity measure based on probability distribution is an effective way to enhance the accuracy of recommendation. Similarly, Fig. 6 shows that ISP attains a better F1-value than the other comparison algorithms for every number of nearest neighbors. However, the NHSM and JMSD algorithms also perform well in F1-value and follow the trend of the ISP algorithm most closely. A possible reason is that the quantity information of co-rated cases has an important impact on the recommendation results in extremely sparse data sets. The NHSM and JMSD algorithms emphasize the proportion of co-rated cases, so they have better recommendation quality than the other comparison algorithms.

Fig. 5 F1-value of all algorithms in ML-100K

Fig. 6 F1-value of all algorithms in ML-1M

RSP and RPP

The RSP and RPP results on the ML-100K and ML-1M data sets are described in Figs. 7 and 8, respectively. As illustrated in Fig. 7a, the RSP values of the ISP algorithm are the best and remain about 0.998 across the different K values. Among the comparison algorithms, BCF and KLCF have better RSP values and behave almost identically. An important reason is that these two approaches can calculate similarity results between any users. Therefore, their computation results are more objective and contribute to generating more predictions. From Fig. 7b, it can be seen that the RPP values of all algorithms become higher as K increases. The RPP value of the ISP algorithm is the best, ranging from 0.41 to 0.44. Meanwhile, the RPP values of the NHSM and KLCF algorithms are closest to ISP, and both lie between 0.37 and 0.42. Similarly, as can be seen from Fig. 8a, the RSP value of ISP stays constant at about 0.999. Meanwhile, the RSP values of the NHSM, BCF and KLCF algorithms grow gradually as K increases, and their gap with the ISP algorithm becomes negligible. In particular, when K is 100, the gap between them is less than 1%. From Fig. 8b, we can conclude that, as the number of neighbors increases, the ISP algorithm always maintains the highest RPP value. In particular, when K is less than 50, ISP is clearly superior to the other comparison algorithms.

Fig. 7 RSP and RPP values of all algorithms in ML-100K

Fig. 8 RSP and RPP values of all algorithms in ML-1M

5 Conclusion

In this paper, to enhance the adaptability to sparse data, we propose an item-based CF model (ISP) composed of a new item similarity measure and a new prediction method. First, the association between fuzzy sets and recommendation systems is introduced. Based on this association, fuzzy sets on recommendation systems are defined and extended to vague sets. Next, we use KL divergence to calculate the similarity between vague sets from the perspective of user preference probability, that is, the similarity between the corresponding items. To further improve the accuracy of similarity results, the impact of rating quantity is considered in the similarity measure. Moreover, an item-based prediction method with a new neighbor selection strategy is proposed, which aggregates the information of more potentially valuable nearest neighbors to achieve higher prediction accuracy. Experiments on two data sets with different sparsity show that our ISP algorithm is less affected by data sparsity and can provide better recommendation performance.

For future work, we intend to explore how to better deal with the problem of data sparsity. Meanwhile, the relationship between accuracy and efficiency will be investigated to obtain better prediction and recommendation quality at a lower time cost.