Collaborative filtering recommendation algorithm based on user correlation and evolutionary clustering

In recent years, recommendation algorithms have become widespread in real-life applications such as Amazon and Taobao, but they are not yet perfect. Problems such as sparse data and low recommendation accuracy remain to be solved. Collaborative filtering is a mature technique in recommender systems, yet it still suffers from these problems. In this paper, a novel collaborative filtering recommendation algorithm based on user correlation and evolutionary clustering is presented. Firstly, the score matrix is pre-processed with normalization and dimension reduction to obtain denser score data. Based on the processed data, a clustering principle is generated and dynamic evolutionary clustering is implemented. Secondly, the search for the nearest neighbors with the most similar interests is considered. A measure of the relationship between users, called user correlation, is proposed; it draws on both the satisfaction of users and their potential information. In each user group, user correlation is applied to choose the nearest neighbors for rating prediction. The proposed method is evaluated on the MovieLens dataset. Diverse experimental results demonstrate that the proposed method achieves outstanding performance in prediction accuracy and recommendation precision.


Introduction
The development of the internet and e-commerce makes our lives more convenient, as billions of products are searchable online. At the same time, we face the problem of information overload in daily life. Under these circumstances, it is much harder than ever before to dig out the relevant objects that we really want [1]. Many researchers have studied recommendation systems and made progress on this issue, but data sparseness has always been an important cause of low recommendation accuracy. To make full use of existing information, researchers have proposed more and more excellent algorithms [2], such as neighborhood-based CF (Collaborative Filtering) and model-based CF.
Neighborhood-based CF algorithms are further classified into two categories: user-based CF [3] and item-based CF [4]. Their basic principles are closely related. For instance, user-based CF considers two users to be similar when their neighbors are similar, so the selection of nearest neighbors is significant. Correspondingly, choosing an appropriate similarity measure helps to improve both the accuracy and the applicability of the recommendation algorithm [5]. Some researchers take into account the score information that best reflects the user's preferences and have proposed similarity measures such as cosine similarity [6], the Pearson correlation coefficient [7], and adjusted cosine [8]. Others have put forward measures such as Salton similarity [6] and Jaccard similarity [9], which take into account the number of items of a user and of their neighbors. A vertex similarity index, CosRA, has also been proposed; it combines the advantages of the cosine index and the resource-allocation index [1]. This fusion enhances the overall performance of personalized recommendation.
Model-based CF algorithms learn a model from the training data and subsequently use the model for recommendations [10]; examples include matrix decomposition models [11][12][13] and clustering models [14][15][16]. Ref. [11] presented a collaborative filtering recommendation algorithm based on SVD smoothing. Their approach uses SVD to predict ratings for items that users have not rated, then uses Pearson correlation similarity to find the neighbors of the target user, and finally produces the recommendations; this alleviates the sparsity of the user-item rating data. CF algorithms combined with clustering techniques relieve the impact of data sparseness and cold-start problems, as has been verified by many experiments [17,18]. Recently, many scholars have also proposed dynamic evolutionary clustering algorithms [19][20][21], which, compared with the classical Kmeans algorithm, not only reduce time consumption and complexity but also do not require the number of clusters to be fixed in advance. Liao et al. [22] presented an approach that clusters on the user-product rating matrix without the need to collect extra attributes about customers and products, and in which clusters are formed automatically. Shang et al. proposed a novel fuzzy double trace norm minimization method for recommender systems with faster running time [23]. Recently, we have presented two novel recommender algorithms based on a heterogeneous network model [24] and a filled matrix [25]. Both combine dynamic clustering with traditional collaborative filtering to obtain better recommendation results.
Based on the above work, a novel recommendation algorithm is proposed in this paper. The main motivations are as follows. Firstly, to reduce complexity and time consumption, two steps are executed on the score matrix: normalization and dimension reduction. A clustering principle is then constructed, and a novel dynamic evolutionary clustering model is proposed to imitate the changes of different nodes in the network model. Secondly, to better capture the neighbors with the most similar interests, a new similarity index is proposed that combines the advantage of degrees with the potential information of nodes. Thirdly, ratings are predicted in each group based on user correlation.
The rest of the paper is structured as follows. Section "Evolutionary clustering algorithm" gives the evolutionary clustering model. In section "Similarity indices", we review existing similarity indices and put forward a new similarity index. Full description of our algorithm is presented in section "The UCEC&CF algorithm flow". Section "Experiment and studies" shows experimental results and comparisons with existing algorithms, and presents advantages of our algorithm. Section "Conclusion" concludes the paper.

Evolutionary clustering algorithm
Many complex social, biological, and information systems can be described by networks, where nodes indicate individuals and links represent the relationships or interactions between nodes [26]. The study of link prediction has become a common research focus. A link between two nodes can be inferred from the topology of the network or from the feature vectors of the nodes. In this paper, the network G(V, E) is constructed, where V is the set of nodes containing M users and N_1 items, and E is the set of scores given to items by users. Firstly, a clustering method is presented to gather users with similar interests into the same groups.

Data pre-processing
Clustering is usually divided into three steps: pre-process the data, choose an appropriate distance function to measure the similarity between objects, and finally obtain the groups. The most important step is choosing a suitable clustering principle. Ba et al. [12] proposed clustering all users by computing a user characteristic value from the attributes of users, but this method incurs a high computational cost. In this paper, to reduce complexity, only the score matrix is used to gather users into groups.
To apply the score information effectively, Ref. [22] normalized the ratings of user u_i according to Eq. (1), which reduces the influence of individual rating habits. We take the same approach: the original score matrix R = (r_ij)_{M×N_1} is processed by (1).
Here, r_ij denotes the rating of item j by user i, and m_i denotes the number of items rated by user u_i. At the same time, to obtain better recommendation results, items with fewer scores are removed, so that the sparse score matrix becomes relatively dense. The number of items changes from N_1 to N, with N_1 > N. This yields a new relationship between users and items, denoted as R = (r_ij)_{M×N}. The proportion of items removed is denoted as α, and the influence of α on the results is shown later in section "Experiment and studies". The adjacency matrix A of the network is constructed in a way similar to Ref. [24]. In this paper, A is an (N + M)-dimensional square matrix, and 0 denotes the zero matrix.
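The item removal and the construction of A can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the block layout of A (users first, then items, with the score matrix in the off-diagonal blocks) is an assumption consistent with "an (N + M)-dimensional square matrix" containing zero blocks, and the normalization of Eq. (1) is not reproduced because the equation is not shown in this text.

```python
import numpy as np

def drop_sparse_items(R, alpha):
    """Remove the fraction `alpha` of items (columns) that have the fewest ratings."""
    counts = np.count_nonzero(R, axis=0)           # number of ratings per item
    n_drop = int(np.floor(alpha * R.shape[1]))     # how many items to remove
    order = np.argsort(counts, kind="stable")      # fewest ratings first
    keep = np.sort(order[n_drop:])                 # keep the original column order
    return R[:, keep], keep

def bipartite_adjacency(R):
    """Assumed block layout: rows/columns 0..M-1 are users, M..M+N-1 are items."""
    M, N = R.shape
    A = np.zeros((M + N, M + N), dtype=float)
    A[:M, M:] = R        # user -> item scores
    A[M:, :M] = R.T      # item -> user scores (symmetric network)
    return A

# Toy score matrix: 3 users, 4 items, 0 means "not rated".
R = np.array([[5, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 2, 4, 0]], dtype=float)
R_dense, kept = drop_sparse_items(R, alpha=0.5)
A = bipartite_adjacency(R_dense)
```

With α = 0.5 the two least-rated items are dropped, leaving a 3 × 2 score matrix and a 5 × 5 adjacency matrix.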

Evolutionary clustering algorithm
Wu et al. [27] presented a generalized Kuramoto model to identify the community structure of positive networks. In that model, θ_i is the state of the i-th node, and ω_i denotes the initial state of node i, randomly generated from the interval [0, 2π]. N refers to the number of nodes in the network. If the i-th node is connected with the j-th node, a_ij = 1; otherwise, a_ij = 0. The parameter K_p > 0 makes the states of two connected oscillators synchronize, and K_n < 0 makes the states of two unconnected oscillators evolve far apart. Inspired by this, the model is applied to a heterogeneous network, and users are divided into different communities according to the relationship between users and items. The evolutionary clustering rule is designed as follows: if a user rates an item with a higher score, the two are regarded as belonging to the same group.
In this paper, the dynamic evolutionary clustering model is defined as follows. Here, N + M refers to the number of users and items in the network. The coupling parameters K_p > 0 and K_n < 0 make the nodes linked by higher scores evolve together and the nodes linked by lower scores evolve far apart, respectively. δ denotes the threshold separating high and low scores; its value is the median of the nonzero elements of the matrix A. The iteration begins from randomly generated initial values of the nodes. Ref. [27] has verified the convergence of the node evolution: after a certain number of iterations, the states of the nodes become stable. Nodes with nearby state values get closer and are divided into the same group, while nodes with very different state values are divided into different groups.
The threshold ε determines which nodes are assigned to the same cluster: nodes i and j are grouped together when |θ_i − θ_j| < ε. A higher ε places more nodes in the same cluster, and a lower ε places fewer nodes in the same cluster.
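The displayed equation of the generalized Kuramoto model is not reproduced in this text. One form consistent with the symbol definitions above is given below as an assumption, not as a quotation from Ref. [27]; in particular, treating ω_i as an additive per-node term follows the standard Kuramoto model and is not confirmed by the text.

```latex
\dot{\theta}_i \;=\; \omega_i
  \;+\; \frac{K_p}{N}\sum_{j=1}^{N} a_{ij}\,\sin(\theta_j - \theta_i)
  \;+\; \frac{K_n}{N}\sum_{j=1}^{N} (1 - a_{ij})\,\sin(\theta_j - \theta_i)
```

In this form the K_p term pulls connected nodes (a_ij = 1) toward a common state, and the K_n term, being negative, pushes unconnected nodes (a_ij = 0) apart.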
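The evolution and grouping described above can be sketched as follows. Because the model equation is not shown in this text, the update rule here is an assumption: node pairs with a_ij ≥ δ are coupled with K_p, all other pairs with K_n, both normalized by the number of nodes, and ω_i is used only as the random initial state. The Euler step size `dt` and the function names are hypothetical, and the grouping step chains nodes whose sorted states differ by less than ε, which is one possible reading of the condition |θ_i − θ_j| < ε.

```python
import numpy as np

def evolve_states(A, delta, k_p=10.0, k_n=-0.01, steps=100, dt=0.01, seed=0):
    """Euler integration of an ASSUMED thresholded Kuramoto-type rule."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)   # random initial states
    high = (A >= delta).astype(float)               # pairs linked by a high score
    low = 1.0 - high                                # low-score or unlinked pairs
    for _ in range(steps):
        # diff[i, j] = sin(theta_j - theta_i)
        diff = np.sin(theta[None, :] - theta[:, None])
        dtheta = (k_p / n) * (high * diff).sum(axis=1) \
               + (k_n / n) * (low * diff).sum(axis=1)
        theta = theta + dt * dtheta                 # states are not wrapped to [0, 2*pi)
    return theta

def group_by_state(theta, eps=0.001):
    """Label nodes so that neighbours in sorted order closer than eps share a group."""
    order = np.argsort(theta)
    labels = np.empty(len(theta), dtype=int)
    current = 0
    labels[order[0]] = current
    for prev, nxt in zip(order[:-1], order[1:]):
        if theta[nxt] - theta[prev] >= eps:         # gap too large: start a new group
            current += 1
        labels[nxt] = current
    return labels

# Grouping on hand-picked states (deterministic illustration).
example_labels = group_by_state(np.array([0.5, 0.5004, 2.0, 2.0003, 0.5008]))

# Evolution on a toy 4-node network with delta set to the median nonzero entry.
A_toy = np.array([[0, 0, 5, 1],
                  [0, 0, 4, 2],
                  [5, 4, 0, 0],
                  [1, 2, 0, 0]], dtype=float)
theta_toy = evolve_states(A_toy, delta=np.median(A_toy[A_toy > 0]))
```

Whether 100 iterations suffice for the states to settle depends on the step size and the network, so the sketch makes no claim about convergence.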

Local similarity indices
Most existing similarity indices fall into two categories: local and global similarity indices. The similarity indices [28] are defined in Table 1, where Γ(u) denotes the set of neighbors of u, Γ(uv) denotes the set of common neighbors of u and v, k(u) is the degree of user u, r_ui denotes the rating of item i by user u, and r̄_u is the historical average score of u.
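Table 1 is not reproduced in this text. As a small illustration of the notation, the sketch below computes two degree-based indices named in the Introduction, Salton and Jaccard, using their standard definitions; whether Table 1 uses exactly these forms is an assumption, and the function names are hypothetical.

```python
import math

def salton(gamma_u, gamma_v):
    """Common neighbours divided by sqrt(k(u) * k(v))."""
    denom = math.sqrt(len(gamma_u) * len(gamma_v))
    return len(gamma_u & gamma_v) / denom if denom else 0.0

def jaccard(gamma_u, gamma_v):
    """Common neighbours divided by the size of the union of neighbour sets."""
    union = gamma_u | gamma_v
    return len(gamma_u & gamma_v) / len(union) if union else 0.0

# Neighbour sets: the items each user has rated.
gamma_u = {1, 2, 3, 4}
gamma_v = {3, 4, 5}
s = salton(gamma_u, gamma_v)    # 2 / sqrt(4 * 3)
j = jaccard(gamma_u, gamma_v)   # 2 / 5
```

Both indices use only which items were rated, not the rating values.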

User correlation indices
In real life, people have different expectations of products. Those with lower expectations tend to give a high level of satisfaction feedback, while others are much pickier. For instance, user A, who has lower expectations, always gives 4 or 5 stars as product feedback; user B, who is pickier, rarely goes above 4 stars, which may be the highest score he or she has ever given. So when both of them give 4 stars to a product, the ratings actually mean different things. In this paper, we therefore take the user's preference into account when evaluating their feedback. On the other hand, different feedback from different users can be a good evaluation of a product. For example, user A gives 2 stars while user B gives 5 stars to the same product. Why did they buy this product in the first place? Perhaps they both liked its appearance. Why, then, is their feedback so different? The reason may be practicability, price, or even after-sales service. It is therefore difficult to figure out exactly why users give different evaluations of the same product. So, the paper takes the general evaluation into account, while requiring that the feedback concern the same product.
Based on the above reasons, the user correlation SIM_uv is calculated by (3). Here, B = {i | r_ui ≥ β and r_vi ≥ γ} and C = {i | r_ui < β and r_vi < γ}. N(u) denotes the set of items that user u has evaluated in the original score matrix R, and N(v) denotes the set of items that user v has evaluated in R. β and γ are two parameters; in our experiments they are set to the historical average ratings of users u and v, respectively. The numerator of SIM_uv considers the potential information of users and the denominator concerns the degrees of users, so the index captures more of the similarity between users u and v.
In our algorithm, we do not normalize the similarity indices. The similarity values are used to differentiate the influences of neighbors.
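The sets B and C can be computed directly from two users' ratings, as the sketch below shows. The formula for SIM_uv itself is not reproduced in this text, so the way the sets are combined here is an assumption: the numerator is taken as |B| + |C| and the denominator as the Salton-style degree term sqrt(|N(u)| · |N(v)|). Restricting B and C to co-rated items is also an assumption, and the function name is hypothetical.

```python
import math

def user_correlation(ratings_u, ratings_v, beta, gamma):
    """ratings_*: dict mapping item -> rating taken from the original matrix R."""
    common = ratings_u.keys() & ratings_v.keys()     # co-rated items only (assumption)
    # B: items both users rate at or above their own thresholds.
    B = {i for i in common if ratings_u[i] >= beta and ratings_v[i] >= gamma}
    # C: items both users rate below their own thresholds.
    C = {i for i in common if ratings_u[i] < beta and ratings_v[i] < gamma}
    # ASSUMED combination: agreement count over a degree-based denominator.
    denom = math.sqrt(len(ratings_u) * len(ratings_v))
    return (len(B) + len(C)) / denom if denom else 0.0

ratings_u = {"a": 5, "b": 2, "c": 4, "d": 1}
ratings_v = {"a": 4, "b": 1, "c": 2, "e": 5}
beta = sum(ratings_u.values()) / len(ratings_u)      # historical average of u
gamma = sum(ratings_v.values()) / len(ratings_v)     # historical average of v
sim_uv = user_correlation(ratings_u, ratings_v, beta, gamma)
```

In this toy case the users agree on item "a" (both high) and item "b" (both low) but disagree on item "c", so two of the three co-rated items contribute to the numerator.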

The UCEC&CF Algorithm Flow
In this paper, the Collaborative Filtering Recommendation Algorithm Based on User Correlation and Evolutionary Clustering (denoted as UCEC&CF) is proposed. The procedure of the algorithm is shown in Fig. 1.
Step 1 Generate the adjacency matrix. Input score matrix of users and items, normalize the matrix by (1), remove the items with fewer scores, and generate the adjacency matrix as A.
Step 2 Generate the initial states of nodes. States of nodes are randomly generated from the interval [0, 2π ].
Step 3 Update states of nodes. Given values for the coupling strength parameters K_p and K_n and the iteration threshold t, the states are updated by (2) until they become stable at certain values. In all the simulations of this paper, K_p = 10, K_n = −0.01, t = 100.
Step 4 Obtain the group structure. Nodes are divided into the same group if their state values meet the condition |θ_i − θ_j| < ε. In all the simulations of this paper, ε = 0.001.
Step 5 Calculate the user correlation matrix. The user correlation matrix SIM is calculated by (3) within each group, based on the original score matrix.
Step 6 Predict the ratings. For every rating to be tested from the target users, the prediction is calculated according to formula (4).
Here, r_ui denotes the rating of item i by user u, and r̄_u is the historical average score of u. SIM_uv denotes the relationship between users u and v. U is the neighbor set, which contains the k users most similar to user u in the same group according to our user correlation index.
Step 7 Top-N recommendation. All the predicted scores for the target user are sorted in descending order. The first N items are picked to generate a recommendation list, which is given to the target user.
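Steps 6 and 7 can be sketched as follows. Formula (4) is not reproduced in this text, so the sketch assumes the standard mean-centred weighted average used in user-based CF, which matches the symbols defined above (r̄_u, SIM_uv, neighbor set U). Restricting the neighbors to users who have rated the target item is a further assumption, and the function names are hypothetical.

```python
import numpy as np

def predict_rating(u, i, R, sim, group, k):
    """Mean-centred weighted prediction of user u's rating for item i (assumed form)."""
    rated = R > 0
    means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)   # historical averages
    # Candidate neighbours: same group as u, not u itself, and have rated item i.
    cand = [v for v in np.flatnonzero(group == group[u]) if v != u and rated[v, i]]
    cand.sort(key=lambda v: sim[u, v], reverse=True)
    neigh = cand[:k]                                           # k most similar users
    denom = sum(abs(sim[u, v]) for v in neigh)
    if denom == 0:
        return float(means[u])                                 # fall back to u's average
    num = sum(sim[u, v] * (R[v, i] - means[v]) for v in neigh)
    return float(means[u] + num / denom)

def top_n(u, R, sim, group, k, n):
    """Sort predicted scores of unrated items in descending order and keep the first n."""
    unseen = np.flatnonzero(R[u] == 0)
    scores = {int(i): predict_rating(u, i, R, sim, group, k) for i in unseen}
    return sorted(scores, key=scores.get, reverse=True)[:n]

R = np.array([[5, 3, 0],
              [4, 2, 5],
              [1, 5, 2]], dtype=float)
sim = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.1],
                [0.2, 0.1, 1.0]])
group = np.array([0, 0, 0])            # all three users in one group
pred = predict_rating(0, 2, R, sim, group, k=2)
recs = top_n(0, R, sim, group, k=2, n=1)
```

Because the similarity values act as weights, the more similar neighbor (user 1) dominates the prediction for user 0.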

Experiment and studies
All the algorithms are coded in Matlab 2013a; the configuration of the experimental platform is Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 64 GB memory. The operating system is Windows Server 2008 R2 Enterprise.

Dataset and evaluation metrics
We conducted experiments on the MovieLens dataset (http://grouplens.org/datasets/movielens/). This dataset from GroupLens is the most popular dataset in the field of recommender systems. Our evaluation is restricted to the 100 K and 1 M datasets; the former consists of 1682 movies, 943 users, and almost 100,000 known ratings on a scale from 1 to 5, and the latter contains 1,000,209 anonymous ratings of approximately 3900 movies made by 6040 users. To reduce the error, we conducted fivefold cross-validation, taking 80% of the available ratings as training data and the rest as test data.
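A fivefold split of this kind can be sketched as follows. This is a generic illustration that shuffles rating indices and uses each fifth in turn as the test set; how the authors actually partitioned the ratings is not stated, and the function name and seed are hypothetical.

```python
import numpy as np

def five_fold_splits(n_ratings, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 1/n_folds of the ratings."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_ratings)            # shuffle the rating indices once
    folds = np.array_split(perm, n_folds)
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train_idx, test_idx

splits = list(five_fold_splits(100))
```

With five folds, each split trains on 80% of the ratings and tests on the remaining 20%, and every rating appears in exactly one test set.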
In general, recommender systems are evaluated by accuracy, but the accuracy of rating prediction and that of Top-N recommendation are measured differently. In this paper, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are applied to verify the accuracy of prediction. The lower the MAE (or RMSE), the better the predicted ratings.
Here, n is the number of ratings to be predicted, and r_ui and r̂_ui are the real and predicted scores, respectively. We also evaluate our algorithm on the precision of Top-N recommendation, measured in terms of F for varying lengths of the recommendation list. The higher the F, the better the recommendation algorithm.
R_u denotes the set of users in the real high ratings (i.e., ≥ 3 in the MovieLens dataset) and T_u denotes the set of users in the predicted high ratings in the test set.
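The metrics can be sketched as follows. MAE and RMSE use their standard definitions. The formula for F is not reproduced in this text, so the sketch assumes the usual F-measure, the harmonic mean of precision and recall computed from the real high-rating set and the predicted high-rating set; the function names are hypothetical.

```python
import math

def mae(real, pred):
    """Mean absolute error over n predicted ratings."""
    return sum(abs(r - p) for r, p in zip(real, pred)) / len(real)

def rmse(real, pred):
    """Root mean squared error over n predicted ratings."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, pred)) / len(real))

def f_measure(real_high, predicted_high):
    """Assumed F: harmonic mean of precision and recall of the two sets."""
    hits = len(real_high & predicted_high)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted_high)
    recall = hits / len(real_high)
    return 2 * precision * recall / (precision + recall)

real = [4, 3, 5, 2]
pred = [3.5, 3, 4, 3]
mae_value = mae(real, pred)        # (0.5 + 0 + 1 + 1) / 4
rmse_value = rmse(real, pred)      # sqrt((0.25 + 0 + 1 + 1) / 4)
f_value = f_measure({1, 2, 3, 4}, {3, 4, 5})
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two are usually reported together.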

Experimental results
In this subsection, we test the accuracy of our algorithm UCEC&CF from multiple views, and compare its prediction accuracy and Top-N recommendation accuracy with those of other algorithms.
1. Comparing results of processing score matrix. To test the impact of normalization and dimension reduction on the score information, we compare our algorithm with DEHC-CF [24]. In that paper, the authors established a heterogeneous network and generated the groups based on score information. The parameters of the clustering principle for the five cross-validation sets are δ_1 = 0.0070, δ_2 = 0.0072, δ_3 = 0.0071, δ_4 = 0.0071, δ_5 = 0.0069, and α = 50% in the following. The parameters of our clustering algorithm are K_p = 10, K_n = −0.01. It can be observed from Fig. 2 that the error of our method is slightly lower than that of DEHC-CF in some cases. In particular, our method is more effective when the number of neighbors is 60. The mean MAE of DEHC-CF is 0.7491 and its mean RMSE is 0.9572. The mean MAE and RMSE of UCEC&CF are 0.7482 and 0.9560, respectively. This shows that processing the score matrix helps to improve recommendation precision. For the compared algorithms, F improves with increasing length of the recommendation list. Our algorithm performs better and recommends more satisfactory items to the user as the number of recommendations becomes larger.

2. Comparing results of different clustering methods. To test whether the clustering method of our algorithm is efficient, it is compared with three methods: DTD in Ref. [19], DPSO in Ref. [29], and Kmeans. Following those papers, the iteration threshold of DTD is 0.1. The parameters of DPSO are as follows: particle swarm size is 100, the number of interactions is 4, learning factor is 1.494, and the max and min inertia weights are 0.9 and 0.4, respectively. For Kmeans, three cluster numbers are considered: individuals in the network are divided into 3, 6, and 10 clusters. The test results are shown in Fig. 3 and Table 3. None of the compared results apply similarity index normalization. Figure 3 presents MAE and RMSE values for different settings of the neighborhood size. It can be observed that our method is more effective than the other clustering algorithms. DPSO produces groups that each contain only one user, so its prediction accuracy is independent of the number of neighbors. The mean MAE values of UCEC&CF, DTD, DPSO, and Kmeans are 0.7482, 0.7882, 0.8362, and 0.7557, respectively. The mean RMSE values of UCEC&CF, DTD, DPSO, and Kmeans are 0.9560, 1.0070, 1.0437, and 0.9668, respectively. Besides, our method is more convenient than the Kmeans algorithm, as it does not require a predefined number of clusters. Table 3 shows the F values for all the algorithms. UCEC&CF is more efficient than DTD and DPSO in terms of F values in all cases. In terms of Top-N recommendation accuracy, our algorithm is slightly inferior to the Kmeans algorithm as the length of the recommendation list increases. The F values of DPSO are the worst.

3. Comparing results of different similarity indices. Our user correlation results are compared with eight other similarity indices (see Table 1), and the results are shown in Fig. 4 and Table 4. None of the compared results apply similarity index normalization.
It can be seen from Fig. 4 that our method achieves the best performance. Some of the compared similarity indices consider only the degree of the nodes, and others are computed from users' ratings of the same items. Our algorithm obtains better recommendation results because the proposed user correlation considers both the degree of nodes and the score information.
To further check the performance of the proposed algorithm, Table 4 gives the recommendation accuracy of the algorithms for different numbers of recommendations. It can be seen that our method achieves better performance than the others.
5. Comparison with several existing recommendation algorithms. Collaborative filtering and matrix decomposition are representative of existing recommendation algorithms. To test the effectiveness of our overall algorithm, UCEC&CF is compared with DEHC-CF and DTNM [23] on two datasets: MovieLens100K and MovieLens1M. DEHC-CF is a novel heterogeneous evolutionary clustering algorithm based on user collaborative filtering, with coupling strengths K_1 = 20 and K_2 = −20. DTNM is a novel fuzzy double trace norm minimization method for recommender systems; its rank takes integer values between 5 and 15, which has little effect on the result, so DTNM is compared with rank = 5. The comparison results on the different datasets are illustrated in Fig. 6 and Table 5. Figure 6a, b gives the MAE and RMSE values, respectively, for the 100 K dataset. MAE and RMSE values for the 1 M dataset are given in Fig. 6c, d. Overall, the improvement in MAE and RMSE is more pronounced for the 1 M dataset than for the 100 K one, as the former has higher explicit rating data sparsity. The result of DTNM is independent of the number of neighbors because it is a rank-dependent matrix decomposition algorithm. As expected, UCEC&CF performs the best among all compared algorithms, which means that our work achieves better performance than DEHC-CF and DTNM. Our work broadly includes building heterogeneous networks, setting clustering principles, and defining user correlation. Table 5 compares UCEC&CF, DEHC-CF, and DTNM when items are recommended to users.
The experimental results demonstrate that our method achieves better performance than the other compared algorithms on different datasets. In short, in section "Experiment and studies", two real datasets of different sizes are considered to verify the efficiency of our proposed algorithm UCEC&CF. Five different experiments are conducted, and UCEC&CF achieves excellent recommendation results.

Conclusion
A new CF model based on user correlation and evolutionary clustering is proposed in this paper. The performance of the proposed method is verified by comparative experiments from different aspects. First, only the normalized and dimension-reduced score matrix is used to generate the clustering principle. An improved dynamic evolutionary model is constructed to imitate the constantly changing state values in the network, and similar users are divided into the same group. Second, user correlation is proposed to measure the distance between users by combining user satisfaction and potential score information. In each group, this user correlation measure is adopted to find the neighbors most similar to the target user. Finally, user-based collaborative filtering is applied in each group. Extensive experiments demonstrate that the proposed method achieves better recommendation performance than the other compared algorithms. There is still much room for improvement, however; in future work we intend to incorporate other network information into the design to further enhance performance.