A Correlation-Based Recommendation System for Large Data Sets

Correlation analysis reveals relationships in data that were not visible before, and harnessing this power is imperative for effective data mining. In this paper, we use correlations to cluster data and merge the resulting clusters with recommendation algorithms. We propose two correlation clustering algorithms, RBACC and LGBACC: the first finds Spearman's rank correlation coefficients among data points, while the second combines a dimensionality reduction approach (PCA) with graph theory; both produce high quality hierarchical clusters. Both algorithms have been tested on real-life data (the New York yellow cabs dataset, taken from http://www.nyc.gov) using distributed and parallel computing (Spark and R). They are found to be scalable and to perform better than existing hierarchical clustering algorithms. The two approaches are then used to replace the similarity measures in recommendation algorithms, yielding a correlation clustering based recommendation system model that combines the power of correlation analysis with that of prediction analysis. This model is found to make better quality recommendations than the random recommendation model, and has been validated on a large real-world dataset (the MovieLens dataset, taken from http://grouplens.org/datasets/movielens/latest). The results show that combining correlated points with the predictive power of recommendation algorithms produces better quality recommendations that are faster to compute. LGBACC has approximately 25% better prediction capability than RBACC, but takes significantly more prediction time.


Introduction
Correlation analysis can be defined as an efficacious method for studying the relationships among data points. Strong and weak correlations in data help in capturing current and future trends. There are many scenarios where correlations have made sense of seemingly irrelevant data and given accurate results. Over the years, with the help of experimentation and hypothesis testing, correlation analysis has evolved as a concept. These days, given the availability of high-powered computing and greater storage capacity, correlation analysis is applied to large and high-dimensional datasets; as a result, correlations in data can be surfaced rapidly and at little cost. Correlation methods aim at finding correlations with a focus on the "what" rather than the "why" aspect of the data. The "why" part of data might be very appealing to the human mind, but it does not help in generating useful insights about relationships between data points. The main idea is to find correlations and patterns between data points rather than to study cause-effect relationships; this helps to visualize links among data that were unseen before. The premise of this approach is that causality can rarely be proven [1]. Correlations help to analyze not only small datasets but also high-dimensional data. Nowadays, experts are developing important tools for identifying and comparing non-linear correlations, and analysis techniques are being improved by novel methods and software for extracting non-causal relationships among data [2].

Given the increasing use of intensive data collection techniques, there is a substantial increase in both the number of data points and the number of features in datasets. Feature extraction techniques applied to these datasets may not be very effective, as the extracted features may either exhibit false correlations with each other or be noisy. These irrelevant features must be disregarded by applying appropriate data mining techniques for the selection of relevant features. Consider a set of 'n' documents to be clustered into topics, with no prior information about the topics. Suppose we have a classifier f(A, B) that, given two documents A and B, outputs its belief about whether A and B are similar to each other, where the behavior of f has been learned from past training data. The most intuitive clustering technique in this case is to apply f to each pair of documents and to find a clustering that agrees with these results as much as possible. The most prominent challenge while clustering a high-dimensional data space is that feature relevance depends on the clusters the features belong to. Moreover, correlations among a dataset's different attributes may be relevant to disparate clusters. This relevance of feature correlation to particular clusters is termed local feature selection [3]. In a given dataset, detecting correlations between different features is a primary task of data mining. A higher degree of collinearity in a dataset means higher correlation among features, corresponding to the existence of approximate linear dependencies among two or more attributes. Complex dependencies may also exist, such that one or more features depend on a combination of several other features.
If correlations are known initially, the dimensionality of a dataset may be reduced by eliminating redundant features. A concept of knowledge discovery in databases known as correlation clustering [4] has been introduced for detecting dependencies among features and clustering the points that share common pattern dependencies. Correlation clusters are formed by grouping the data into subsets such that the data points in the same correlation cluster are associated with a common hyperplane of arbitrary dimensionality. Correlation clustering algorithms are also required to ensure a certain density, known as feature similarity. The concept of correlation clustering has been used in various application domains [5][6][7].

Contributions
The contributions of this paper are four-fold: 1) We introduce two algorithms for obtaining high quality correlation clusters; they are similar in that both produce high quality clusters, but they differ in the concepts they are based on. 2) The first algorithm is based on finding correlation coefficients among given data points and clustering the data based on these values. 3) The second algorithm merges a data reduction technique with concepts from graph theory and hierarchical correlation clustering; this approach is suitable for finding correlation clusters in high dimensional data. 4) We combine the power of correlation analysis with that of prediction analysis to propose a better recommendation system.
These two approaches have been used to replace the similarity measures in recommendation algorithms and generate a correlation clustering based recommendation system model. It is found that this model makes better quality recommendations than the random recommendation model.

Organization
The rest of the paper is organized as follows: Section 2 gives a brief summary of the work done in the related area of correlation clustering. Section 3 mathematically defines and interprets correlation clusters. Section 4 lists the preliminaries required for understanding the proposed algorithms. Section 5 introduces and explains the proposed correlation clustering algorithms, together with experimental evaluations validating the proposed approaches. Section 6 discusses the framework of the recommendation system model; it also contains the preliminaries for the proposed framework, followed by a detailed explanation and experimental evaluation of the model. Section 7 briefly concludes the work presented in this paper.

Related Work
Correlation clustering is a type of subspace clustering. It selects the number of clusters automatically and gives approximate solutions. This method focuses on the relationships among objects rather than on the actual object representations. Correlation clustering has become a paramount part of the data mining field, considering the ever increasing scale of data these days. The general idea is to discover a clustering that either minimizes disagreements or maximizes agreements between data points. Many methodologies, such as linear programming formulations and approximation algorithms, have been used to approach this problem. The problem of correlation clustering can be defined as follows: given a fully connected graph with edges labelled '+' for similarity and '-' for difference, find a partition of the vertices into clusters that agrees with the edge labels as much as possible [4]. The main aim is to maximize agreement with the positive edges in the desired partition; that is, the sum of positive edges within each cluster and the sum of negative edges between clusters need to be maximized. This method of clustering does not take the cluster count as a clustering parameter: the edge labels determine the optimal number of clusters, which can lie anywhere between 1 and n. We categorize correlation clustering algorithms by the dimensionality reduction approaches they are based on.

PCA Based Approaches
These cover the major share of correlation clustering approaches. HiCO [8], a hierarchical approach based on local correlation dimensionality, was proposed for defining the distance between data points and for calculating the subspace orientation of the data. Another approach, ORCLUS [6], assigns data objects to an initial set of k seeds based on an eigenvector distance function; the efficiency of the algorithm is increased by choosing a high value of k. A variant of ORCLUS was given by Li et al. [9]; it is more efficient, as it can produce high quality correlation clusters from noisy data. In the 4C algorithm [5], a cluster expands around a seed until a density criterion is fulfilled, specified by an upper bound on the number of data points that can lie within a defined neighborhood (a distance measure derived from the Eigen systems of two data points); the number of clusters is not decided in advance. COPAC [10] improves on the 4C algorithm by reducing the time complexity in the data dimensionality to d^2; it works by partitioning the data space so that the search is limited to clusters of equal local correlation dimensionality. ERiC [11] is another approach that takes an affine distance into consideration, where the neighborhood of each data point is defined by approximate linear dependency; as a result, a hierarchy of subspace clusters is formed.

Hough Transform Based Approaches
In the Hough transform, a trigonometric function is used to represent each data point's link with infinitely many points. A global subspace clustering approach is obtained through the Hough transform [12]. In the CASH algorithm [13], dense regions are carved out using a grid based methodology: the attributes are used to divide the data space, and the functions that intersect the hyperboxes created by this division are computed.

Other Approaches
Another approach, CURLER [20], detects non-linear and arbitrary correlations. It uses micro-clusters, generated by an EM variant, which are finally grouped to find correlation clusters. A linear relationship need not exist among the clusters. Moreover, it is assumed that every data object belongs to each cluster, but with a different probability for each cluster. In a short span, many subspace clustering algorithms [5, 14-17] have been put forward for finding clusters in axis-parallel projections of the data space. These algorithms do not capture local data for finding clusters of correlated objects, as the principal axes of correlated data are arbitrarily oriented. Finding subgroups that show a similar trend in subsets of attributes is done in pattern-based clustering methods [7, 18-20], also called bi-clustering or co-clustering. Pattern based clustering covers a unique form of clustering that considers only positively correlated attributes, excluding negative correlations as well as correlations that depend on two or more attributes.
There are not many techniques in the literature that successfully use the power of correlations for data mining. We propose two correlation clustering algorithms based on entirely different techniques: the first is based on calculating mathematical correlations between data points, while the other is based on the definition of correlation clusters. We then propose a recommender system model based on the two proposed algorithms, which merges the power of correlation clustering with that of prediction analysis to make better quality recommendations.

Understanding Correlation Clusters
This section formalizes correlation clusters and describes the considerations involved in their interpretation.

Eigenvectors: Strong and Weak
Let V be the set of eigenvectors of a cluster's covariance matrix, partitioned into two classes, strong and weak (denoted by S and W respectively). In non-technical terms, the eigenvectors that correspond to the least variance in the corresponding dimension (with eigenvalue λ ≈ 0) are termed weak eigenvectors, while those that correspond to high variance are strong. Without loss of generality, let us assume that the first s vectors are strong and the rest are weak. Now, the cluster hyperplane can be defined either by the strong eigenvectors, or by the weak eigenvectors perpendicular to this hyperplane. There is an additional benefit of working with weak eigenvectors: it allows us to examine the numerical dependencies between the attributes.

Defining Correlation Clusters
Suppose that there exists an s-dimensional cluster C in the original data space D of d dimensions; clearly C ⊆ D. Let the cluster C have s strong and d − s weak eigenvectors. The s-dimensional hyperplane defining cluster C passes through the mean point μ = (μ_1, ..., μ_d)^T and is orthogonal to the weak eigenvectors W. Thus the hyperplane can be defined by the equation system

W^T x = W^T μ,

which gives d − s equations in the d attributes. To find dependencies among the weak eigenvectors, the Gauss-Jordan elimination method [21] is used with total pivoting, which is numerically stable and simple to implement. Although the cluster model holds for all data points x ∈ C, it can also serve as a predictive test for new incoming data. For every cluster C_i ⊆ D the following calculations are done: 1) calculate the covariance matrix Σ_i; 2) choose the weak eigenvectors W_i from the covariance matrix with reference to a given threshold value τ; 3) solve the hyperplane equation W_i^T x = W_i^T μ_i for cluster C_i; 4) perform Gauss-Jordan elimination to reduced row echelon form, to solve for the mutual dependencies quantitatively.
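To make these four steps concrete, the following is a minimal sketch in R (the language used elsewhere in this paper), not the authors' exact implementation. It assumes a numeric matrix X holding the points of one cluster and a variance threshold tau (both placeholders), and uses the pracma package for the reduced row echelon form.

```r
library(pracma)  # for rref(): Gauss-Jordan elimination to reduced row echelon form

# X: numeric matrix (rows = points of one cluster), tau: eigenvalue threshold
correlation_hyperplane <- function(X, tau = 0.01) {
  mu    <- colMeans(X)                  # cluster mean
  Sigma <- cov(X)                       # covariance matrix of the cluster
  eig   <- eigen(Sigma)                 # eigenvalues sorted in decreasing order
  weak  <- eig$values < tau             # weak eigenvectors: lambda ~ 0
  W     <- eig$vectors[, weak, drop = FALSE]

  # Hyperplane equation system W^T x = W^T mu, one row per weak eigenvector
  A <- t(W)
  b <- as.vector(t(W) %*% mu)

  # Row-reduce the augmented system to expose quantitative dependencies
  list(mean = mu, weak = W, equations = rref(cbind(A, b)))
}
```

Evaluating W^T x for a new point x and comparing it with W^T μ gives the predictive test mentioned above.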

Interpretation of Correlation Clusters
Suppose the solution obtained from the above system, in a five-dimensional space, is a set of linear equations relating the attributes. The strong eigenvectors then correspond to free attributes, with linear dependencies among the remaining vectors for the given values of the constants. To understand these relationships, the domain knowledge of experts is often a prerequisite. The dependence of an eigenvector can also be highlighted through experimentation: for example, increasing or decreasing the values of {w_1, s_3, w_5}, that is, choosing one vector from each equation and watching the effect on the other variables, will reveal patterns. Further refinement can be done by simply trimming the related models generated by dependent eigenvectors. A set of dependent vectors may also be represented by a single new variable. Hence, modeling sections of a large and complex structured system and performing experiments to examine their outcomes may be extremely useful for analyzing the nature of causal relationships among eigenvectors.

Preliminaries
In this section, we present the concepts and notation used to describe the proposed algorithms.
• PCA (Principal Component Analysis): PCA is a dimensionality reduction technique that aims at constructing data vectors that capture the variance in the given data. Once the data is imported into the computing environment, it is standardized and the covariance matrix is computed. Subsequently, eigenvalues and eigenvectors are calculated from the covariance matrix. These values are then arranged in decreasing order, and the components lower down the list are ignored in order to obtain the principal components. These components are then arranged to form a feature vector matrix, which is used to obtain the final data via the formula: Final Data = Row Feature Vector × Row Data Adjust. Here, Row Feature Vector is the matrix of eigenvectors transposed so that the eigenvectors are in the rows, with the most significant eigenvector at the top, and Row Data Adjust is the mean-adjusted data transposed, i.e., the data items are in the columns, with each row holding a separate dimension. The final dataset is derived by taking data items along the columns and dimensions along the rows.
• Local covariance matrix: The local covariance matrix Σ_P of a point P ∈ D with respect to its k nearest neighbors (k ∈ N, k ≤ |D|) is the covariance matrix of those neighbors. Let Y be the centroid of NN_k(P); then [22]:

Σ_P = (1 / |NN_k(P)|) · Σ_{x ∈ NN_k(P)} (x − Y)(x − Y)^T

• Correlation similarity matrix of a point: Consider a point P ∈ D, and let V_P and E_P be the eigenvectors and eigenvalues of the local covariance matrix of P, respectively. Let C be a constant with C ≥ 1. A new eigenvalue matrix Ê_P with diagonal entries ê_i is calculated, where the eigenvalues are first normalized into [0, 1]. The correlation similarity matrix is represented as B̂_P = V_P · Ê_P · V_P^T; it is a symmetric matrix and constitutes the correlation similarity measure associated with point P.
• Affinity matrix: It represents affinity scores between two objects. It is computed by applying k-nearest neighbors in order to build a matrix of closest data points. The affinity matrix turns clustering into a graph partition problem, where graph components are interpreted as clusters. The graph must be partitioned so that the edges connecting different clusters have low weights, while the edges within the same cluster have high weights [23]. We consider 3 nearest neighbors for each point, including the point itself. Given the similarity matrix as input, the output of the clustering contains at most three vertices of the graph G'; here, G' is a complete graph formed out of a larger graph H. Given two data points x_i and x_j, the affinity A_{i,j} is positive and symmetric, and depends on the Euclidean distance ‖x_i − x_j‖ between the two data points. The affinity matrix is defined as:

A_{i,j} = exp(−α‖x_i − x_j‖²)

where α is a constant.
• Graph Laplacian transform: The eigenvalues of the graph Laplacian L_G of a graph G reveal global graph properties that are not apparent from the edge structure. If 0 is an eigenvalue of L_G with multiplicity K, i.e., 0 = λ_1 = λ_2 = ... = λ_K, then G has K connected components. If the graph is connected, λ_2 > 0, and λ_2 is the algebraic connectivity of G: the greater the value of λ_2, the more connected G is [24]. The unnormalized graph Laplacian is given by U = D − A, and the simple Laplacian by H = D^{−1}A, where A is the affinity matrix, D is the degree matrix, and D^{−1}A is a transition matrix.
• Hierarchical agglomerative clustering: Hierarchical agglomerative clustering based on correlation coefficients is an application of correlation analysis. The result of a hierarchical algorithm is usually described in the form of a dendrogram, where the root represents the whole dataset and each leaf node represents a data object. AGNES (Agglomerative Nesting) [25] is the hierarchical clustering approach used in this paper. In this approach, the user does not specify a value of k (the number of clusters); the algorithm constructs a tree-like hierarchy that implicitly contains all values of k. When performing cluster analysis on a set of observed variables, one can compute a parametric or non-parametric correlation coefficient between pairs of variables and convert the coefficients into a dissimilarity matrix. Based on this matrix, the data points are fused together into clusters following an agglomerative approach. The authors of [25] use a graphical representation known as a banner to represent clustering by AGNES. It looks like a waving flag and consists of stars and stripes: the stars indicate the linking of objects, and the stripes are repetitions of the objects' labels. A banner makes it easy to navigate the structure of the dataset at a given level; that is, for a particular value of k, a banner shows how the data points are clustered. The steps of agglomerative hierarchical clustering are given in Algorithm 1.

Algorithm 1 Agglomerative hierarchical clustering (AGNES).
Require: dissimilarity matrix d(i, j) for n objects
1: Assign each object to its own cluster
2: Merge the two clusters with the smallest (group average) dissimilarity
3: Update the dissimilarity matrix
4: if more than one cluster remains then
5:    Go to Step 2
6: end if
7: Exit

• Agglomerative coefficient: It is calculated from the banner produced as graphical output of the algorithm AGNES. For each object i, the line containing its label is identified and its length l(i) is measured on the 0-1 scale of the banner. Equivalently, let d(i) denote the dissimilarity of object i to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm; then the agglomerative coefficient (AC) of the dataset is defined as [25]:

AC = (1/n) · Σ_{i=1}^{n} (1 − d(i))

The AC is a dimensionless quantity between 0 and 1, since each l(i) lies between 0 and 1. It is the average width of the banner, and it remains the same even when all the original dissimilarities are multiplied by a constant factor. AC measures the strength of the clustering structure obtained by group average linkage.
• Fowlkes-Mallows index: It is an external evaluation method for determining the quality of a clustering. It is a measure of similarity, comparing either the given clustering with a hierarchical clustering, or the given clustering with a benchmark classification. A higher value of the FM index indicates a greater similarity between the clusters and the benchmark classification. The FM index can be calculated as follows [26]: consider two hierarchical clusterings of n objects, labeled A_1 and A_2. The trees obtained from A_1 and A_2 can be cut to produce k = 2, ..., n − 1 clusters for each tree. For each value of k we obtain the matrix [m_{i,j}], where i = 1, ..., k and j = 1, ..., k, and m_{i,j} gives the number of objects common to the i-th cluster of A_1 and the j-th cluster of A_2. The Fowlkes-Mallows index for a given value of k is defined as:

B_k = T_k / √(P_k · Q_k), where T_k = Σ_{i,j} m_{i,j}² − n, P_k = Σ_i (Σ_j m_{i,j})² − n, Q_k = Σ_j (Σ_i m_{i,j})² − n

• Spearman's rank correlation coefficient (ρ): It is a rank based correlation coefficient, given by:

ρ = 1 − (6 · Σ_i d_i²) / (n(n² − 1))

where d_i = x_i − y_i is the difference between the ranks of the observations. The value of the correlation coefficient lies between -1 and 1 and assesses the monotonic relationship between the rank values of the two variables.
• Correlation and dissimilarity matrix: Once the data is imported into the computing environment, ρ is used to compute correlation coefficients between paired variables and generate a correlation matrix. The values in this matrix lie between -1 and 1. This correlation matrix acts as the input for the distance or dissimilarity matrix. Dissimilarities are always positive and are represented as d(i, j); they are small when i and j are close to each other and become large when i and j are very different. It is assumed that dissimilarities are symmetric and that the dissimilarity of an object to itself is zero. In this paper, dissimilarities are calculated from the correlation matrix using the formula [25]:

d(f, g) = (1 − R(f, g)) / 2

where R(f, g) is the correlation coefficient value. A short R sketch illustrating these quality measures follows this list.
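As referenced at the end of the list above, the following illustrative R sketch ties the last three items together: it computes a Spearman correlation matrix between variables, converts it to dissimilarities with d = (1 − R)/2, clusters with AGNES, reads off the agglomerative coefficient, and computes the Fowlkes-Mallows index against a benchmark labeling via the dendextend package. The data matrix X and benchmark vector truth are placeholders.

```r
library(cluster)     # agnes()
library(dendextend)  # FM_index()

set.seed(1)
X     <- matrix(rnorm(100 * 12), nrow = 100)   # placeholder data: 12 variables
truth <- sample(1:3, 12, replace = TRUE)       # placeholder benchmark labels

R <- cor(X, method = "spearman")               # Spearman correlation matrix
d <- as.dist((1 - R) / 2)                      # d(f,g) = (1 - R(f,g)) / 2

ag <- agnes(d, method = "average")             # AGNES, group average linkage
ag$ac                                          # agglomerative coefficient (AC)

labels <- cutree(as.hclust(ag), k = 3)         # cut the tree at k = 3
FM_index(labels, truth)                        # Fowlkes-Mallows index vs. benchmark
```

Under the paper's conventions, higher AC and FM values indicate a stronger, higher quality clustering structure.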

Proposed Correlation Clustering Algorithms
We propose two correlation clustering algorithms based on entirely different techniques. The first one is based on calculating mathematical correlations between data points, while the other is based on the definition of correlation clusters, as defined in the previous sections.

Rho Based Correlation Clustering (RBACC)
We present a correlation clustering based algorithm that does not require the number of clusters to be specified in advance. The algorithm creates a tree-like hierarchy of data points. It is an agglomerative hierarchical clustering algorithm based on mathematical correlations between data points, where the value of correlation is given by Spearman's rank correlation coefficient (ρ). The focus of this approach is to produce high quality clusters. The detailed steps of RBACC are given in Algorithm 2. The first step of the algorithm converts the data into a data matrix; this gives structure to the data, as the input data may be semi-structured or unstructured. The data matrix is then used to generate a correlation matrix using ρ: pairs of attributes are selected from the matrix and correlations are computed between them. A dissimilarity matrix is generated from the correlation matrix and acts as the input for the agglomerative hierarchical clustering algorithm (AGNES). The next step is to check the quality of the clusters produced; we use the agglomerative coefficient and the Fowlkes-Mallows index for this purpose. RBACC is a simple algorithm that has evolved around the idea that clusters generated by determining the correlations between data points are high quality and distinctive in nature.
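Since Algorithm 2 itself is not reproduced here, the following hedged sketch shows one way the RBACC pipeline described above might be driven from SparkR (the environment used in the experiments), with the correlation step performed on a sample collected to the driver. The S3 path, sample fraction, and column selection are placeholders, not the authors' settings.

```r
library(SparkR)
sparkR.session()

# Placeholder: a large dataset stored on S3, read through the data source API
trips <- read.df("s3://bucket/yellow-taxi/*.csv", source = "csv",
                 header = "true", inferSchema = "true")

# Collect a manageable random sample to the driver for correlation clustering
local_sample <- collect(sample(trips, withReplacement = FALSE, fraction = 0.001))
num <- local_sample[sapply(local_sample, is.numeric)]  # keep numeric attributes

# RBACC core: Spearman correlation matrix -> dissimilarity -> AGNES
R  <- cor(num, method = "spearman", use = "pairwise.complete.obs")
d  <- as.dist((1 - R) / 2)
rb <- cluster::agnes(d, method = "average")
rb$ac   # agglomerative coefficient as a first quality check
```

The cluster quality would then be checked with the AC and FM index, as in the sketch given in the Preliminaries.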

Locality Assumption Graph Based Correlation Clustering (LGBACC)
In high-dimensional data, clusters often exist in the form of complex hierarchical relationships. In order to explore these relationships, there is a need to integrate dimensionality reduction techniques with data mining approaches and graph theory. We propose an algorithm that integrates the basic elements of PCA with those of graph theory to produce high quality hierarchical clusters. This algorithm is based on the locality assumption, i.e., it is assumed that all the clusters lie on a common hyperplane.
The detailed steps of this approach are given in Algorithm 3. Once the data is imported into the computing environment, it is standardized; this data pre-processing step helps to get the data into a standard, structured format. The next step is to reduce the dimensionality of the data. We use PCA to perform this operation: a new dataset S1 is derived by selecting principal components from the eigenvector and eigenvalue matrices calculated from the covariance matrix. To integrate the reduced data with graph theory, we compute the local covariance matrix; from this matrix, the similarity and subsequently the affinity matrices are generated. In the next step, the simple graph Laplacian is applied and a new dataset S2 is generated. This dataset is used to generate clusters using AGNES, and the quality of these clusters is checked using the agglomerative coefficient and the Fowlkes-Mallows index.
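A minimal local (non-distributed) sketch of the LGBACC steps in R is given below, under the assumptions that the Gaussian form A_{i,j} = exp(−α‖x_i − x_j‖²) from the Preliminaries is used for the affinity matrix and that the embedding S2 is taken from the leading eigenvectors of the simple Laplacian H = D^{−1}A; the variable names, variance cutoff, and choice of α are illustrative only.

```r
library(cluster)  # agnes()

set.seed(1)
X <- scale(matrix(rnorm(100 * 6), nrow = 100))   # placeholder data, standardized

# Steps 1-2: PCA; keep components by cumulative explained variance
pc   <- prcomp(X)
keep <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) <= 0.9)
S1   <- pc$x[, keep, drop = FALSE]               # reduced dataset S1

# Step 3: affinity matrix A_ij = exp(-alpha * ||x_i - x_j||^2)
alpha <- 0.5
A <- exp(-alpha * as.matrix(dist(S1))^2)

# Step 4: simple graph Laplacian H = D^{-1} A and its spectral embedding
D_inv <- diag(1 / rowSums(A))
H  <- D_inv %*% A
ev <- eigen(H)
S2 <- Re(ev$vectors[, 1:3])                      # embedding from leading eigenvectors

# Steps 5-6: cluster S2 with AGNES and check quality via the AC
lg <- agnes(dist(S2), method = "average")
lg$ac
```

In the paper's distributed setting these steps run over SparkR on the full data; the sketch compresses them into one in-memory pass for clarity.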
Both proposed approaches are similar in that they produce quality clusters and overcome the problem of specifying the number of clusters in advance. There are many points of differentiation as well: a comparison of RBACC and LGBACC is given in Table 1, and the points of differentiation between different correlation clustering approaches are given in Table 2.

Experimental Evaluation
To test the proposed algorithms on the basis of the stated parameters, we have performed experiments on four datasets using SparkR (R on Spark). SparkR is an R package that provides a front end for using Apache Spark from R. It provides a distributed data frame implementation that supports common operations on large datasets, and it supports distributed machine learning using MLlib. We have used Amazon Simple Storage Service (S3) buckets to store our large datasets and piped the data into the SparkR environment using the data source API. The SparkR architecture used for experimentation is given in Fig. 1. We have used the New York yellow cabs dataset taken from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. This is a very large dataset with approximately 1.3 billion data points, occupying approximately 260 GB on disk; the data covers January 2009 to December 2016. The cluster quality results are given in Table 3. We find that LGBACC generates better quality clusters than RBACC: although both produce high quality clusters, LGBACC is a little better. The visualizations of the experiments performed on LGBACC and RBACC are given in Figs. 2 to 7. All of these figures represent a data sample of 23 weeks of taxi activity, chosen randomly from the dataset; this has been done for clear visibility of the data points. Figure 2 shows a correlation plot depicting correlations between clusters formed on the basis of taxi activity on each day of the week; it can be inferred that neither a positive nor a negative correlation exists between these clusters. Figure 3 shows data clustering by RBACC, represented in the form of bar plots. Figure 4 is a more systematic representation of this clustering in the form of a line graph: the clusters are clearly represented by different colored lines, and the variation in data points is shown by the slope of each line. Figure 5 represents the clustering performed in the first half of the LGBACC algorithm, before PCA. Figure 6 shows the percentage of explained variance against the dimensions, used to perform PCA on the dataset. Figure 7 is a representation of the variance factor map for PCA; these figures show the PCA phase of the LGBACC algorithm in detail. Figure 8 shows a 3D representation of the hierarchical clustering performed on the dataset produced after PCA (S2). Figure 9 shows a cluster plot of the distinct quality clusters produced by the LGBACC algorithm. (Table 2 compares the proposed clustering algorithms with state-of-the-art algorithms. The LGBACC steps summarized there follow Algorithm 3: select principal components and derive the new dataset S1; compute the local covariance matrix; generate the local similarity matrix B̂_P = V_P·Ê_P·V_P^T and the affinity matrix from S1; apply the simple graph Laplacian and derive the new dataset S2; generate clusters using AGNES (Algorithm 1); and check cluster quality.) We have used the look-alike model to extend this dataset and generate four different datasets, in order to test the scalability of the proposed approaches and compare them with existing hierarchical clustering algorithms. The details of the extended datasets in terms of data points and size are given in Table 4. The results of this analysis are given in Fig. 10. A linear relationship is observed between the execution times of the algorithms.
It is observed that LGBACC has the fastest execution time compared to the other algorithms. For DS1, at 25 nodes, it is 19% faster than RBACC and 35.3% faster than the single link hierarchical clustering algorithm, which takes the maximum time of all the algorithms considered in this experiment. For DS3, at the minimum number of nodes (5), it is 4.2% faster than RBACC, and 5.3% faster at 20 nodes on the same dataset. For the largest dataset (DS4), LGBACC is again the fastest at all node counts, with the average link hierarchical clustering algorithm a close second and RBACC trailing very closely. LGBACC is a little faster than average link clustering due to the time reduction achieved by applying PCA. RBACC is better than the single and complete link hierarchical clustering algorithms in terms of execution time.

Recommendation System Model
We now propose a recommender system model based on the proposed algorithms. This model merges the power of correlation clustering with that of prediction analysis to make better quality recommendations. The flow of the model is given in Fig. 11. The measures and terms used in the model are given below:

Preliminaries
• Recommendation algorithms: There are many algorithms that can be used to make recommendations.
We list the algorithms defined in the recommenderlab package of the R language: -User based collaborative filtering (UBCF) [28]: It analyzes rating data collected from many individuals. The assumption is that users with similar preferences rate items similarly. Missing ratings can be predicted by finding a cluster of similar users and aggregating their ratings to make predictions. The clusters are defined using similarity measures that either take the users with maximum similarity or take all users above a particular similarity threshold.
-Item-based collaborative filtering (IBCF) [29]: This approach is based on a rating matrix; recommendations are made based on relationships between items that can be inferred from the rating matrix. It rests on the assumption that users will go for items similar to the other items they have liked. -User and Item-based collaborative filtering using 0-1 data [30]: This method is used in situations where little rating data is available. Usage behavior is analyzed to infer preferences, and the information is represented as 0's and 1's: 1 means that the user has a preference for a product, and 0 means no preference. -Recommendations for 0-1 data based on association rules [31]: The recommendations are made based on association rules mined from the 0-1 usage data. • k-fold cross validation [32]: The dataset is split into k sets (called folds) of approximately the same size. The evaluation is done k times, using one fold as the test fold while all the other folds are used for learning. This is a robust approach for evaluating recommender algorithms, making sure that every user appears in the test data at least once. Averaging across folds ensures robust results and error estimates. • Evaluation of predicted values: The best way to evaluate a predicted value is to use measures such as Mean Average Error (MAE), Mean Square Error (MSE) and Root Mean Square Error (RMSE) [33].
• MAE: This measure evaluates a prediction by computing the deviation of the prediction from the true value. It is calculated as:

MAE = (1/|κ|) · Σ_{(i,j)∈κ} |r̂_ij − r_ij|

where κ is the set of all user-item pairings (i, j), r̂_ij is the predicted rating, and r_ij is the known rating that was not used to learn the model. • RMSE: This is another popular accuracy measure. It detects larger errors better than MAE and is suitable for situations where small prediction errors are not very important to find. It can be computed as:

RMSE = √( (1/|κ|) · Σ_{(i,j)∈κ} (r̂_ij − r_ij)² )

• Precision [34]: It is an information retrieval measure that evaluates recommender performance using:

Precision = Correctly recommended items / Total recommended items    (11)

In terms of the confusion matrix given in Table 5, Precision can be expressed as TP / (TP + FP). • Recall [34]: It uses useful recommendations to define a measure for evaluating recommender performance. It is calculated as:

Recall = Correctly recommended items / Total useful recommendations    (13)

In terms of the confusion matrix, it is represented as TP / (TP + FN). • ROC (Receiver Operating Characteristic): It is a method for comparing two classifiers at different parameter settings. The ROC curve plots the system's true positive rate (TPR, also known as sensitivity) against the false positive rate as the model parameters vary.
The efficiency of two systems can be compared by looking at the area under the ROC curve: a bigger area indicates better performance. A short sketch computing these metrics follows.
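To tie these definitions together, here is a small self-contained R sketch computing MAE, RMSE, and confusion-matrix based Precision and Recall; the predicted/true rating vectors and the confusion-matrix counts are placeholders.

```r
# Placeholder predicted and known ratings over the pairs in kappa
r_hat <- c(4.2, 3.1, 5.0, 2.4)
r     <- c(4.0, 3.5, 4.5, 2.0)

mae  <- mean(abs(r_hat - r))        # MAE  = (1/|kappa|) * sum |r_hat - r|
rmse <- sqrt(mean((r_hat - r)^2))   # RMSE = sqrt((1/|kappa|) * sum (r_hat - r)^2)

# Placeholder confusion-matrix counts for a top-N recommendation list
TP <- 30; FP <- 10; FN <- 15

precision <- TP / (TP + FP)         # correctly recommended / total recommended
recall    <- TP / (TP + FN)         # correctly recommended / total useful
```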

Model Explanation
The phases of the recommender system model presented in Fig. 11 are explained as follows: • Data pre-processing phase: After the dataset is imported into the computing environment, it is organized so that it contains the attributes required to make recommendations. Once the data is in the desired form, it is converted into a data matrix, known as the search matrix. • Select parameters of the recommender model: The search matrix acts as the input for running a correlation clustering algorithm such as LGBACC or RBACC. Once correlation clusters are obtained, a recommendation algorithm is selected depending on the kind of recommendations to be made. The next step is to define the parameters of the selected algorithm, in preparation for defining the correlation clustering based recommendation system model. The core of this model is forming a relationship between the generated correlated data points and the parameters of the recommendation algorithm. • Build the recommender system model: Once the desired data is obtained and the parameters of the recommendation algorithm are understood, the default parameters of the recommender model are defined. The next step is to take the clustered dataset and split it into training and test datasets. The model is then trained and applied to the test dataset; this is the core of the proposed framework. The last step of this phase is to explore the results obtained. • Evaluate the recommender system model: The k-fold cross validation method is used to validate the results obtained in the last phase. The next step is to evaluate prediction accuracy using measures such as RMSE, MSE and MAE. The recommendations are then evaluated using metrics like Precision and Recall. The last step is to visualize the quality of the recommendations by plotting curves such as ROC and Precision/Recall. A sketch of the build-and-evaluate phases in R is given below.
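As referenced at the end of the list above, the following is a hedged sketch of the build-and-evaluate phases using the recommenderlab package, from which the paper's algorithm list is drawn. The rating matrix is stood in for by recommenderlab's bundled MovieLense sample (the real pipeline would feed in the clustered search-matrix data); the 80/20 split, cosine similarity, k = 30, and 4-fold cross validation mirror the setup described in the experiments below, while `given` and `goodRating` are illustrative.

```r
library(recommenderlab)

data(MovieLense)   # small MovieLens sample bundled with recommenderlab
rm <- MovieLense   # stand-in for the clustered, normalized rating data

# Build: 80/20 train/test split; 'given' items are revealed per test user
scheme <- evaluationScheme(rm, method = "split", train = 0.8,
                           given = 10, goodRating = 4)

rec <- Recommender(getData(scheme, "train"), method = "IBCF",
                   parameter = list(method = "Cosine", k = 30))

# Predict ratings for test users and evaluate accuracy (RMSE, MSE, MAE)
pred <- predict(rec, getData(scheme, "known"), type = "ratings")
calcPredictionAccuracy(pred, getData(scheme, "unknown"))

# Evaluate: 4-fold cross validation over top-N recommendation lists
cv <- evaluationScheme(rm, method = "cross-validation", k = 4,
                       given = 10, goodRating = 4)
results <- evaluate(cv, method = "IBCF", type = "topNList", n = c(1, 5, 10, 20))
plot(results, "prec/rec")   # Precision/Recall curve
```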

Experimental Evaluation
We validate the recommendation system model by using a real world MovieLens dataset, taken from http://grouplens.org/datasets/movielens/latest. This dataset has 24,000,000 ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users. The dataset is approximately 1 GB in size and has been used by many researchers to make recommendations [35][36][37][38]. The latest version of the dataset contains information about users from the year 1996 to 2016.
There are four linked files, namely links.csv, movies.csv, ratings.csv and tags.csv. For testing the recommender model, only movies.csv and ratings.csv have been used. The descriptive statistics of the two files are given in Tables 6 and 7. The search matrix is prepared by extracting a list of genres and making each genre a separate attribute. There are 18 genres; combining these with the attributes of 'movies.csv', the search matrix has 20 attributes. We run both the LGBACC and RBACC algorithms on this matrix; these algorithms group similar data points together. We explore the data further and map the search matrix with 'ratings.csv' to prepare the data for making recommendations. We also normalize this data to make it cleaner for the next step of the recommendation system model. As the recommendation algorithm, we select the Item-based collaborative filtering (IBCF) model for our analysis. Looking at the requirements of the algorithm, we determine that its defining parameters are user ratings and preferences; these are the building blocks of the recommender system model. We need to link the correlation clusters with the parameters of the recommendation algorithm: linking the correlated data points to the user ratings and preferences defines the correlation clustering based recommender system model. The next step is to set the default parameters of the IBCF model: we set method = cosine, and k (the number of items for which to compute similarities) to 30. We then build the model by splitting the whole dataset into 80% training data and 20% testing data, and apply the recommender system to the dataset by training and testing. We validate the results obtained by running a 4-fold cross validation model. The results of this analysis in terms of RMSE, MSE and MAE are given in Table 8.
We see that LGBACC has approximately 25% better prediction capability than RBACC. Running this cross validation model for all the folds, we conclude that the model-building time for LGBACC is much less than that of RBACC, but that LGBACC takes significantly more prediction time. Next, we evaluate the recommendations by calculating Precision and Recall from the confusion matrices obtained for LGBACC and RBACC, and plot the ROC and Precision-Recall curves for both algorithms from these matrices. Looking at the Precision-Recall curves in Figs. 12 and 13, we see that LGBACC has a better Precision-Recall ratio than RBACC; however, the area under the ROC curve for LGBACC is smaller than that of RBACC, as depicted in Figs. 14 and 15. We also compare against the existing recommendation models using the IBCF and UBCF schemes: we create recommendation system models for both LGBACC and RBACC and compare them with the existing random recommendation system model. Precision-Recall and ROC curves are plotted for all the algorithms in Figs. 16 and 17. It can be observed that the random recommendation algorithm, which does not use the power of correlations, performs the worst on both criteria, while LGBACC and RBACC using UBCF perform the best. This analysis shows that integrating correlated points with recommendation algorithms makes better quality recommendations.
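A hedged sketch of this comparison stage: recommenderlab's evaluate() accepts a named list of algorithms, so the random baseline and the IBCF/UBCF variants can be run under one evaluation scheme and their ROC and Precision-Recall curves plotted together. The algorithm list and parameters below are illustrative and omit the correlation clustering pre-step.

```r
library(recommenderlab)

data(MovieLense)
cv <- evaluationScheme(MovieLense, method = "cross-validation", k = 4,
                       given = 10, goodRating = 4)

# Candidate algorithms: random baseline vs. item- and user-based CF
algos <- list(
  "Random" = list(name = "RANDOM", param = NULL),
  "IBCF"   = list(name = "IBCF", param = list(method = "Cosine", k = 30)),
  "UBCF"   = list(name = "UBCF", param = list(method = "Cosine"))
)

results <- evaluate(cv, algos, type = "topNList", n = c(1, 5, 10, 20))

plot(results, annotate = TRUE)               # ROC curves (TPR vs. FPR)
plot(results, "prec/rec", annotate = TRUE)   # Precision-Recall curves
```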

Conclusion
In this paper, we have defined correlation clusters and proposed two correlation clustering algorithms. These algorithms are similar as well as very distinct: they are similar in that both exploit the power of correlations in data points and use hierarchical clustering to produce high quality clusters; they are distinct in that they are based on entirely different concepts of correlation analysis. The RBACC algorithm uses the mathematical concept of correlation determination to produce clusters based on correlation coefficients, calculated using Spearman's rank correlation coefficient (ρ). Locality assumption graph based correlation clustering (LGBACC) combines the concepts of dimensionality reduction and graph theory to determine correlations among data objects, and further uses hierarchical clustering to produce quality data clusters. Both algorithms act as building blocks for designing a recommendation system model based on correlations: this model uses the correlation clustering algorithms to determine similar data points, and merges this measure of similarity with existing recommendation algorithms. It has been found that recommendations made using correlated data points are better and faster to compute. A parallel and distributed computing environment has been used to perform all of the analysis in this paper: we have used the Spark and R platform to process high-dimensional and large datasets.
Data availability statement (DAS): The data are available in public repositories; the links have been provided in the relevant sections.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.