A new recommendation system using map-reduce-based tournament empowered Whale optimization algorithm

In the era of Web 2.0, data are growing immensely and assisting E-commerce websites in better decision-making. Collaborative filtering, one of the prominent recommendation approaches, makes recommendations by finding similarity among users or items. However, this approach fails to manage large-scale datasets. To mitigate this, an efficient map-reduce-based clustering recommendation system is presented. The proposed method uses a novel variant of the whale optimization algorithm, the tournament selection empowered whale optimization algorithm, to attain the optimal clusters. The clustering efficiency of the proposed method is measured on four large-scale datasets in terms of F-measure and computation time. The experimental results are compared with state-of-the-art map-reduce-based clustering methods, namely map-reduce-based K-means, map-reduce-based bat algorithm, map-reduce-based K-means particle swarm optimization, map-reduce-based artificial bee colony, and map-reduce-based whale optimization algorithm. Furthermore, the proposed method is tested as a recommendation system on the publicly available MovieLens dataset. The performance is measured in terms of mean absolute error, precision, and recall over different numbers of clusters. The experimental results assert that the proposed method is a promising approach for recommendation over large-scale datasets.


Introduction
Among the various web revolutions, the recommendation system is a prominent tool that is widely used by E-commerce websites to offer more personalized services to users. For example, a movie recommendation method suggests a list of movies that a specific user may prefer based on information retrieved from social media or on ratings made by other similar users [1]. Generally, a recommendation system follows one of two approaches, namely content-based filtering and collaborative filtering. In content-based filtering, each item is associated with a certain set of features which are rated differently by different users. This approach predicts the rating of items on the basis of the user's inputs [2,3]. In contrast, collaborative filtering works on the similarity among users or items [4]. The performance of such recommendation systems is highly dependent on the similarity determination. Generally, clustering-based approaches are quite popular in the literature for determining the similarity [5].
Himanshu Mittal (himanshu.mittal224@gmail.com). 1 Malviya National Institute of Technology, Jaipur, India. 2 Jaypee Institute of Information Technology, Noida, India.
K-means, a widely used clustering approach, has been applied in a number of engineering domains for this purpose. However, K-means generates biased clusters due to its dependence on parameter settings and initial cluster centres [6]. To remedy this, meta-heuristic-based solutions have been widely employed over the last two decades to obtain optimal cluster centroids [7][8][9]. Pal et al. [10] introduced a new clustering algorithm using the enhanced bio-geography algorithm. Furthermore, Mittal et al. [11] presented an intelligent gravitational search algorithm-based method to obtain optimal cluster centroids. Sharma et al. [12] introduced an enhanced grey wolf optimization-based method for the optimal clustering of data. Pal et al. [13] presented a genetic algorithm-based energy-efficient weighted clustering method. Recently, a number of researchers have used meta-heuristic-based clustering solutions for recommendation systems. Chen et al. [14] introduced a collaborative filtering-based recommendation method using evolutionary clustering. Malik et al. [15] introduced a particle swarm-based travel recommendation system. Moreover, Peška et al. [16] performed a detailed study of the applicability of meta-heuristic-based methods to collaborative filtering-based recommendation. Kumar et al. [17] introduced an efficient clustering-based model for movie recommendations. Kataria [18] introduced an artificial bee colony-based movie recommendation system. Similarly, Singh et al. [19] introduced a novel movie recommendation system based on efficient clustering of the dataset using a modified cuckoo search method. Suganeshwari et al. [20] performed a survey on clustering-based recommendation systems and concluded that they can be efficiently utilized for recommending products and services, as they find the similarity among user behavior and usage patterns.
Generally, meta-heuristic methods optimize cluster centroids based on the inter-cluster or intra-cluster distances. Unlike K-means, these methods obtain the optimal solution through the collective work of a population, which removes the bias towards initial clusters. Hence, these methods perform better for the clustering problem. Therefore, this paper presents a novel meta-heuristic-based recommendation system for the big data environment.
Meta-heuristic methods refer to the set of algorithms that leverage the concept of guided random search. These methods define a mathematical model that corresponds to certain natural phenomena and have been used in the literature to obtain optimal solutions for different real-world optimization problems [21][22][23][24][25]. Generally, they use a population-based approach to find the optimal solution through information sharing among the individuals. In contrast, single solution-based methods, such as simulated annealing and hill climbing [26], find the solution with a single individual. However, single solution-based algorithms suffer from premature convergence due to the lack of information sharing. Furthermore, the success of a meta-heuristic algorithm largely depends on the way in which exploration and exploitation are performed [27,28]. Exploration controls the diversification of the search agents, whereas exploitation controls the convergence of the individuals. Therefore, each meta-heuristic method tries to balance exploration and exploitation to achieve a precise solution [29]. Generally, these algorithms are inspired by swarm-based or evolution-based phenomena. Mirjalili et al. [30] developed the multi-verse algorithm based on notions from cosmology. Sayed et al. [31] introduced the hybrid SA-MFO algorithm for solving engineering design problems. The genetic algorithm, differential evolution, and bio-geography-based optimization are some popular examples of the evolutionary concept [32]. Furthermore, swarm-based algorithms behave like a swarm of agents to achieve optimal results. Particle swarm optimization (PSO) is one of the meta-heuristics that has been broadly used for solving problems, and several variants of PSO have also been introduced in the literature [33]. Subsequently, Unal et al. [34] presented multi-objective particle swarm optimization, which uses random immigrants. Lie et al. [35] introduced Lévy flight-based ant colony optimization. Moreover, Satapathy [36] presented social group optimization, which mimics the social behavior of humans for solving problems. Furthermore, Tripathi et al. [37] proposed an algorithm inspired by a military dog squad to find the optimal solution. Dragonfly-based optimization is another swarm-based algorithm, introduced by Mirjalili et al. [38].
WOA [39] is a popular algorithm which models the behavior of humpback whales. Mathematically, WOA simulates the hunting behavior of whales to find the optimal solution. It includes two phases, namely the encircling phase and the spiral phase, which correspond to exploration and exploitation, respectively. WOA has surpassed other recent algorithms on the benchmark problems [39]. In the last three years, WOA has been applied across a wide set of application areas, such as data clustering, mining, image processing, and others [40]. Moreover, WOA has been improved by several researchers for solving various real-world problems. Mafarja et al. [41] introduced a hybrid WOA and simulated annealing-based method for feature selection. Aziz et al. [42] combined the moth-flame algorithm with WOA for multi-level image segmentation. Similarly, Aljarah et al. [43] employed WOA for optimizing the connection weights of a neural network. Furthermore, the whale algorithm has also performed competitively in recommendation systems. Karleka et al. [44] introduced a WOA-based clinical risk assessment and recommendation method for treatment. However, a collaborative filtering-based recommendation method involves clustering of data according to user similarity. Moreover, the literature has shown that WOA performs efficiently in clustering-based applications [45]. Therefore, this paper aims at leveraging the strengths of WOA for a collaborative filtering-based recommendation system.
Generally, WOA discards bad solutions during position updates. However, a whale having bad fitness might be nearer to the global optimum [41]. Therefore, WOA suffers from demerits such as the risk of being trapped in local optima [46]. To remedy this, a new variant of WOA, tournament selection empowered WOA (TWOA), is proposed in this paper. The tournament process gives a fair chance to the bad solutions to overcome the local optima during exploitation. Furthermore, the strength of TWOA is utilized for improving the quality of the recommendation system. Although meta-heuristic-based recommendation systems have shown better efficiency than traditional methods, these sequentially executing recommendation systems fail to respond in a reasonable amount of time on large-scale datasets [47]. To alleviate this, TWOA is parallelized using the map-reduce architecture for large-scale datasets and is leveraged to obtain optimal clusters for performing recommendations.
The overall contribution of this paper is two-fold: (1) a new clustering method, map-reduce-based tournament empowered whale optimization algorithm (MR-TWOA), is presented for efficient clustering of large-scale datasets, and (2) a novel variant of WOA, tournament empowered whale optimization algorithm (TWOA), is presented to attain efficient clustering. The clustering efficiency of the proposed map-reduce-based TWOA (MR-TWOA) is tested on four large datasets, namely Replicated Iris, Replicated CMC, Replicated Wine, and Replicated Vowel. The experimental findings are compared with other state-of-the-art map-reduce-based clustering methods, namely map-reduce-based K-means (MR-Kmeans) [7], map-reduce-based bat algorithm (MR-bat) [48], map-reduce-based K-means particle swarm optimization (MR-KPSO) [49], map-reduce-based artificial bee colony (MR-ABC) [50], and map-reduce-based whale optimization (MR-WOA). Furthermore, the applicability of the proposed MR-TWOA-based recommendation system is validated using the MovieLens dataset [51]. The results are compared in terms of three measures, namely mean absolute error (MAE), precision, and recall.
The remainder of the paper is organized as follows. The Clustering and Whale optimization algorithm (WOA) sections briefly review data clustering and WOA. The Proposed method section discusses the proposed recommendation system along with the proposed variant (TWOA) and its parallel version (MR-TWOA). The Experimental results section presents the experimental arrangements and results. Finally, the paper is concluded in the last section.

Clustering
$K_i = (k_{i1}, k_{i2}, \ldots, k_{it})$ is the position vector of the $i$th cluster centroid. Generally, the intra-cluster distance is used as the objective function for clustering; it is defined in terms of the Euclidean distance between $O_i$ and $K_l$, as formulated in Eq. (1), where $O_i$ and $K_l$ represent the $i$th data point and the $l$th cluster centroid, respectively.
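The intra-cluster objective can be illustrated with a minimal Python sketch. It rests on an assumption that is not confirmed by the text: each data point is assigned to its nearest centroid and the Euclidean distances are summed (the paper's Eq. (1) may use a different aggregation, for example squared distances). The function name and the toy data are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def intra_cluster_distance(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Sum, over all data points, of the Euclidean distance to the nearest centroid."""
    total = 0.0
    for o in points:
        # Assumption: a point belongs to the cluster whose centroid is closest.
        total += min(math.dist(o, k) for k in centroids)
    return total


# Hypothetical toy data: four 2-D points and two candidate centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
fitness = intra_cluster_distance(points, centroids)
print(fitness)
```

A lower value indicates tighter clusters, so a meta-heuristic would treat this quantity as a fitness to be minimized.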

Whale optimization algorithm (WOA)
The whale optimization algorithm [39] mimics the hunting behavior of humpback whales. Humpback whales hunt small fish near the surface by generating bubbles in a circular shape. The algorithm works in two phases, namely exploration and exploitation. Furthermore, the exploitation phase is performed through two different strategies, namely shrinking encircling and spiral update. In the shrinking encircling mechanism, the whale moves toward the best whale in a circular manner.

Exploitation phase
To mathematically model the exploitation phase of WOA, the current best solution is represented by the position of the prey, which is assumed to be the solution nearest to the optimum. To exploit the search space, the position of each whale is updated according to the prey, which simulates the encircling behavior. The position of each agent is updated in one of two ways, namely spiral formation and encircling of prey. The encircling of prey is formulated as Eq. (2).
$$P(m+1) = P_b(m) - A \cdot D \tag{2}$$
where $P(m)$ denotes the position of the agent at iteration $m$ and $P_b(m)$ represents the best agent. $A$ represents the coefficient vector, which is given in Eq. (3), while $D$ denotes the distance from the best agent, which is computed as Eq. (4).
$$A = 2a \cdot r - a \tag{3}$$
$$D = |C \cdot P_b(m) - P(m)| \tag{4}$$
$$C = 2 \cdot r \tag{5}$$
where $r \in (0, 1)$ is a randomly generated number, $a$ is a linearly decreasing vector with values from 2 to 0, and $C$ denotes an adjustment factor by which search agents capture the local areas. Furthermore, the spiral formation is mathematically modeled as Eq. (6).
$$P(m+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + P_b(m) \tag{6}$$
where $l$ is a randomly generated number in the range $[-1, 1]$, the constant $b$ defines the spiral shape, and $D'$ represents the distance between the prey and the search agent, as defined in Eq. (7).
$$D' = |P_b(m) - P(m)| \tag{7}$$
The two exploitation strategies of WOA are applied with equal probability using Eq. (8).
$$P(m+1) = \begin{cases} P_b(m) - A \cdot D, & p < 0.5 \\ D' \cdot e^{bl} \cdot \cos(2\pi l) + P_b(m), & p \geq 0.5 \end{cases} \tag{8}$$
where $p$ is a random number in $[0, 1]$.
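The exploitation phase referred to by Eqs. (2) to (8) can be sketched in Python as follows. The sketch follows the standard WOA update rules from [39] and makes two assumptions: the random numbers behind $A$ and $C$ are drawn independently per dimension, and positions are plain lists of floats. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def encircle(position: Sequence[float], best: Sequence[float],
             a: float, rng: random.Random) -> List[float]:
    """Shrinking-encircling move toward the best agent."""
    new_position = []
    for p, pb in zip(position, best):
        A = 2.0 * a * rng.random() - a   # coefficient, lies in [-a, a]
        C = 2.0 * rng.random()           # adjustment factor, lies in [0, 2)
        D = abs(C * pb - p)              # distance from the best agent
        new_position.append(pb - A * D)
    return new_position


def spiral(position: Sequence[float], best: Sequence[float],
           b: float, rng: random.Random) -> List[float]:
    """Spiral move around the best agent; b controls the spiral shape."""
    l = rng.uniform(-1.0, 1.0)
    factor = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(pb - p) * factor + pb for p, pb in zip(position, best)]


def exploit(position: Sequence[float], best: Sequence[float],
            a: float, b: float, rng: random.Random) -> List[float]:
    """Pick one of the two exploitation moves with equal probability."""
    if rng.random() < 0.5:
        return encircle(position, best, a, rng)
    return spiral(position, best, b, rng)


rng = random.Random(0)
print(exploit([1.0, 2.0], [3.0, 4.0], a=1.0, b=1.0, rng=rng))
```

When `a` reaches 0 at the end of the run, `A` becomes 0 and the encircling move places the agent exactly on the best agent, which is why the linearly decreasing `a` drives convergence.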

Exploration phase
To perform the exploration, each whale updates its position either randomly in the search space or using the best search agent, depending on the vector $A$.
For $|A| \geq 1$, the whales perform a random movement, whereas for $|A| < 1$, the whales prefer to search locally in the space. The exploration phase is mathematically modelled as Eqs. (9) and (10) at iteration $(m + 1)$.
$$D = |C \cdot P_{rand} - P(m)| \tag{9}$$
$$P(m+1) = P_{rand} - A \cdot D \tag{10}$$
where $P_{rand}$ denotes any randomly selected whale. Algorithm 1 details the pseudo-code of WOA.
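The exploration move of Eqs. (9) and (10) can be sketched in the same style. This follows the standard WOA formulation in [39]; drawing the random numbers per dimension and the function name `explore` are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def explore(position: Sequence[float], p_rand: Sequence[float],
            a: float, rng: random.Random) -> List[float]:
    """Move relative to a randomly selected whale instead of the best one."""
    new_position = []
    for p, pr in zip(position, p_rand):
        A = 2.0 * a * rng.random() - a   # coefficient, lies in [-a, a]
        C = 2.0 * rng.random()           # adjustment factor, lies in [0, 2)
        D = abs(C * pr - p)              # distance from the random whale
        new_position.append(pr - A * D)
    return new_position


rng = random.Random(42)
population = [[0.0, 0.0], [1.0, 5.0], [4.0, 2.0]]   # hypothetical whales
p_rand = rng.choice(population)                     # uniform random pick, as in plain WOA
moved = explore(population[0], p_rand, a=2.0, rng=rng)
print(moved)
```

The only difference from the encircling move is the reference point: a randomly chosen whale replaces the best agent, which spreads the search agents across the space.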

Proposed method
This section details a novel recommendation system, namely map-reduce-based tournament empowered WOA (MR-TWOA), to deal with large-scale data efficiently. The proposed method performs clustering by leveraging the strengths of the map-reduce architecture with TWOA. The workflow of MR-TWOA is depicted in Fig. 1. First, the user-rated dataset is captured. Then, it is processed through the proposed MR-TWOA to obtain optimal clusters in an efficient manner.
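How a map-reduce job might evaluate the clustering fitness of every whale can be sketched in plain Python. This is only an illustration under assumptions: the mapper and reducer design below (emit one nearest-centroid distance per data point and whale, then sum per whale) is a common pattern for map-reduce-based clustering and is not taken from the paper; the in-process `run_job` driver merely imitates the shuffle step of a real Hadoop job, and all names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Point = Sequence[float]
Whale = Sequence[Point]   # one whale encodes a full set of cluster centroids


def mapper(point: Point, whales: Sequence[Whale]) -> Iterator[Tuple[int, float]]:
    """For one data point, emit (whale_id, distance to that whale's nearest centroid)."""
    for whale_id, centroids in enumerate(whales):
        yield whale_id, min(math.dist(point, k) for k in centroids)


def reducer(whale_id: int, distances: List[float]) -> Tuple[int, float]:
    """Sum all distances emitted for one whale to obtain its fitness."""
    return whale_id, sum(distances)


def run_job(points: Sequence[Point], whales: Sequence[Whale]) -> Dict[int, float]:
    grouped: Dict[int, List[float]] = defaultdict(list)
    for point in points:                      # map phase (spread over nodes in a real cluster)
        for key, value in mapper(point, whales):
            grouped[key].append(value)        # stands in for the shuffle step
    return dict(reducer(k, v) for k, v in grouped.items())


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
whales = [
    [(0.0, 1.0), (10.0, 1.0)],   # whale 0: two well-placed centroids
    [(5.0, 1.0), (5.0, 1.0)],    # whale 1: both centroids in the middle
]
fitness = run_job(points, whales)
print(fitness)
```

Because each data point is processed independently in the map phase, the distance computations can be split across machines, and only the per-whale sums need to be combined.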
Algorithm 1 Whale optimization algorithm (WOA) [39]
Input: Population P_j randomly generated in the search space, j = 1, 2, ..., n
Output: P* (final position of the best whale, i.e., the prey)
Find the fitness of each whale and the position of the prey (P*)
while (it < Iter_max) do
    for each whale in the population do
        Update l, p, A, C, and a
        if (p < 0.5) then
            if (|A| < 1) then
                Redefine the position of the whale using the encircling phase
            else
                Initialize P_rand
                Redefine the position of the whale using the exploration phase
            end if
        else
            Update the position of the whale using the spiral phase
        end if
    end for
    Recompute the fitness of each whale and update P*
    it = it + 1
end while
return P*

Tournament empowered WOA
WOA defines the position of the optimal solution according to the current best whale and a randomly selected whale. The parameter $a$ controls the balance between exploration and exploitation. However, WOA performs exploration using a randomly picked solution, which affects this balance. To mitigate this concern, a novel tournament selection empowered WOA (TWOA) is introduced. Instead of a random solution in the exploration phase, TWOA uses tournament selection [52] to select the $P_{rand}$ solution in Eqs. (9) and (10). This gives a higher chance of selecting good solutions at later stages, which results in faster convergence and better exploitation.
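Tournament selection of the reference whale can be sketched as below. The tournament size `k` and the minimization convention (lower fitness is better, as with an intra-cluster distance) are assumptions, since the text does not state the settings used in TWOA; the function name is hypothetical.

```python
import random
from typing import Sequence


def tournament_select(fitness: Sequence[float], k: int, rng: random.Random) -> int:
    """Draw k distinct whales at random and return the index of the fittest one."""
    if not 1 <= k <= len(fitness):
        raise ValueError("tournament size must be between 1 and the population size")
    candidates = rng.sample(range(len(fitness)), k)
    # Assumption: fitness is minimized, so the smallest value wins the tournament.
    return min(candidates, key=lambda i: fitness[i])


rng = random.Random(7)
population = [[0.0, 0.0], [1.0, 5.0], [4.0, 2.0], [3.0, 3.0]]   # hypothetical whales
fitness = [12.5, 3.2, 7.9, 5.1]                                 # hypothetical fitness values
winner = tournament_select(fitness, k=2, rng=rng)
p_rand = population[winner]   # replaces the uniformly random whale in the exploration move
print(winner, p_rand)
```

With `k = 1` the procedure reduces to the uniform random pick of plain WOA, while larger `k` biases the choice toward fitter whales; every whale can still enter a tournament, so poor solutions are not excluded outright.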

MR-TWOA-based recommendation method
For clustering using a meta-heuristic algorithm, each iteration involves $N \times K \times P$ distance computations, where $N$ denotes the number of data points, $K$ is the number of clusters, and $P$ denotes the population size. Therefore, on large-scale datasets, sequential algorithms fail to respond in terms

Time complexity
The time complexity of the MR-TWOA-based recommendation method is proportional to the number of clusters, the number of data objects, and the number of dimensions in the dataset. In the MR-TWOA-based recommendation method, the optimal centroids are obtained with $O(N \times C \times D \times T)$ operations, where $N$, $C$, $D$, and $T$ denote the total number of data objects, the number of clusters, the number of dimensions in the dataset, and the number of iterations, respectively. Furthermore, for a population size of $P$, the time complexity of the proposed recommendation system can be represented as

Experimental results
The performance of the MR-TWOA method is analyzed in three sections. First, the efficacy of the proposed TWOA is validated on 23 benchmarks which belong to three different categories, namely unimodal, multi-modal, and fixed-dimension multi-modal. Second, the clustering efficiency of the parallel version of TWOA (MR-TWOA) is analyzed on four large-scale datasets. In the third section, the experimental validation of the proposed method (MR-TWOA) as a recommendation system is performed in terms of three measures, namely mean absolute error (MAE), recall, and precision.

Performance of TWOA on benchmark problems
This section details the experimental analysis of the proposed variant (TWOA) on 23 standard benchmark functions. The simulations are conducted on a computer with an Intel Core i3-4570 processor at 3.20 GHz, 4 GB RAM, and a 500 GB hard disk. The results are compared with four recent meta-heuristic methods, namely whale optimization algorithm (WOA) [39], improved cuckoo search (ICS) [53], enhanced grey-wolf optimizer (EGWO) [12], and salp-swarm algorithm (SSA) [54]. Since WOA has already shown superior performance in the literature over popular meta-heuristic methods such as grey wolf optimizer [55], particle swarm optimization (PSO) [56], dragonfly algorithm [38], and differential evolution [57], the comparison includes only recently proposed meta-heuristic methods. Tables 1, 2, and 3 detail the considered 23 benchmark functions, which are grouped into three categories, namely unimodal, multi-modal, and fixed-dimension multi-modal functions, respectively. Generally, unimodal functions describe the exploitation ability of the considered method, while multi-modal functions validate the exploration ability of the method. Furthermore, each method is executed 30 times for each benchmark function. The best fitness values obtained over these runs are summarized by their mean and standard deviation. The parameter settings of each meta-heuristic method are given in Table 4. These values were fixed according to the related literature to make a fair comparison between the selected meta-heuristics [12,39,53,54]. Moreover, the population size and the number of iterations for all algorithms are kept as 30 and 500, respectively. Table 5 tabulates the average fitness value on the different benchmark functions obtained by the considered meta-heuristic methods, along with the standard deviation. It is evident from the table that TWOA outperforms the other compared methods on four unimodal functions, i.e., $F_1$, $F_2$, $F_5$, and $F_7$. For $F_3$ and $F_4$, ICS has shown competitive results, while SSA performed well on $F_6$. Thus, it may be stated that TWOA has superior local search ability. Moreover, TWOA has surpassed the other methods on more than 80% of the multi-modal functions. This indicates that TWOA is robust against trapping in local optima. The superiority of TWOA is due to the inclusion of the tournament selection process, which results in a better trade-off between exploration and exploitation. Additionally, the poor solutions also get a fair chance in the early phase of the algorithm, which prevents the algorithm from converging prematurely.

Furthermore, to analyze the exploration and exploitation behavior, the convergence trends of the proposed and considered methods on two representative benchmark functions, namely $F_1$ and $F_8$, are depicted in Fig. 3. In the figure, the horizontal axis represents the iteration count, and the vertical axis denotes the best fitness value. It can be seen from the convergence curves that TWOA smoothly reaches the optimal solution. This shows that the proposed method has a better ability to attain an optimal solution. Therefore, the experimental analysis validates that TWOA is an efficient method that can be leveraged for clustering large-scale datasets.

Table 5 Mean and standard deviation of the fitness value over 30 runs

Table 7 presents the F-measure (Fm) and computation time (CT) of the considered methods in terms of the mean value obtained over 30 runs on a cluster of 5 computers. It is visible from the table that MR-TWOA has outperformed the compared methods on all datasets. The MR-Kmeans algorithm recorded the poorest performance among all the considered methods. However, it is competitive in terms of computation time since it follows a single solution-based approach. Moreover, the parallel computation efficacy of MR-TWOA is validated in terms of speedup, which is computed according to Eq. (11).
$$\text{Speedup} = \frac{T_{base}}{T_N} \tag{11}$$
where $T_{base}$ represents the computation time taken by a method to run on a single machine, and $T_N$ refers to the time taken by the same method to run on $N$ machines.
To study the speedup efficiency of MR-TWOA, two large-scale datasets are considered, namely Replicated Iris and Replicated CMC. Figure 4a and b represent the speedup graphs of MR-TWOA for the Replicated Iris and Replicated CMC datasets, respectively. In the speedup graph, the Y axis corresponds to the computation time, while the X axis corresponds to the number of machines (or nodes) in the cluster. From the figures, it is observable that the speedup of MR-TWOA on the Replicated Iris dataset is 2.7548 when there are five nodes in the cluster. The speedup of MR-TWOA on the Replicated CMC dataset is 2.1561 when there are five nodes in the cluster. This indicates that MR-TWOA is an efficient method and can be used for clustering large-scale datasets.
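The speedup measure of Eq. (11) is a simple ratio of running times, as the short Python sketch below shows. The timings are hypothetical values chosen for illustration and are not measurements from the paper.

```python
def speedup(t_base: float, t_n: float) -> float:
    """Ratio of the single-machine running time to the running time on N machines."""
    if t_n <= 0:
        raise ValueError("running time on N machines must be positive")
    return t_base / t_n


# Hypothetical timings in seconds: one machine versus a five-node cluster.
s = speedup(t_base=100.0, t_n=40.0)
print(s)
```

A value of 1 means that adding machines brought no benefit, while a value equal to the number of nodes would correspond to ideal linear scaling.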

Analysis of MR-TWOA as recommender system
This section analyzes the applicability of the proposed MR-TWOA for recommendation. For this purpose, the MovieLens dataset [51] is considered, which is a publicly available dataset consisting of 1000 user-reviews on 1700 movies. It contains 100,000 data points, where each data point corresponds to a user rating for a movie. Furthermore, this dataset is replicated 1000 times to make it suitable for the Hadoop architecture. To compare the efficacy of MR-TWOA with the considered map-reduce-based clustering methods, three performance measures, namely mean absolute error (MAE), precision, and recall, are considered over different numbers of clusters. From Table 8 and Figs. 5, 6, and 7, it is visible that MR-TWOA has reported the lowest MAE value among WOA, Bat, ABC, and PSO for all cluster counts, whereas WOA attained the second-lowest MAE for all cluster counts. Furthermore, it can also be observed that MR-TWOA has clearly outperformed all the methods in terms of precision. Again, WOA performed as the second-best method in terms of precision for all cluster counts. It can also be inferred that MR-TWOA attains the maximum recall among all the considered methods on all the cluster sets except 10 and 15, where MR-Bat and MR-ABC have given competitive results, respectively. Furthermore, WOA has given the second-best result when the number of clusters is set to 15, 20, 25, 30, and 40, while MR-Bat and MR-ABC performed second best on the 5 and 10 cluster sets, respectively. Therefore, the experimental results affirm that MR-TWOA is scalable and robust for data clustering. Moreover, it can be leveraged as a powerful alternative for recommendation systems over large-scale datasets.
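The three evaluation measures can be sketched in Python as follows. The definitions are the common ones (MAE over rating pairs; precision and recall over sets of recommended and relevant items) and are assumed here, since the text does not spell out the exact formulas or the relevance threshold; the ratings and item identifiers are hypothetical.

```python
from typing import Sequence, Set, Tuple


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall(recommended: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision: share of recommended items that are relevant.
    Recall: share of relevant items that were recommended."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ratings on a 1-5 scale and hypothetical movie identifiers.
mae = mean_absolute_error([4.0, 3.0, 5.0, 2.0], [3.5, 3.0, 4.0, 3.0])
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(mae, precision, recall)
```

Lower MAE is better, whereas higher precision and recall are better, which is why the comparison reports the lowest MAE and the highest precision and recall as the preferred outcomes.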

Conclusion
In this paper, a novel recommendation method, MR-TWOA, is introduced for handling large datasets. The proposed method performs clustering through a novel variant of WOA, termed tournament empowered WOA (TWOA). The performance of TWOA is tested on 23 unimodal and multi-modal benchmark functions in terms of the mean and standard deviation of the fitness value. The results are compared against four recent meta-heuristic methods, namely WOA, ICS, EGWO, and SSA. The experimental results show the superiority of the proposed method over the considered methods on the majority of the benchmark functions, which validates the ability of TWOA to avoid local optima. Furthermore, the clustering accuracy of the proposed MR-TWOA is tested on four massive datasets in terms of F-measure and computation time. The performance is compared with five recent map-reduce algorithms, namely MR-Kmeans, MR-KPSO, MR-ABC, MR-Bat, and MR-WOA. The proposed MR-TWOA outperformed the compared methods on all the datasets, which shows the superior clustering efficiency of the proposed method. Additionally, the performance of MR-TWOA is studied for the parallel environment in terms of speedup efficiency. To do so, MR-TWOA runs on a cluster with 5 machines for four massive datasets. The experimental results of the proposed MR-TWOA surpassed the other state-of-the-art meta-heuristic-based methods. Furthermore, the recommendation ability of MR-TWOA is validated on the MovieLens dataset in terms of MAE, precision, and recall. The simulation results confirm that MR-TWOA outperformed the other considered methods in product recommendation, along with the ability to handle massive datasets.
In the future, MR-TWOA can be used to solve other real-world problems pertaining to big datasets. The proposed TWOA incorporates tournament selection to choose better solutions rather than random solutions. Since tournament selection sometimes fails to select the best solutions [58], it may limit the exploration ability of the proposed TWOA, which can be improved by examining other selection methods. Furthermore, other frameworks such as Spark may be used to reduce the computation cost of the proposed method.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Data clustering is an unsupervised machine learning approach that iteratively groups a set of $n$ data points into $p$ clusters. Unlike supervised approaches, it does not need any prior training phase. Let $O = \{\{o_{11}, o_{12}, \ldots, o_{1t}\}, \{o_{21}, o_{22}, \ldots, o_{2t}\}, \ldots, \{o_{n1}, o_{n2}, \ldots, o_{nt}\}\}$ be a set of $n$ data points having $t$ features, where $o_{ij}$ denotes the $j$th attribute value of the $i$th data point. The clustering works iteratively to find a set of cluster centroids denoted as $K = \{\{k_{11}, k_{12}, \ldots, k_{1t}\}, \{k_{21}, k_{22}, \ldots, k_{2t}\}, \ldots, \{k_{p1}, k_{p2}, \ldots, k_{pt}\}\}$, where $k_{ij}$ corresponds to the value of the $j$th attribute of the $i$th cluster centroid.

Fig. 1 The proposed map-reduce-based tournament empowered WOA for recommendation

Fig. 3 Convergence trend of TWOA with other considered meta-heuristics

Fig. 4 Computation time analysis of MR-TWOA with other considered meta-heuristics

Fig. 5 Mean absolute error of MR-TWOA and other considered methods

Table 1 Description of unimodal benchmark functions

Table 2 Description of multi-modal benchmark functions

Table 3 Description of fixed-dimension multi-modal benchmark functions

Table 6 Description of the considered large datasets

Table 8 depicts the MAE, precision, and recall of the considered methods. For the visual interpretation of Table 8, Figs. 5, 6, and 7 depict the bar charts corresponding to mean absolute error, precision, and recall, respectively. The X axis in the figures corresponds to

Table 7 Computation time (CT) and F-measure (Fm) for 30 runs of the MR-TWOA and other methods

Table 8 Comparative