1 Introduction

Recently, Big Data has gained considerable interest across the wide range of applications that generate it, e.g., social networking profiles, health care services, MapReduce scientific experiments, cloud applications, e-government services, and transportation [1]. Processing massive volumes of data in parallel across multiple nodes is possible using the MapReduce programming paradigm. MapReduce is an analytics framework for large-scale, complicated data analysis and a Hadoop framework for processing large volumes of data across multiple clusters. It is also a programming model that allows large datasets to be processed over numerous computer clusters, and it enables distributed data storage. This is achieved by splitting petabytes of data into smaller parts and processing them on commodity Hadoop servers; the framework then consolidates the results from many servers and returns them to the application. Data are generated rapidly, and applications produce increasing amounts of structured and unstructured data, consisting of several variables, that should be analyzed in a shorter period of time. The National Institute of Standards and Technology has specified that Big Data has four common features (4Vs): veracity, volume, variety, and velocity [2]. Veracity denotes a measure of the understandability and quality of the data. Volume represents the data size, which can be too large to be handled by the existing generation of techniques or systems. Variety is the most fascinating of the four Vs, since it covers data of different kinds, namely audio, video, text, and images, for a given object. Velocity refers to data that streams at a faster speed than traditional algorithms and systems can handle [3]. Data mining (DM) and data analysis are difficult processes since the quantity of data is significant, and this data can be polluted with noise and might be stored by different processes. Such data are characterized by the Vs of Big Data. Figure 1 illustrates the types of Vs involved in big data.

Fig. 1 Six Vs of Big data

A major research problem is data analytics, which is implemented on the basis of DM and machine learning (ML) methods [4]. Usually, big data mining (BDM) has difficulty with DM software tools and presentation techniques because the information is complex and large. Executing a DM method over large-scale data sets on a single Personal Computer (PC) incurs a high computational cost. Therefore, it is important to utilize an efficient computing environment for big data processing and analysis [5]. Big data increases the demand for smart data analytic models such as automatic classification, image processing, data fusion, and multi-temporal processing. Parallel processing is a computing technique that involves running two or more processors (CPUs) simultaneously to perform separate pieces of a larger operation; it is widely used by data scientists to conduct complex, compute- and data-intensive activities in parallel. Parallelization methods are designed to scale with the available data by increasing the computation significantly. To manage problems involving large-scale data sets, Google presented the MapReduce architecture [6]. The MapReduce approach, along with a distributed file system (DFS), offers a robust and simple environment to handle large-scale data processing. In DM, this approach is currently preferred over other parallelization methods, such as the Message Passing Interface (MPI), because of its fault tolerance, which is needed for tasks that consume a significant amount of time [7]. In general, the MapReduce architecture is implemented by an effective parallel programming framework named Hadoop [8]. The MapReduce technique involves map and reduce functions: the map function is utilized for sorting and filtering, whereas the reduce function performs a summary process to generate the result. Several research methods have been presented for BDM tasks such as instance selection, attribute reduction, and class imbalance handling. Therefore, by using the MapReduce technique rather than a traditional distributed approach, BDM is efficiently implemented by several computer nodes or processors that simultaneously perform the task [9]. In this study, Decision Trees (DT), ML methods, and optimization algorithms are utilized for classifying big data.

This study introduces an efficient Computational Intelligence based Clustering with Classification Model for Big Data Visualization on the Map Reduce Environment, named the CICC-BDVMR technique. The proposed CICC-BDVMR technique involves the design of the grasshopper optimization algorithm (GOA) with the kernelized fuzzy c-means (KFCM) technique to group the big data, where the GOA is applied to effectively compute the initial cluster centers of the KFCM technique. KFCM, the kernel fuzzy c-means clustering algorithm, derives from the fuzzy c-means clustering approach (FCM) and allows for more accurate clustering than the standard FCM technique. Moreover, the big data classification process takes place using Ridge Regression (RR), and the parameter optimization of the RR model is carried out via the Red Colobuses Monkey (RCM) algorithm. To demonstrate the enhanced performance of the CICC-BDVMR technique, a comprehensive result analysis is made using benchmark datasets. The design of the GOA and RCM algorithms for the parameter optimization processes of big data categorization demonstrates the study's uniqueness.

2 Literature review

Abukhodair et al. [10] developed a metaheuristic optimization based big data classification in MapReduce (MOBDC-MR) architecture. The presented method focuses on selecting optimum features and efficiently categorizing big data. Additionally, it involves a BPOA based feature selection (FS) method for increasing accuracy and reducing complexity, and beetle antenna search (BAS) with LSTM is applied for classifying big data. Brahmane and Krishna [11] introduced an approach to handle big data with the Spark architecture. The presented method undergoes two stages, feature selection and classification, which are implemented in the primary nodes of the Spark framework. The presented optimization method, called the rider chaotic biography optimization (RCBO) method, is a combination of chaotic biogeography-based optimization (CBBO) and the rider optimization algorithm (ROA). The presented RCBO-DSAE method with the Spark architecture efficiently handles big data to attain efficient big data classification.

Qin et al. [12] presented the DEEPEYE method for addressing this challenge. The scheme resolves the problem by training a binary classifier to decide whether a certain visualization is effective for a provided dataset, and by utilizing a supervised learning-to-rank method for ranking the aforementioned visualizations. It also considers common visualization operations, namely binning and grouping, which can manipulate the data, and describes the search space.

Galletta et al. [13] proposed a graphical tool for the visualization of healthcare data, which can be used to monitor the health condition of a person remotely. The tool is easy to use and assists medical doctors in quickly understanding the existing condition of a patient by observing a coloured circle. Cui et al. [14] proposed a Big Data Visualization enabled Multi-modal Feedback Framework (BDVMFF) for boosting motivation, student confidence, and self-consciousness in the online learning environment. The presented method provides the teacher with a digital task to efficiently exchange input and writing to employ multi-modal feedback. These systems provide students and teachers with straightforward and effective digital learning platforms.

Lakshmanaprabu et al. [15] developed big data analytics on IoT based medical systems with MapReduce and a Random Forest Classifier (RFC). The e-health information gathered from patients affected by various diseases is taken into account for analysis. The optimum attributes are selected by an Improved DA (IDA) from the databases for effective classification. At last, the RFC method is utilized for classifying the e-health data using the optimum features. Dubey et al. [16] proposed an effective ACO and PSO-based architecture for data preprocessing and classification in big data. It was shown that the content part is fetched and collaborated for analyzing the integration of velocity and volume. Next, weight marking is performed by the variety and volume of the data. At last, ranking is performed by the variety and velocity features of big data.

3 The proposed model

In this study, a novel CICC-BDVMR technique has been developed for accomplishing effectual BDV through clustering and data classification on the Map Reduce environment. The proposed CICC-BDVMR technique encompasses several subprocesses, namely KFCM based clustering, GOA based initial cluster center selection, RR based classification, and RCM based parameter tuning. Figure 2 illustrates the overall process of the CICC-BDVMR technique.

Fig. 2 Block diagram of CICC-BDVMR technique

3.1 Map reduce

The MR method is applied for parallel and distributed processing of massive amounts of unstructured and structured information, which in Hadoop is generally stored in HDFS across a large cluster of computers [17]. Therefore, scaling in small steps is feasible (scale-out). The architecture consists of (a) reduce, an aggregation/consolidation stage in which all related records are processed as a single entity, and (b) map, a key transformation and recording stage in which each individual input record is processed simultaneously. Best practices include correctly configuring the cluster using the appropriate diagnostic tools, utilizing compression when writing intermediate data to disk, adjusting the number of map and reduce tasks in accordance with the aforementioned recommendations, and incorporating a Combiner whenever possible. MapReduce passes each element of the input data to the mapper during the mapping phase. The reducer processes all of the mapper's outputs and arrives at the result during the reducing step. Simply put, the mapper's job is to filter and change the input into something that the reducer can aggregate over. Two concepts are interrelated here: map tasks and logical blocks. The key idea is that the input data is separated into logical blocks, and each block is processed by a map task. Map tasks run on cluster nodes in parallel and are responsible for transforming the input records into key/value pairs. The results from each block are divided into distinct sets and then sorted, and all the sorted partitions are transported to the reduce tasks (RT), with a separate partition for each RT.
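As a concrete illustration of this flow, the following is a minimal single-machine sketch of the map, shuffle/sort, and reduce phases just described; the word-count task, the sample records, and the function names are illustrative assumptions, not part of the Hadoop API:

```python
from collections import defaultdict

def map_fn(record):
    """Map phase: transform one input record into (key, value) pairs."""
    return [(word, 1) for word in record.split()]

def shuffle(mapped):
    """Shuffle/sort phase: group all map outputs by key, then sort."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_fn(key, values):
    """Reduce phase: aggregate/consolidate all values sharing a key."""
    return key, sum(values)

# Illustrative records; in Hadoop these would come from HDFS blocks,
# with one map task per logical block running in parallel.
records = ["big data on hadoop", "map and reduce big data"]
mapped = [map_fn(r) for r in records]
print([reduce_fn(k, v) for k, v in shuffle(mapped).items()])
# [('and', 1), ('big', 2), ('data', 2), ('hadoop', 1), ('map', 1), ('on', 1), ('reduce', 1)]
```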

3.2 Design of GOA-KFCM based clustering technique

In recent years, the kernel method [18] has been one of the most researched subjects within the ML community and has been extensively employed in function approximation and pattern recognition. The key motivations for utilizing the kernel method are that it: (1) enhances the robustness of the original clustering algorithm to outliers and noise; (2) induces a class of robust non-Euclidean distance measures on the original data space for deriving the objective function, thereby clustering non-Euclidean structure in the data; and (3) still retains computational simplicity. This is realized by replacing the Euclidean distance in the objective function of the traditional FCM method with a kernel-induced distance; the resulting process is known as the kernelized FCM (KFCM) model, which is more powerful than FCM. The fuzzy memberships are subject to the constraint:

$$ \sum \limits _{i=1}^{C}{u}_{ij}=1,\forall j=1\dots N$$
(1)

In this study, the kernel function \(K(x, c)\) is considered as a Gaussian radial basis function (GRBF):

$$ K\left( {x, c} \right) = \exp \left( {\frac{{ - \left\| {x - c} \right\|^{2} }}{{\sigma^{2} }}} \right), $$
(2)

where \(\sigma \) represents an adjustable kernel width. The KFCM objective function is:

$$ J_{m} = 2\mathop \sum \limits_{i = 1}^{C} \mathop \sum \limits_{j = 1}^{N} u_{ij}^{m} \left[ {1 - K\left( {x_{j} , c_{i} } \right)} \right]. $$
(3)
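The factor \(1-K(x_{j},c_{i})\) in Eq. (3) can be read as a kernel-induced distance. Writing \(\Phi \) for the implicit feature map of the GRBF, for which \(K(x,x)=1\), the squared feature-space distance expands as

$$\left\| \Phi (x_{j})-\Phi (c_{i})\right\| ^{2}=K(x_{j},x_{j})-2K(x_{j},c_{i})+K(c_{i},c_{i})=2\left(1-K(x_{j},c_{i})\right),$$

which is also where the factor of 2 in \(J_{m}\) originates.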

The fuzzy membership matrix \(u\) is attained by:

$${u}_{ij}=\frac{{\left(1-K({x}_{j},{c}_{i})\right)}^{-1/(m-1)}}{ \sum \limits_{k=1}^{C}{\left(1-K({x}_{j},{c}_{k})\right)}^{-1/(m-1)}}.$$
(4)

The cluster center \({c}_{i}\) is attained by:

$${c}_{i}=\frac{ \sum\limits_{j=1}^{N}{u}_{ij}^{m}K({x}_{j},{c}_{i})\,{x}_{j}}{\sum\limits_{j=1}^{N}{u}_{ij}^{m}K({x}_{j},{c}_{i})}$$
(5)

As the \(K\)‐means model focuses on minimizing the sum of squared distances from each point to its cluster center, it leads to compact clusters. An intra‐cluster distance measure is therefore employed, i.e., the median distance between points and their cluster center [19]:

$$intra =\mathrm{median}\left(\sum\limits_{i=1}^{C}\sum\limits_{x\in {c}_{i}}\| x-{c}_{i}{\| }^{2}\right)$$
(6)

Thus, the clustering that yields the minimal value of this validity measure indicates the ideal number of clusters. Accordingly, the number of clusters is known before the membership matrix is evaluated.
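The KFCM iteration of Eqs. (1)-(5) can be sketched compactly in NumPy as follows; the initial centers are assumed to be supplied externally (in this work, by the GOA), and the values of sigma, m, and the convergence tolerance are illustrative defaults rather than those of the original study:

```python
import numpy as np

def grbf(X, C, sigma=1.0):
    """Gaussian RBF kernel K(x, c) = exp(-||x - c||^2 / sigma^2), Eq. (2)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # shape (N, C)
    return np.exp(-d2 / sigma ** 2)

def kfcm(X, centers, m=2.0, iters=100, tol=1e-6, sigma=1.0):
    """Alternate the membership update (Eq. 4) and center update (Eq. 5)."""
    for _ in range(iters):
        K = grbf(X, centers, sigma)
        dist = np.clip(1.0 - K, 1e-12, None)       # kernel-induced distance
        u = dist ** (-1.0 / (m - 1))
        u /= u.sum(axis=1, keepdims=True)          # rows sum to 1, Eq. (1)
        w = (u ** m) * K                           # weights u_ij^m K(x_j, c_i)
        new_centers = (w.T @ X) / w.sum(axis=0)[:, None]   # Eq. (5)
        if np.linalg.norm(new_centers - centers) < tol:
            return u, new_centers
        centers = new_centers
    return u, centers
```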

For determining the initial cluster centers of the KFCM technique, the GOA is utilized. The GOA is an evolutionary model inspired by the behavior of a swarm of grasshoppers searching for food. Typically, these insects are of a destructive nature and cause harm to agricultural produce and harvest production [20]. A grasshopper grows through the egg, nymph, and adult stages. The swarm behavior can be mathematically modelled by the following equation for resolving different optimization issues.

$${Y}_{i}^{d}=cx\left\{\sum\limits_{\begin{array}{c}j=1\\ j\ne i\end{array}}^{n}cx\,\frac{u{l}_{d}-l{l}_{d}}{2}\,sf\left(\left|{Y}_{j}^{d}-{Y}_{i}^{d}\right|\right)\frac{{Y}_{j}-{Y}_{i}}{{D}_{ij}}\right\}+\widehat{{T}_{d}}$$
(7)

Here, \({Y}_{j}\) and \({Y}_{i}\) represent the locations of the jth and ith grasshoppers, and \({Y}_{j}^{d}\) and \({Y}_{i}^{d}\) are the corresponding locations in the \(d\)th dimension. The number of grasshoppers is \(n\), while \(sf\) denotes the social interaction function and \({D}_{ij}\) the distance between the jth and ith grasshoppers. \(\widehat{{T}_{d}}\) indicates the value of the target in the \(d\)th dimension, while \(u{l}_{d}\) and \(l{l}_{d}\) denote the upper and lower limits in the \(d\)th dimension. The coefficient \(cx\) reduces the comfort zone in proportion to the number of iterations: the outer \(cx\) value balances the exploitation and exploration of the grasshopper swarm near the optimal global solution, while the inner \(cx\) value reduces the repulsion zone, comfort zone, and attraction among the grasshoppers [21]. The adaptive coefficient \(cz\) used for this reduction decreases with the iterations as follows

$$cz=c{z}_{\mathrm{ max}}-t\,\frac{c{z}_{\mathrm{ max}}-c{z}_{\mathrm{ min}}}{{t}_{\mathrm{ max}}}$$
(8)

in which \(c{z}_{\mathrm{ max}}\) is the maximal value, \(c{z}_{\mathrm{ min}}\) is the minimal value, \(t\) represents the current iteration, and \({t}_{\mathrm{ max}}\) indicates the maximal number of iterations.

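A minimal sketch of how the GOA of Eqs. (7)-(8) could seed the KFCM centers is given below; the social-force constants f and l, the population size, and the fitness callback (e.g., the intra-cluster measure of Eq. (6) after decoding an agent into C centers) are illustrative assumptions:

```python
import numpy as np

def sf(r, f=0.5, l=1.5):
    """Social interaction s(r) = f*exp(-r/l) - exp(-r) (standard GOA form)."""
    return f * np.exp(-r / l) - np.exp(-r)

def goa_init_centers(fitness, n_agents, dim, ll, ul, t_max=100,
                     cz_max=1.0, cz_min=1e-4, seed=0):
    """Sketch of Eqs. (7)-(8); each agent encodes candidate KFCM centers."""
    rng = np.random.default_rng(seed)
    Y = rng.uniform(ll, ul, size=(n_agents, dim))
    scores = np.array([fitness(y) for y in Y])
    best = scores.argmin()
    target, best_score = Y[best].copy(), scores[best]
    for t in range(1, t_max + 1):
        cz = cz_max - t * (cz_max - cz_min) / t_max        # Eq. (8)
        Y_new = np.empty_like(Y)
        for i in range(n_agents):
            total = np.zeros(dim)
            for j in range(n_agents):
                if j == i:
                    continue
                D = np.linalg.norm(Y[j] - Y[i]) + 1e-12    # distance D_ij
                total += cz * (ul - ll) / 2 * sf(np.abs(Y[j] - Y[i])) \
                         * (Y[j] - Y[i]) / D
            Y_new[i] = np.clip(cz * total + target, ll, ul)  # Eq. (7)
        Y = Y_new
        scores = np.array([fitness(y) for y in Y])
        if scores.min() < best_score:
            best = scores.argmin()
            target, best_score = Y[best].copy(), scores[best]
    return target  # flattened initial cluster centers for KFCM
```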

3.3 Design of RCM-RR based classification technique

Once the big data is clustered into different groups based on the class labels existing in it, the next stage is to perform the classification process using the RR technique. The RR [22] is an SLFN system in which the weights between the input and hidden layers are selected in an arbitrary way. RR is free from iterative computation, which makes it very fast by considerably minimizing the computational time needed for training the SLFN. The SLFN frequently needs a large number of hidden nodes to create optimum solutions. The output function of an SLFN with \(L\) hidden nodes is determined as follows:

$${f}_{L}\left(x\right)=\sum \limits_{i=1}^{L}{\beta }_{i}{g}_{i}\left(x\right)=\sum \limits_{i=1}^{L}{\beta }_{i}G\left({a}_{i}, {b}_{i},x\right),\quad x\in {R}^{d},{\beta }_{i}\in {R}^{m}$$
(9)

For additive nodes with activation function \(g\), \({g}_{i}\) is determined by

$${g}_{i}=G\left({a}_{i}, {b}_{i},x\right)=g\left({a}_{i}x+{b}_{i}\right),{a}_{i}\in {R}^{d},{b}_{i}\in R$$
(10)
$${\sum }_{i=1}^{L}{\beta }_{i}G\left({a}_{i}, {b}_{i},{x}_{j}\right)={t}_{j},\quad j=1,\dots ,N$$
(11)

The above equations can be written compactly as

$$H\beta =T$$
(12)

where

$$H({w}_{1}\cdots {w}_{L}, {b}_{1}\cdots {b}_{L},{x}_{1}\cdots {x}_{N})$$
$$=\left(\begin{array}{ccc}g({w}_{1}\cdot {x}_{1}+{b}_{1})& \cdots & g({w}_{L}\cdot {x}_{1}+{b}_{L})\\ \vdots & \ddots & \vdots \\ g({w}_{1}\cdot {x}_{N}+{b}_{1})& \cdots & g({w}_{L}\cdot {x}_{N}+{b}_{L})\end{array}\right)$$
(13)
$$\beta =\left[\begin{array}{l}{\beta }_{1}^{T}\\ \vdots \\ {\beta }_{L}^{T}\end{array}\right]$$
(14)
$$T=\left[\begin{array}{l}{t}_{1}^{T}\\ \vdots \\ {t}_{N}^{T}\end{array}\right]$$
(15)

\(H\) signifies the hidden neuron output matrix of the NN system. The SLFN is trained by solving a linear least-squares problem as follows:

$$\left|\left|H\widehat{\beta }-T\right|\right|={\mathrm{min}}_{\beta }\left|\left|H\beta -T\right|\right|$$
(16)

Now \(\widehat{\beta }\) is represented by

$$ \hat{\beta } = H^{\dag } T = (H^{T} H)^{ - 1} H^{T} T $$
(17)

is the minimum norm least squares solution of \(H\beta =T\), where \({H}^{\dag }\) characterizes the Moore-Penrose generalized inverse of \(H\) [23].

The process of RR is described as follows.

Step 1 Arbitrarily select the input weights \({w}_{i}\) and hidden layer biases \({b}_{i}\).

Step 2 Evaluate the hidden neuron output matrix \(H.\)

Step 3 Attain the output weight \(\widehat{\beta }\) by utilizing \(\hat{\beta } = H^{\dag } T\)
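A minimal NumPy sketch of Steps 1-3 is given below, assuming a sigmoid activation \(g\) and adding a ridge term \(\lambda I\) to Eq. (17), as is standard for ridge-regularized solutions; the values of \(L\) and \(\lambda \) are illustrative, \(\lambda \) being the kind of parameter tuned by the RCM algorithm in this work:

```python
import numpy as np

def train_rr(X, T, L=50, lam=1e-3, seed=0):
    """Non-iterative SLFN training following Steps 1-3."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((L, X.shape[1]))   # Step 1: random input weights w_i
    b = rng.standard_normal(L)                 # Step 1: random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # Step 2: hidden output matrix H, Eq. (13)
    # Step 3: output weights, cf. Eq. (17); lam > 0 adds the ridge term.
    beta = np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ T)
    return W, b, beta

def predict_rr(X, W, b, beta):
    """Evaluate f_L(x) = H(x) beta, Eq. (9), for new samples."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```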

For properly tuning the parameters involved in the RR technique, the RCM algorithm is utilized, thereby achieving improved classification outcomes. The RCM approach simulates red monkey behavior. To model this interaction, every group of monkeys must maneuver through the search region [24]. Owing to the territorial nature of Cercopithecus mitis, young males must leave the family quickly and effectively, since they enter challenges with dominant males from other families; there is otherwise no specific interaction among the young and male Cercopithecus mitis. Once a young male defeats the dominant male, he becomes the leader of the family and provides food supplies, a place to live, and socialization for the young males. The location update of each red monkey in a group depends on the location of the optimal red monkey of the group, as delineated by the following equations:

$$P{B}_{i+1}=\left(P{A}_{i}\times P{B}_{i}\right)+\left({W}_{leader}-{W}_{i}\right)\times rand\times \left({X}_{best}-{X}_{i}\right)$$
(18)
$${X}_{i+1}={X}_{i}+P{B}_{i+1}$$
(19)
$$P{A}_{i+1}=P{A}_{i}\times rand$$
(20)

where

  • \(PA\) signifies the monkey combat power (an arbitrarily selected value in the range \([\mathrm{0,1}]\));

  • \(PB\) denotes the monkey body power (an arbitrarily selected value in the range \([-\mathrm{5,5}]\));

  • \({W}_{i}\) characterizes the monkey weight (an arbitrarily selected value in [4, 6]);

  • \({W}_{leader}\) indicates the leader weight;

  • \({X}_{best}\) refers to the location of the leader;

  • \(X\) denotes the location of the red monkey;

  • \(rand\) is a random number in the range \([\mathrm{0,1}]\).

To update the locations of the red monkey children, the following equations are used:

$$PBc{h}_{i+1}=(P{A}_{i}\times PBc{h}_{i})+(Wc{h}_{leader}-Wc{h}_{i})\times rand\times \left(Xc{h}_{best}-Xc{h}_{i}\right)$$
(21)
$$Xc{h}_{i+1}=Xc{h}_{i}+PBc{h}_{i+1}$$
(22)
$$PAc{h}_{i+1}=PAc{h}_{i}\times rand$$
(23)

where \(PAch\) indicates the child combat power, \(PBch\) denotes the power rate of the child body, and \(Wc{h}_{i}\) represents the child weight, each weight being an arbitrary number within [4, 6]. It is noteworthy that each parameter of RCM depends on the nature of the problem to be resolved or is set by experiment. RCM has few parameters, which makes it easier to execute; RCM also balances between the exploration and exploitation stages, which makes it applicable to resolving optimization problems.

The RCM approach derives a fitness function (FF) for attaining enhanced classification performance. It defines a positive value representing the efficiency of the candidate solutions. In this analysis, the minimization of the classification error rate is regarded as the FF, as offered in Eq. (24). An optimum solution has a lower error rate and the worst solution attains a higher error rate.

$$Fitness\left({x}_{i}\right)=Classifier\,Error\,Rate\left({x}_{i}\right)=\frac{number\,of\,misclassified\,instances}{Total\,number\,of\,instances}\times 100$$
(24)
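A minimal sketch of the RCM update loop of Eqs. (18)-(20), driven by the fitness of Eq. (24), is shown below; the population size, iteration budget, and the error_rate callback (e.g., evaluating the RR classifier for a candidate parameter vector) are illustrative assumptions:

```python
import numpy as np

def rcm_tune(error_rate, dim, n_monkeys=10, t_max=50, seed=0):
    """Sketch of RCM parameter tuning minimizing the error rate of Eq. (24).

    error_rate(x) is an assumed callback returning the classifier error
    rate (%) for a candidate parameter vector x (e.g. the RR ridge term).
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_monkeys, dim))   # monkey locations
    PA = rng.uniform(0.0, 1.0, size=n_monkeys)         # combat power in [0, 1]
    PB = rng.uniform(-5.0, 5.0, size=(n_monkeys, dim)) # body power in [-5, 5]
    W = rng.uniform(4.0, 6.0, size=n_monkeys)          # weights in [4, 6]
    fit = np.array([error_rate(x) for x in X])
    best = fit.argmin()
    for _ in range(t_max):
        x_best, w_best = X[best].copy(), W[best]       # current leader
        for i in range(n_monkeys):
            r = rng.uniform()
            PB[i] = PA[i] * PB[i] + (w_best - W[i]) * r * (x_best - X[i])  # Eq. (18)
            X[i] = X[i] + PB[i]                        # Eq. (19)
            PA[i] = PA[i] * rng.uniform()              # Eq. (20): power decay
            fit[i] = error_rate(X[i])
        best = fit.argmin()                            # leader = lowest error rate
    return X[best]
```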

4 Performance validation

This section assesses the performance of the CICC-BDVMR approach using two standard datasets [25], namely the localization data and the skin data. The localization data involves 8 attributes and 164,860 instances. The skin dataset includes 245,057 instances, amongst which 50,859 are skin samples and the remaining 194,198 are non-skin samples. A correlation matrix is simply a table that shows the correlation coefficients between variables. The measure is best used for variables that exhibit linear relations with one another, and the fit of the data is visually characterized in a scatterplot. Figure 3 demonstrates the correlation matrix on the test localization dataset.
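As a sketch of how such a correlation matrix and the pairwise plots below could be produced, the following uses pandas and seaborn; the file name and the "class" column label are placeholders, not the actual dataset layout:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "localization_data.csv" and the "class" column are placeholders.
df = pd.read_csv("localization_data.csv")

corr = df.corr(numeric_only=True)               # pairwise linear correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")  # matrix view, as in Fig. 3
plt.show()

sns.pairplot(df, hue="class")                   # pairwise scatterplots, cf. Figs. 4 and 6
plt.show()
```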

Fig. 3 Correlation matrix of CICC-BDVMR technique on localization dataset

Figure 4 shows the pairwise relationship plot of the class labels involved in the localization dataset, namely sitting, walking, lying, lying down, on all fours, standing up from lying, sitting on the ground, standing up from sitting on the ground, falling, sitting down, and standing up from sitting.

Fig. 4 Pairwise relationship of class labels in localization dataset

Figure 5 shows the correlation matrix attained by the CICC-BDVMR method on the test skin dataset. The correlation matrix proves that the CICC-BDVMR method has gained enhanced performance on the test skin dataset. Figure 6 displays the pairwise relation plot of the class labels included in the skin dataset, namely skin and non-skin.

Fig. 5 Correlation matrix of CICC-BDVMR technique on skin dataset

Fig. 6 Pairwise relationship of class labels in skin dataset

Table 1 provides the comparative classification result analysis of the CICC-BDVMR technique with other techniques on the test localization dataset under different mappers (M). The experimental results indicate that the CICC-BDVMR technique has obtained effective classification performance under all sizes of M. For instance, with a training/testing dataset of 75:25 and M = 2, the CICC-BDVMR technique has achieved a higher accuracy of 81.60% whereas the CNB, GWOCNB, and CGCNB techniques have obtained lower accuracies of 77.92%, 77.96%, and 79.06% respectively. Likewise, with M = 5, the CICC-BDVMR technique has achieved a higher accuracy of 84.35% whereas the CNB, GWOCNB, and CGCNB techniques have obtained lower accuracies of 76.15%, 80.18%, and 81.05% respectively. Besides, with a training/testing dataset of 80:20 and M = 2, the CICC-BDVMR system has accomplished a high accuracy of 82.40% while the CNB, GWOCNB, and CGCNB systems have attained lower accuracies of 76.84%, 78.44%, and 79.66% correspondingly. In addition, with M = 5, the CICC-BDVMR method has reached a maximum accuracy of 84.11% whereas the CNB, GWOCNB, and CGCNB methods have attained lower accuracies of 76.39%, 79.96%, and 80.93% correspondingly.

Table 1 Result analysis of CICC-BDVMR technique under different mappers in localization dataset

Table 2 and Fig. 7 showcase the average classification results obtained by the CICC-BDVMR with recent methods under distinct sizes of training/testing data. With training/testing data of 75:25, the CICC-BDVMR technique has achieved better performance with the maximum average accuracy, TPR, and TNR of 83.36%, 86.11%, and 78.79%, whereas the CNB, GWOCNB, and CGCNB techniques have resulted in ineffective outcomes with lower accuracies of 76.96%, 78.71%, and 80.47% respectively. Simultaneously, with training/testing data of 85:15, the CICC-BDVMR system has accomplished improved performance with the maximal average accuracy, TPR, and TNR of 82.92%, 86.62%, and 79.32%, while the CNB, GWOCNB, and CGCNB systems have resulted in inefficient outcomes with lower accuracies of 76.55%, 82.08%, and 73.71% correspondingly.

Table 2 Average analysis of CICC-BDVMR technique under distinct sizes of training/testing data in localization data
Fig. 7 Average analysis of CICC-BDVMR technique under localization data

Concurrently, with training/testing data of 80:20, the CICC-BDVMR system has accomplished good performance with the highest average accuracy, TPR, and TNR of 82.80%, 86.27%, and 79.06%, while the CNB, GWOCNB, and CGCNB systems have resulted in inefficient outcomes with lower accuracies of 77.30%, 81.60%, and 73.47% correspondingly. Furthermore, with training/testing data of 90:10, the CICC-BDVMR method has accomplished effective performance with the highest average accuracy, TPR, and TNR of 83.35%, 87.14%, and 80.14%, while the CNB, GWOCNB, and CGCNB systems have resulted in inefficient outcomes with minimum accuracies of 78.04%, 81.55%, and 73.49% correspondingly.

The overall accuracy outcome analysis of the CICC-BDVMR technique on localization data is portrayed in Fig. 8. The results demonstrate that the CICC-BDVMR technique has accomplished improved validation accuracy compared to training accuracy. It is also observable that the accuracy values saturate at an epoch count of 1000. Table 3 offers the relative analysis of the CICC-BDVMR system with other approaches on the test skin dataset under dissimilar mappers (M). The experimental results indicate that the CICC-BDVMR method has attained good classification performance under each size of M. For example, with a training/testing dataset of 75:25 and M = 2, the CICC-BDVMR method has accomplished a high accuracy of 83.44% while the CNB, GWOCNB, and CGCNB approaches have attained lower accuracies of 76.04%, 75.95%, and 79.70% correspondingly. In addition, with M = 5, the CICC-BDVMR system has realized a high accuracy of 81.54% while the CNB, GWOCNB, and CGCNB methods have gained lower accuracies of 76.70%, 77.27%, and 78.34% correspondingly.

Fig. 8 Accuracy analysis of CICC-BDVMR technique under localization data

Table 3 Result analysis of CICC-BDVMR technique under different mappers in skin dataset

Besides, with a training/testing dataset of 80:20 and M = 2, the CICC-BDVMR method has accomplished a high accuracy of 82.09% while the CNB, GWOCNB, and CGCNB methods have attained lower accuracies of 75.96%, 77.66%, and 79.46% correspondingly. In addition, with M = 5, the CICC-BDVMR system has reached a high accuracy of 80.63% while the CNB, GWOCNB, and CGCNB methods have attained lower accuracies of 76.33%, 77.34%, and 77.34% correspondingly.

Table 4 and Fig. 9 show the average classification outcomes attained by the CICC-BDVMR with existing models under dissimilar sizes of training/testing data [26]. With training/testing data of 75:25, the CICC-BDVMR system has accomplished improved performance with the maximal average accuracy, TPR, and TNR of 81.89%, 85.71%, and 76.75%, whereas the CNB, GWOCNB, and CGCNB methods have resulted in inefficient outcomes with lower accuracies of 76.19%, 81.02%, and 71.11% correspondingly. At the same time, with training/testing data of 85:15, the CICC-BDVMR method has reached improved performance with the maximal average accuracy, TPR, and TNR of 81.83%, 87.64%, and 77.64%, while the CNB, GWOCNB, and CGCNB systems have resulted in inefficient outcomes with lower accuracies of 76.82%, 81.17%, and 71.25% correspondingly.

Table 4 Average analysis of CICC-BDVMR technique under distinct sizes of training/testing data in skin data
Fig. 9 Average analysis of CICC-BDVMR technique under skin data

Simultaneously, with training/testing data of 80:20, the CICC-BDVMR system has accomplished good performance with the maximal average accuracy, TPR, and TNR of 81.52%, 87.20%, and 76.45%, while the CNB, GWOCNB, and CGCNB systems have resulted in inefficient outcomes with lower accuracies of 76.68%, 81.20%, and 70.80% correspondingly. Additionally, with training/testing data of 90:10, the CICC-BDVMR procedure has attained effective performance with the maximal average accuracy, TPR, and TNR of 81.21%, 87.41%, and 77.43%, while the CNB, GWOCNB, and CGCNB methods have resulted in inefficient outcomes with lower accuracies of 77.48%, 81.78%, and 72.18% correspondingly.

The overall accuracy analysis of the CICC-BDVMR method on skin data is depicted in Fig. 10. The result demonstrates that the CICC-BDVMR approach has attained higher validation accuracy than training accuracy. It is also noticeable that the accuracy values saturate at an epoch count of 1000. The above-mentioned tables and figures confirm that the proposed model has accomplished effectual outcomes over the other techniques.

Fig. 10 Accuracy analysis of CICC-BDVMR technique under skin data

5 Conclusion

In this study, a novel CICC-BDVMR technique has been developed for accomplishing effectual BDV through clustering and data classification on the Map Reduce environment. The proposed CICC-BDVMR technique encompasses several subprocesses, namely KFCM based clustering, GOA based initial cluster center selection, RR based classification, and RCM based parameter tuning. The utilization of the GOA and RCM algorithms helps to effectually improve the overall big data classification outcomes. To demonstrate the enhanced performance of the CICC-BDVMR system, a comprehensive comparative result analysis was made on the benchmark datasets. The extensive comparison study demonstrates the promising performance of the CICC-BDVMR approach over recent state-of-the-art approaches. Therefore, the CICC-BDVMR technique has been found to be a proficient tool to visualize and classify big data. In future, feature selection and feature reduction methodologies can be integrated into the proposed model to improve the classification outcomes. A final study direction is metaheuristic optimization for text clustering: text clustering performance can be improved by combining these strategies, and text clustering difficulties can be solved via hybrid and updated methods. New metaheuristic optimization methods for clustering problems have recently been proposed, including Salp Swarm Optimization, Harris Hawks Optimization, and Henry Gas Solubility Optimization.