A Distributed Attribute Reduction Algorithm for High-Dimensional Data under the Spark Framework

Attribute reduction is an important issue in rough set theory. However, rough set theory-based attribute reduction algorithms need to be improved to deal with high-dimensional data: a distributed version of attribute reduction is necessary for handling big data effectively, and the partitioning of the attribute space is an important research direction. In this paper, a distributed attribute reduction algorithm based on cosine similarity (DARCS) for high-dimensional data pre-processing under the Spark framework is proposed. First, to avoid the repeated evaluation of similar attributes, the algorithm gathers similar attributes into clusters based on a similarity measure. Then one attribute is selected at random from each cluster as its representative, and the representatives form a candidate attribute subset for the subsequent reduction operation. At the same time, to improve computing efficiency, an improved method is introduced to calculate the attribute dependency within each divided sub-attribute space. Experiments on eight datasets show that, while avoiding the loss of critical information, DARCS improves reduction ability by 0.32 to 39.61% and computing efficiency by 31.32 to 93.79% compared to the distributed attribute reduction algorithm based on random partitioning of the attribute space.


Introduction
In recent years, along with the rapid development of information technology, a huge volume of digital data has been continuously generated in different application areas, and the size and dimensionality of data are growing exponentially [1][2][3]. Mining knowledge from such large-scale and high-dimensional data is a tremendously challenging task. High dimensionality increases temporal and spatial complexity and can even deteriorate the performance of mining algorithms [4,5]. In fact, many attributes in a dataset are redundant and unnecessary for decision-making. Eliminating these attributes gives people a clearer understanding of the data and allows mining algorithms to perform more effectively. Therefore, dimensionality reduction (DR) has become an important and common step in data cleaning [6,7]. Rough set theory [8], proposed by Pawlak, is a powerful mathematical tool for dealing with inconsistent information in decision situations. It has been successfully applied in many fields, such as pattern recognition [9], knowledge discovery [10], and transportation problems [11]. As an important contribution of rough set theory, attribute reduction can remove redundant and unnecessary attributes while retaining the discernibility of the original data. Many traditional rough set-based attribute reduction methods have therefore been widely researched. However, most existing attribute reduction algorithms are completely serial and cannot process big data effectively due to their high computational cost.
To improve these algorithms, many efficient methods have been proposed in the past several years. Qian et al. [12] introduced a theoretical framework based on rough set theory, called positive approximation, which can be used to accelerate heuristic attribute reduction. Zhang et al. [13] analyzed the relationships between the original approximation sets and the updated ones, and then proposed incremental methods for fast computation of rough approximations. Raza and Qamar [14] proposed a new heuristic dependency calculation method, which avoids the time-consuming calculation of the positive region and improves the performance of subsequent algorithms. Gao et al. [15] proposed a reduction algorithm (ARNI) for continuous-valued information systems, which can be significantly accelerated by graphics processing units. Chen et al. [16] proposed an acceleration strategy based on attribute groups, which reduces the number of evaluations of candidate attributes. In conclusion, many efficient approaches have been developed. However, the approaches mentioned above cannot perform effectively when the dataset is of huge volume or high dimensionality.
Recently, many scholars have attempted to apply parallelized and distributed computing technologies to attribute reduction. Chen et al. [17] investigated the parallelization of dominance-based neighborhood rough sets and proposed effective and efficient attribute reduction algorithms based on MapReduce. Qian et al. [18] designed a novel pair structure to speed up the computation of equivalence classes and attribute significance and parallelized the traditional attribute reduction process based on the MapReduce mechanism. El-Alfy and Alshammari [19] introduced a scalable implementation of a parallel genetic algorithm in Hadoop MapReduce to approximate the minimum reduct, which has the same discernibility power as the original attribute set in the decision table. Hu et al. [20] designed an efficient attribute reduction algorithm for large-scale multimodality fuzzy classification based on their model of multi-kernel fuzzy rough sets in the MapReduce framework. All these MapReduce-based algorithms were implemented on the Apache Hadoop framework [21]. However, iterative parallel attribute reduction algorithms on the Hadoop platform need to read data from disk and store many intermediate results in the Hadoop Distributed File System (HDFS) frequently, which causes excessive disk I/O. Therefore, they are still time-consuming when dealing with big data.
To overcome the problems of the Hadoop MapReduce framework, Apache Spark, an improved distributed computing framework based on MapReduce, has been applied to real-world data mining and machine learning problems [22]. Apache Spark can be up to 100 times faster than Hadoop for large-scale iterative calculation because it iterates in memory and avoids shuffles through localized computation [23][24][25]. Hence, many distributed algorithms have been proposed under the Apache Spark framework. Zhang et al. [26] proposed a parallel large-scale attribute reduction algorithm with four representative heuristic feature selection algorithms on Spark. Chen et al. [27] proposed an attribute reduction algorithm for energy data using Spark, where a heuristic formula measures the significance of attributes to reduce the search space. Ramírez-Gallego et al. [28] proposed a distributed implementation of a generic feature selection framework that includes a broad group of well-known information theory-based methods. These algorithms focus only on partitioning the object space, which can easily be implemented on a row-oriented database. Column-oriented databases, by contrast, have been widely used recently, and as the number of attributes in each data partition increases, the reduction calculation time grows rapidly. Dagdia et al. [29] proposed a scalable and effective rough set theory-based approach, named Sp-RST, for high-dimensional data pre-processing, which splits the attribute space of a big dataset into several partitions with a smaller number of attributes. Sp-RST achieves a good speed-up and performs its attribute reduction task well without sacrificing performance. However, Sp-RST splits the attribute space randomly, which may place similar attributes in different data partitions and thereby leave redundant attributes in the final reduction set. This has an uncertain impact on the reduction efficiency and classification accuracy.
The proposed method measures the similarity between attributes and gathers similar attributes into clusters before partitioning the attribute space. The attributes in each cluster are considered similar and equally important. Under random partitioning, if the attributes of one cluster are important to the dataset, they remain important after being scattered into different data blocks; if they are unnecessary, they may nevertheless appear important once separated from their similar counterparts. In both cases, similar attributes that should have been deleted survive in the reduction result, which lowers the reduction rate. To form a candidate attribute subset, one attribute that can represent the cluster is selected at random from each cluster. The data corresponding to the candidate subset are then processed by Sp-RST to obtain the final reduct, which avoids the repeated evaluation of similar attributes and reduces the uncertainty introduced by randomly partitioning the attribute space. In the Sp-RST algorithm, the reduction of each partitioned data block requires evaluating the dependency of all possible attribute subsets, and the dependency measurement based on the positive region involves unnecessary calculations that can be further optimized. In this paper, we introduce a method that calculates attribute dependency directly from equivalence classes to simplify the calculation and improve algorithm efficiency under the Spark framework. In summary, this paper improves the Sp-RST algorithm in both the attribute partitioning method and the calculation efficiency within each divided sub-attribute space.
The rest of this paper is structured as follows. In Sect. 2, some basic concepts of rough set theory are introduced. Sect. 3 introduces Sp-RST in detail. In Sect. 4, an improved attribute reduction algorithm is proposed. Sect. 5 describes a series of comparative experiments designed to show the feasibility and efficiency of the proposed algorithm. Sect. 6 concludes the paper.

Preliminaries
In this section, the basic concepts of rough set theory are reviewed.

Rough Set Theory
In rough set theory, a decision system (decision table) can be described as a quadruple $DS = \langle U, C \cup d, V, f \rangle$, where $U = \{x_1, x_2, \ldots, x_M\}$ is a nonempty finite set of $M$ objects, called the universe; $C = \{a_1, a_2, \ldots, a_N\}$ is a nonempty finite set of conditional attributes; and $d$ is the set of decision attributes. Besides, $V = \bigcup_{a \in C \cup d} V_a$, where $V_a$ is the set of values of attribute $a$; $f : U \times (C \cup d) \to V$, an information or description function, assigns a value to each attribute of each object, and $f(a, x)$ represents the value of object $x$ on attribute $a$.

Definition 1 [8,30]. In the given decision system $DS = \langle U, C \cup d, V, f \rangle$, with a nonempty attribute subset $B \subseteq C$, an indiscernibility relation $IND(B)$ on the universe $U$ can be defined as follows:

$$IND(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(a, x) = f(a, y)\}.$$

The indiscernibility relation is an equivalence relation. The equivalence class of an object $x$ with respect to $B$ is denoted by $[x]_B = \{y \in U \mid (x, y) \in IND(B)\}$, and $U/B$ denotes the set of all equivalence classes induced by $B$.

Definition 2 [10,31]. In the given decision system $DS = \langle U, C \cup d, V, f \rangle$, for $Y \subseteq U$ and $B \subseteq C$, based on the indiscernibility relation, the lower approximation $\underline{B}(Y)$ and the upper approximation $\overline{B}(Y)$ of $Y$ with respect to $B$ are defined as follows, respectively:

$$\underline{B}(Y) = \{x \in U \mid [x]_B \subseteq Y\}, \qquad \overline{B}(Y) = \{x \in U \mid [x]_B \cap Y \neq \emptyset\}.$$

Definition 3 [10,31]. In the given decision system $DS = \langle U, C \cup d, V, f \rangle$, for $B \subseteq C$, the positive region of $d$ with respect to $B$ is defined as

$$POS_B(d) = \bigcup_{Y \in U/d} \underline{B}(Y),$$

and the dependency degree of the decision attribute $d$ on $B$ is given as follows:

$$\gamma_B(d) = \frac{|POS_B(d)|}{|U|}.$$

Based on the concept of the positive region, attribute reduction based on rough sets can be defined as given below.

Definition 4 [10,31]. In the given decision system $DS = \langle U, C \cup d, V, f \rangle$, and a conditional attribute set $R \subseteq C$, $R$ is a reduct of $C$ if $\gamma_R(d) = \gamma_C(d)$ and, for any $R' \subset R$, $\gamma_{R'}(d) \neq \gamma_C(d)$.
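To make these definitions concrete, the following minimal Python sketch (ours, not part of the original paper) computes equivalence classes, the positive region, and the dependency degree of Definitions 1-4 on a toy decision table; the table values are illustrative assumptions.

```python
from collections import defaultdict

# Toy decision table: each row is an object; a1..a3 are conditional
# attributes and 'd' is the decision attribute (illustrative values).
U = [
    {"a1": 0, "a2": 1, "a3": 0, "d": 0},
    {"a1": 0, "a2": 1, "a3": 1, "d": 1},
    {"a1": 1, "a2": 0, "a3": 1, "d": 1},
    {"a1": 1, "a2": 0, "a3": 1, "d": 1},
]

def equivalence_classes(objects, attrs):
    """U/attrs: partition object indices by their value tuple on attrs."""
    classes = defaultdict(list)
    for i, x in enumerate(objects):
        classes[tuple(x[a] for a in attrs)].append(i)
    return [set(ids) for ids in classes.values()]

def positive_region(objects, B, d="d"):
    """POS_B(d): union of the B-classes contained in some d-class."""
    d_classes = equivalence_classes(objects, [d])
    pos = set()
    for X in equivalence_classes(objects, B):
        if any(X <= D for D in d_classes):
            pos |= X
    return pos

def dependency(objects, B, d="d"):
    """gamma_B(d) = |POS_B(d)| / |U| (Definition 3)."""
    return len(positive_region(objects, B, d)) / len(objects)

print(dependency(U, ["a1"]))        # 0.5 on this toy table
print(dependency(U, ["a1", "a3"]))  # 1.0: {a1, a3} determines d
```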

The Sp-RST Algorithm
In this section, Sp-RST [29], a rough set theory-based approach for high-dimensional data pre-processing under the Spark framework, is discussed.
To process high-dimensional data effectively, Sp-RST divides the data into multiple data blocks by randomly splitting the conditional attribute set. For each data block, which contains only part of the attributes, Sp-RST uses the positive region-based dependency measure to perform attribute reduction. The result of one iteration of reduction over the whole data is obtained by merging the reduction results of all data blocks. To further guarantee reduction performance, Sp-RST intersects the results of multiple iterations to achieve an effective reduction.
The general procedure for attribute reduction in Sp-RST is given below.

• Step 1: Sp-RST starts the $n$-th, $n = \{1, 2, \ldots, N\}$, round of reduction calculation and divides the whole decision system into sub-decision systems, each containing a random subset of the conditional attributes.
• Step 2: Sp-RST computes the reduct of each sub-decision system by evaluating the dependency of all possible attribute combinations within it.
• Step 3: Sp-RST merges the reduction results generated by all the sub-decision systems to obtain the reduct $R_n$ of the $n$-th round.
• Step 4: If $n < N$, go back to Step 1; else, jump to Step 5.
• Step 5: Sp-RST performs an intersection over all the obtained $R_n$ to obtain the final reduct $R = \bigcap_{n=1}^{N} R_n$.

Although the Sp-RST algorithm provides an effective attribute reduction method for high-dimensional data under the Spark framework, the attributes of each partitioned data block are selected randomly. Similar attributes may therefore be divided into different data blocks, which may result in an uncertain reduction effect. Example 1, which computes the positive regions and dependencies of all possible attribute combinations of $C_2$ over Table 1, shows that redundant attributes remain in the reduct when similar attributes are split across data blocks, because attributes that should be deleted may be retained. Meanwhile, the reduction method in each sub-decision system needs to generate all possible combinations of attributes and calculate their dependency; the combination with the largest dependency value and the fewest attributes is taken as the reduct of the sub-decision system. Calculating the dependency in Sp-RST is thus a tedious task, and it can be further optimized by omitting the computation of the equivalence classes of the decision attribute and the derivation of the positive region through the approximation sets.
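As a reading aid, the round structure just described can be sketched on a single machine as follows. This is our simplification, not Sp-RST itself: the Spark distribution and the exact per-partition routines are abstracted away, and the `dependency` function from the Preliminaries sketch is reused.

```python
import random
from itertools import combinations

def reduct_of_block(objects, block, dependency):
    """Per-block reduct: the smallest attribute subset of the block that
    reaches the block's best dependency (exhaustive search, as in the
    per-partition step of Sp-RST)."""
    best = max(dependency(objects, list(s))
               for r in range(1, len(block) + 1)
               for s in combinations(block, r))
    for r in range(1, len(block) + 1):          # smallest subsets first
        for s in combinations(block, r):
            if dependency(objects, list(s)) == best:
                return set(s)

def sp_rst_like(objects, attrs, block_size, n_rounds, dependency):
    """N rounds of: random attribute blocks -> per-block reducts ->
    union per round (Step 3) -> intersection over rounds (Step 5)."""
    round_reducts = []
    for _ in range(n_rounds):
        shuffled = random.sample(attrs, len(attrs))  # random split (Step 1)
        blocks = [shuffled[i:i + block_size]
                  for i in range(0, len(shuffled), block_size)]
        R_n = set()
        for block in blocks:                         # Step 2
            R_n |= reduct_of_block(objects, block, dependency)
        round_reducts.append(R_n)                    # Step 3
    return set.intersection(*round_reducts)          # Step 5

# Example, using U and dependency from the Preliminaries sketch:
# print(sp_rst_like(U, ["a1", "a2", "a3"], 2, 3, dependency))
```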

Distributed Attribute Reduction Algorithm Based on Cosine Similarity
This section describes DARCS, a distributed attribute reduction algorithm for high-dimensional data pre-processing under the Spark framework.

Similarity
To reduce the uncertainty of randomly dividing the attribute space, we measure the similarity between attributes and gather similar attributes into clusters before partitioning. If the values of all data objects are distributed almost identically under a group of attributes, those attributes can be considered similar. The concept of distance, unlike entropy, mean value, and variance, reflects the overall distribution of the data and is therefore closely related to similarity. The Euclidean distance [32] is the most frequently used metric for measuring similarity as the distance between two high-dimensional vectors. Unfortunately, the Euclidean distance is highly sensitive to magnitudes. As an alternative, cosine similarity measures the similarity between two non-zero vectors by the cosine of the angle between them and is not sensitive to magnitudes [33]. The cosine similarity, defined as the inner product of two vectors divided by the product of their lengths, has been commonly used in anomaly detection [34], pattern recognition [35], disease diagnosis [36], and feature selection [37].
Definition 5 [33]. For two discrete vectors $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$, the cosine similarity coefficient $C(x, y)$ between $x$ and $y$ can be defined as follows:

$$C(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\ \sqrt{\sum_{i=1}^{n} y_i^2}}.$$

The range of $C(x, y)$ is $[-1, +1]$; the closer the value is to $+1$, the higher the similarity of the two vectors.
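Definition 5 translates directly into code; the toy vectors below (our illustration, not from the paper) show the scale-insensitivity that motivates preferring cosine similarity over Euclidean distance:

```python
import math

def cosine_similarity(x, y):
    """C(x, y) = <x, y> / (||x|| * ||y||), for non-zero vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

a2 = [1, 2, 3, 4]
a4 = [2, 4, 6, 8]                  # same distribution as a2, scaled by 2
print(cosine_similarity(a2, a4))   # 1.0: similar despite magnitudes
```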
As the attributes in each cluster are similar and considered equally important, only one attribute, which can represent the whole cluster, is selected randomly from each cluster. All the selected attributes form a candidate attribute subset that participates in the subsequent reduction calculation, which avoids the repeated evaluation of similar attributes. The pseudo-code of this attribute pre-processing is given in Algorithm 1, and a sketch of the idea follows.
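The sketch below is one plausible single-machine reading of this step, not the paper's Algorithm 1 verbatim: the greedy clustering pass and all function names are our assumptions. Attributes whose column vectors have cosine similarity of at least a threshold cv with a cluster's first member join that cluster, and one representative per cluster is drawn at random. It reuses `cosine_similarity` from the previous sketch.

```python
import random

def cluster_attributes(columns, cv):
    """Greedily group attribute columns whose cosine similarity to a
    cluster's first member is at least cv (assumed strategy; the
    paper's Algorithm 1 may differ in detail)."""
    clusters = []
    for name, col in columns.items():
        for cluster in clusters:
            rep_name, rep_col = cluster[0]
            if cosine_similarity(col, rep_col) >= cv:
                cluster.append((name, col))
                break
        else:
            clusters.append([(name, col)])
    return clusters

def candidate_subset(columns, cv, seed=None):
    """One random representative per cluster forms the candidate subset S."""
    rng = random.Random(seed)
    return [rng.choice(cluster)[0]
            for cluster in cluster_attributes(columns, cv)]

# Example: a2 and a4 are proportional, hence cosine-similar, so one of
# them is chosen at random to represent their shared cluster.
cols = {"a1": [0, 1, 1, 0], "a2": [1, 2, 3, 4],
        "a3": [3, 1, 2, 2], "a4": [2, 4, 6, 8]}
print(candidate_subset(cols, cv=0.99, seed=0))   # e.g. ['a1', 'a2', 'a3']
```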

Example 2 For the decision table in Table 1, $a_2$ and $a_4$ will be gathered into the same cluster since they are similar attributes. Supposing that $a_2$ is selected to represent the entire cluster, then $m = 2$ and the candidate attribute subset is $S = \{a_1, a_2, a_3\}$, with $C_1 = \{a_1\}$ and $C_2 = \{a_2, a_3\}$. As $C_1$ is composed of only a single attribute, $R_1 = \{a_1\}$; $R_2 = \{a_3\}$, as calculated in Example 1. Hence $R = R_1 \cup R_2 = \{a_1, a_3\}$ is the final reduct.
Compared with the Sp-RST algorithm, as illustrated by Examples 1 and 2, by clustering similar attributes and establishing the candidate attribute subset, the proposed algorithm effectively avoids dividing similar attributes into different partitioned data blocks and thus reduces the uncertainty of random attribute partitioning. Moreover, the number of attributes in the candidate attribute subset is smaller than that in the original data, which reduces the computational cost and improves operational efficiency.

Improved Dependency Calculation Method
To calculate the dependency in a sub-decision system, Sp-RST uses the positive region. It first calculates the equivalence classes of the decision and conditional attributes. Then the positive region is obtained as the union of all conditional-attribute equivalence classes that are subsets of some decision-attribute equivalence class. This way of calculating the dependency is complex and can be improved.
Definition 6 In the given sub-decision system with the decision value set $V_d = \{d_1, d_2, \ldots, d_n\}$, for an equivalence class $X \in U/B$, the set of decision values corresponding to the elements in $X$ is defined as follows:

$$F_d(X) = \{f_d(x) \mid x \in X\},$$

where $f_d(x)$ represents the decision attribute value of $x$. In particular, $\|F_d(X)\| = 1$ denotes that $X$ has a consistent decision attribute value.

Theorem 1 In the given sub-decision system, for any equivalence class $X \in U/B$, $X \subseteq POS_B(d)$ if and only if $\|F_d(X)\| = 1$.

Proof Suppose $\|F_d(X)\| = 1$, say $F_d(X) = \{d_1\}$, and let $D_i$ denote the equivalence class of the decision attribute with value $d_i$. If the elements of $X$ were distributed in some $D_i$ with $i \in \{2, 3, \ldots, n\}$, then $\exists y \in X$ with $y \in D_i$, that is, $f_d(y) = d_i$. However, according to Definition 6, $d_i \in F_d(X)$, which is contradictory with $F_d(X) = \{d_1\}$. This means that the elements of $X$ are not distributed in any $D_i$, where $i = \{2, 3, \ldots, n\}$, and exist only in $D_1$, so $X \subseteq D_1$ and therefore $X \subseteq \underline{B}(D_1) \subseteq POS_B(d)$. Conversely, if $X \subseteq POS_B(d)$, then $X$ is contained in a single decision class, so $\|F_d(X)\| = 1$. ◻

Theorem 2 In the given sub-decision system,

$$|POS_B(d)| = \sum_{X \in U/B,\ \|F_d(X)\| = 1} |f_d(X)|,$$

where $X$ satisfies $\|F_d(X)\| = 1$ and $|f_d(X)|$ denotes the number of decision values corresponding to the elements in $X$ (counted with multiplicity, i.e., $|f_d(X)| = |X|$).
Proof Based on Theorem 1, the positive region contains exactly those equivalence classes $X$ that satisfy $\|F_d(X)\| = 1$. The number of elements in $X$ equals the length of the corresponding decision value array. Hence, $|POS_B(d)|$ can be calculated by accumulating all $|f_d(X)|$ for which $\|F_d(X)\| = 1$. ◻

By Theorem 1, whether an equivalence class of the conditional attributes belongs to the positive region can be judged by checking whether the decision values of its elements are consistent. Meanwhile, by Theorem 2, the cardinality of the positive region can be obtained from the number of decision values corresponding to such equivalence classes. Therefore, an efficient method to calculate the dependency is introduced, whose pseudo-code is given in Algorithm 2 (see the sketch after this paragraph). Combining Algorithms 1 and 2 with the Sp-RST algorithm, a distributed attribute reduction algorithm based on cosine similarity under the Spark framework (DARCS) can be constructed, which is described in Algorithm 3. The DARCS algorithm is an improvement of the Sp-RST algorithm: Algorithm 1 is added to reduce the uncertainty of randomly partitioning attributes, and Algorithm 2 is proposed to improve the way the dependency is calculated in the sub-decision systems. For the sub-decision system of Example 2, the proposed method yields, for instance, $\gamma_{\{a_3\}}(d) = 2/3$ and, similarly, $\gamma_{\{a_2\}}(d) = 0$ and $\gamma_{\{a_2, a_3\}}(d) = 2/3$, which are the same values as those obtained in Example 1.
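A minimal sketch of the dependency calculation suggested by Theorems 1 and 2 (our rendering, not the paper's Algorithm 2 verbatim): objects are grouped once by their B-values, and an equivalence class contributes its full size to $|POS_B(d)|$ exactly when its decision values are consistent, so neither the decision-attribute equivalence classes nor the approximation sets are ever materialized.

```python
from collections import defaultdict

def dependency_direct(objects, B, d="d"):
    """gamma_B(d) computed directly from the B-equivalence classes:
    a class counts with its full size iff all its elements share one
    decision value (Theorem 1); |POS_B(d)| is the running sum of such
    class sizes (Theorem 2)."""
    classes = defaultdict(list)
    for x in objects:
        classes[tuple(x[a] for a in B)].append(x[d])  # decision values per class
    pos_size = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return pos_size / len(objects)

# On the toy table U from the Preliminaries sketch this agrees with the
# positive-region computation, e.g. dependency_direct(U, ["a1", "a3"]) == 1.0.
```

A single pass over the objects per candidate subset replaces the two-stage construction of decision-attribute equivalence classes followed by the positive region.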
Compared with the method of calculating the dependency in the Sp-RST algorithm, as shown in Example 1, the proposed method produces the same results but does not need to calculate the equivalence classes of the decision attribute or derive the positive region through the approximation sets, which simplifies the calculation process and improves computational efficiency. Therefore, compared with the distributed attribute reduction algorithm based on random partitioning of the attribute space, the main benefits of the proposed algorithm are as follows:

1. It measures the similarity between attributes before partitioning them and selects a candidate attribute subset, avoiding the repeated evaluation of similar attributes and improving both reduction ability and computational efficiency.
2. It introduces an improved method to calculate the attribute dependency within each divided sub-attribute space, improving computational efficiency.

Experiments
The proposed algorithm, which is an improvement of Sp-RST [29], was evaluated experimentally to compare the two approaches.

Experimental Environment
To verify the effectiveness of the proposed approach, experiments were conducted on eight datasets, shown in Table 2. The BR, GA, GT, and MD datasets, also used in [38], are from OpenML [39], and the remaining datasets are from the UCI machine learning repository [40]. The proposed algorithm is compared with Sp-RST in our experiments. Owing to the performance limits of the experimental equipment, three partition sizes recommended for Sp-RST were used, namely four, five, and eight attributes per partition; these specify the number of attributes contained in each partitioned data block. Moreover, 5-fold cross-validation was used to compute the classification accuracy. To reduce the impact of randomness, each algorithm was executed ten times, and the averages of the classification accuracy, reduction rate, and execution time were taken as the experimental results. The Sp-RST implementation is based on the open-source code on GitHub (https://github.com/zeinebchelly/Sp-RST), and our algorithm is also available on GitHub (https://github.com/popokk3/DARCS).
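The evaluation protocol can be sketched with scikit-learn; RF, KNN with K = 5, and SVM are the classifiers named below, the hyper-parameters shown are the library defaults (the paper does not specify them), and the data loading is left to the reader:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def mean_cv_accuracy(X, y, n_runs=10):
    """Average 5-fold CV accuracy of RF/KNN/SVM over n_runs repetitions,
    mirroring the protocol described above."""
    classifiers = {
        "RF": lambda: RandomForestClassifier(),
        "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
        "SVM": lambda: SVC(),
    }
    return {
        name: np.mean([cross_val_score(make(), X, y, cv=5).mean()
                       for _ in range(n_runs)])
        for name, make in classifiers.items()
    }
```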
The programs for measuring the execution time and classification accuracy of all classifiers involved in this paper were written in Python and run on a personal computer with Windows 10, an Intel(R) Core(TM) i7-10875H CPU at 2.30 GHz, and 16 GB of memory. The DARCS algorithm was coded in Scala and run on a cluster of 10 computers. The cluster has 40 physical cores in total, each an AMD Athlon II X4 645 core with a main frequency of 3.1 GHz, and 100 GB of RAM in total. Each computer is equipped with Hadoop 2.9.2, Spark 2.4.7, Java 1.8.191, and Scala 2.11.12.

Experiments About the Effect of Cosine Values
The cosine value influences the size of the candidate attribute subset, which in turn affects the reduction rate and the classification accuracy. The purpose of attribute reduction is to delete redundant attributes without significant loss of the data's classification ability; a higher reduction rate means a stronger reduction ability. To obtain a suitable cosine similarity parameter and a good reduced attribute subset, the cardinality and classification accuracy of the reduct under different cosine values should be examined in detail. Experiments were performed to illustrate the classification accuracy and reduction rate of DARCS under different cosine values. The results are shown in Figs. 1 and 2 respectively, where the cosine value cv ∈ [0.90, 1.00] varies at intervals of 0.02 and the classification accuracy is measured with the Random Forest (RF) classifier. Note that when cv is set to 1.00, the candidate attribute subset obtained by Algorithm 1 contains all conditional attributes of the original data, all attributes participate in the reduction calculation, and DARCS degenerates to Sp-RST. Figure 1 shows that cv greatly influences the classification accuracy of DARCS, and Fig. 2 shows that the reduction rate tends to decrease as cv increases. The parameter cv is usually set so that both the classification accuracy and the reduction rate are high; by analyzing Figs. 1 and 2, an appropriate cosine value was selected for each dataset. The sweep itself can be expressed as a simple grid search, as sketched below.
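In the sketch, `reduce_with_darcs` is a hypothetical handle for running DARCS at a given threshold, `evaluate_accuracy` stands in for the RF cross-validation above, and the lexicographic selection rule (accuracy first, then reduction rate) is our assumption about how "both high" is operationalized:

```python
def select_cosine_value(X, y, reduce_with_darcs, evaluate_accuracy):
    """Sweep cv over [0.90, 1.00] at intervals of 0.02 and pick the value
    with the best (accuracy, reduction rate) trade-off (assumed rule)."""
    best = None
    for step in range(6):                      # cv = 0.90, 0.92, ..., 1.00
        cv = 0.90 + 0.02 * step
        kept = reduce_with_darcs(X, y, cv)     # indices of retained attributes
        reduction_rate = 1 - len(kept) / X.shape[1]
        acc = evaluate_accuracy(X[:, kept], y)
        if best is None or (acc, reduction_rate) > (best[1], best[2]):
            best = (cv, acc, reduction_rate)
    return best                                # (cv, accuracy, reduction rate)
```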

Experimental Comparison with and Without DARCS
The classification accuracies of the original datasets and of the datasets reduced by Sp-RST and DARCS, obtained on the eight datasets by the three classifiers RF, K-Nearest Neighbor with K = 5 (KNN), and Support Vector Machine (SVM) with 5-fold cross-validation, are shown in Tables 3, 4, and 5 respectively. In these tables, bold numbers indicate the higher classification accuracy between the raw data (RAW) and the DARCS reduction, and underlined numbers indicate the better one between Sp-RST and DARCS. For example, in Table 3, the RF (Random Forest) classification accuracy on BR reduced by DARCS is lower than on RAW but higher than on the Sp-RST reduction. Figure 2 shows that the DARCS algorithm is efficient at deleting redundant attributes. Tables 3, 4, and 5 show that, with four, five, and eight attributes per partition, the classification accuracies of the datasets reduced by DARCS improve over those of the original datasets in 50 of 72 experiments. The classification accuracies of the reduced BR dataset change greatly under different attribute partition settings; the experiments on BR also show that the classification accuracy can be adjusted by changing the number of attributes contained in each partitioned data block. On all datasets except BR, whenever the classification accuracy of a reduced dataset drops below that of the original dataset, the decrease ranges only from 0.06 to 0.74%. It can also be seen from Fig. 2 that the reduction ability of the algorithm tends to decrease as the number of attributes in each divided data block increases. Here, only the setting in which each partitioned data block contains the most attributes is chosen to show the computational efficiency of the classification algorithms. Table 6 shows that the execution times of the three classifiers (RF, KNN, SVM) on the eight datasets reduced by DARCS are lowered by 18.51 to 80.09% compared with those on the original datasets; bold numbers in Table 6 indicate the shorter execution time. Reduced data can effectively lower the time and space complexity of classification algorithms. Hence, the proposed algorithm can effectively reduce the dimensionality of datasets without a large loss of classification accuracy and improve the performance of mining algorithms.

A Comparison of DARCS and Sp-RST
This section compares DARCS with Sp-RST in terms of classification accuracy, reduction rate, and execution time. The number of attributes in each data partition and the number of iterations were kept consistent between the two algorithms.
In terms of classification accuracy, Tables 3, 4, and 5 show that the accuracies of the three classifiers (RF, KNN, and SVM) based on the DARCS reduction are slightly superior to those based on the Sp-RST reduction in 40 of 72 experiments. As shown in Table 7, where the higher reduction rates are in bold, the reduction ability of DARCS improves on that of Sp-RST by 0.32 to 39.61% across all experiments. It should be noted, however, that the reduction rate improvement in 10 of the 12 experiments on BR, GA, GT, and MM ranges only from 0.0032 to 0.0195, which is nearly negligible. This is mainly because the number of attributes involved in the reduction operation ultimately affects the reduction rate, as clearly shown in Fig. 2; the number of attributes involved in the two algorithms on these datasets is almost the same, so their reduction effects are also similar. As for execution time, Table 8 shows that the running time of Sp-RST is 1.46 to 17.04 times that of DARCS in all experiments; bold numbers in Table 8 indicate the shorter execution time between the two reduction algorithms. In summary, DARCS outperforms Sp-RST in terms of classification accuracy, reduction rate, and execution time.

Conclusion
Attribute reduction for high-dimensional data is a key part of data analysis, machine learning, and data mining. In this paper, a distributed attribute reduction algorithm based on cosine similarity under the Spark framework is proposed. Using a cosine similarity measure, similar attributes are gathered into clusters, and a candidate attribute subset is formed to participate in the subsequent reduction operation, which avoids the repeated evaluation of similar attributes and reduces the uncertainty of randomly dividing the attribute space. At the same time, an improved method is introduced to calculate the attribute dependency within each divided sub-attribute space to improve computational efficiency. Experiments on eight datasets show that, without significant loss of the data's classification ability, the reduction ability and computing efficiency of DARCS improve by 0.32 to 39.61% and 31.32 to 93.79% respectively compared with the distributed attribute reduction algorithm based on random partitioning of the attribute space. The main benefits of the proposed methodology are as follows:

1. It measures the similarity between attributes before partitioning them and selects a candidate attribute subset, avoiding the repeated evaluation of similar attributes and improving both reduction ability and computational efficiency.
2. It introduces an improved method to calculate the attribute dependency within each divided sub-attribute space, improving computational efficiency.
3. It offers a new insight into high-dimensional data pre-processing.
4. It can be applied to attribute reduction in fields such as voice rehabilitation, disease classification, and gene selection.
5. The data it produces can improve the performance of mining algorithms.
However, the algorithm still needs to calculate the dependency of all attribute combinations when computing the reduct of each data block, which remains time-consuming. In future work, to further improve the classification performance and computational efficiency of the proposed algorithm on high-dimensional datasets, new similarity measures and heuristic computing strategies will be explored.