In this section, we will introduce our parallel rough set-based algorithm, named ‘Sp-RST,’ for big data pre-processing and, specifically, for feature selection. Sp-RST has a distributed architecture based on Apache Spark for distributed, in-memory computation. First, we will highlight the main motivation for developing the distributed Sp-RST algorithm by identifying the computational inefficiencies of classical rough set theory that limit its application to small data sets. Second, we will elucidate our Sp-RST solution as an efficient approach capable of performing big data feature selection without sacrificing performance.
Motivation and problem statement
Rough set theory for feature selection is an exhaustive search, as the theory needs to compute every possible combination of attributes. The number of possible attribute subsets with m attributes from a set of N total attributes is \(\left( {\begin{array}{c}N\\ m\end{array}}\right) = \frac{N!}{m!(N - m)!}\) [17]. Thus, the total number of feature subsets to generate is \(\sum _{i=1}^N{\left( {\begin{array}{c}N\\ i\end{array}}\right) } = 2^N-1\). For example, for \(N=30\) we already have roughly one billion combinations. This constraint prevents us from using high-dimensional data sets, as the number of feature subsets grows exponentially with the total number of features N. Moreover, hardware constraints, specifically memory capacity, limit the number of entries that can be stored: the system has to keep the entire training data set in memory, together with all the supplementary data computations as well as the generated results, and this data can easily exceed the available RAM. These are the main motivations for our proposed Sp-RST solution, which makes use of parallelization.
The proposed solution
To overcome the inadequacy of standard RST for feature selection in the context of big data, we propose our distributed Sp-RST solution. Technically, to handle a large data set it is crucial to store the entire data set within a parallel framework and to perform computations in a distributed way. Based on these requirements, we first partition the overall rough set feature selection process into a set of smaller, basic tasks that can each be processed independently. After that, we combine the generated intermediate outputs to build the sought result, i.e., the reduct set.
General model formalization
For feature selection, our learning problem aims to select a set of highly discriminating attributes from the initial large-scale input data set. The input base refers to the data stored in the distributed file system (DFS). To perform distributed tasks on the given DFS data, a resilient distributed data set (RDD) is built. The latter can be formalized as an information table that we name \(T_{ RDD }\). \(T_{ RDD }\) is defined via a universe \(U = \{x_1, x_2, \ldots , x_N\}\), which refers to the set of data instances (items), a large conditional feature set \(C = \{c_1, c_2, \ldots , c_V\}\) that includes all the features of the \(T_{ RDD }\) information table and, finally, a decision feature D of the given learning problem. D refers to the label (also called class) of each \(T_{ RDD }\) data item and is defined as \(D =\{d_1, d_2, \ldots , d_W\}\). C represents the conditional attribute pool from which the most significant attributes will be selected.
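For illustration only, the following minimal Scala/Spark sketch shows one way to materialize \(T_{ RDD }\) as an RDD of (identifier, conditional feature values, decision label) records; the file path and the comma-separated layout are hypothetical assumptions and are not part of the Sp-RST specification.

```scala
import org.apache.spark.sql.SparkSession

// Materialize T_RDD as an RDD of (id, conditional feature values, decision label)
// records. Hypothetical input layout: one instance per line, "id,f1,...,fV,label".
object LoadInformationTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Sp-RST-sketch").getOrCreate()
    val tRdd = spark.sparkContext.textFile("hdfs:///data/table.csv").map { line =>
      val cols = line.split(",")
      (cols.head.toLong, cols.slice(1, cols.length - 1).toVector, cols.last)
    }
    println(s"|U| = ${tRdd.count()}, |C| = ${tRdd.first()._2.size}")
    spark.stop()
  }
}
```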
As explained in Sect. 5.1, classical RST cannot deal with a very large number of features, defined as C in the \(T_{ RDD }\) information table. Thus, to ensure the scalability of our proposed algorithm when dealing with a large number of attributes, Sp-RST first partitions the input \(T_{ RDD }\) information table (the big data set) into a set of m data blocks based on splits of the conditional feature set C, i.e., m smaller data sets with fewer features each, instead of using a single data block \(T_{ RDD }(C)\) with the unmanageable full feature set C. The key idea is to generate, from the big \(T_{ RDD }\) data set, m smaller data sets that we name \(T_{ RDD _{(i)}}\), where \(i \in \{1, \ldots , m\}\), each defined via a manageable number of features r, where \(r \ll V\) and \(r \in \{1, \ldots , V\}\). The definition of the parameter r will be further explained in what follows. We note the resulting data block as \(T_{ RDD _{(i)}}(C_r)\). This leads to the following formalization: \(T_{ RDD } = \bigcup _{i=1}^{m}T_{ RDD _{(i)}}(C_r)\), where \(r \in \{1, \ldots , V\}\). As mentioned above, r defines the number of attributes considered to build every \(T_{ RDD _{(i)}}\) data block. Based on this, every \(T_{ RDD _{(i)}}\) is built using r random attributes selected from C. Each \(T_{ RDD _{(i)}}\) is constructed from r distinct features, and no attribute is shared among the built \(T_{ RDD _{(i)}}\). This leads to the following formalization: \(\forall T_{ RDD _{(i)}}{:}\,\not \exists \{c_r\} = \bigcap _{i=1}^{m} T_{ RDD _{(i)}}\). Figure 2 presents this data partitioning phase.
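The feature-space partitioning can be sketched as follows. This is a simplified, driver-side illustration that splits the feature indices into m disjoint random blocks of roughly equal size, whereas Sp-RST may draw a different random size r for each block; the helper name is ours.

```scala
import scala.util.Random

// Split the V conditional feature indices into m disjoint random blocks so that
// each T_RDD_(i) is described by roughly r = V / m features and no feature
// appears in more than one block.
def partitionFeatures(numFeatures: Int, m: Int, seed: Long = 42L): Seq[Seq[Int]] = {
  val shuffled = new Random(seed).shuffle((0 until numFeatures).toVector)
  val blockSize = math.ceil(numFeatures.toDouble / m).toInt
  shuffled.grouped(blockSize).toSeq
}

// Example: 8 features split into m = 3 blocks, e.g. Seq(Seq(5, 2, 7), Seq(0, 3, 6), Seq(1, 4)).
val featureBlocks = partitionFeatures(numFeatures = 8, m = 3)
```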
With respect to the parallel implementation design, the distributed Sp-RST algorithm is applied to every \(T_{ RDD _{(i)}}(C_r)\), gathering the intermediate results from the m distinct created partitions, rather than being applied to the complete \(T_{ RDD }\) that encloses the whole conditional feature set C. Based on this design, we ensure that the algorithm can perform its feature selection task on a computable number of attributes and therefore overcome the standard rough set computational inefficiencies. The pseudocode of our proposed distributed Sp-RST solution is given in Algorithm 1.
To further guarantee the Sp-RST feature selection performance while avoiding any critical information loss, and to refine the selected feature set, Sp-RST runs over N iterations on the m data blocks of \(T_{ RDD }\), i.e., N iterations on all the m built \(T_{ RDD _{(i)}}(C_r)\). In each of these N iterations, Sp-RST first randomly builds the m distinct \(T_{ RDD _{(i)}}(C_r)\) as explained above. Once this is achieved, and for each partition, the algorithm’s distributed tasks defined in Algorithm 1 (lines 5–10) are performed. Note that line 1 of Algorithm 1, which defines the initial Sp-RST parallel job, is performed outside the iteration loop. This job calculates the indiscernibility relation \( IND (D)\) of the decision class D. The main reason for this implementation is that this computation is completely independent of the m created partitions: its output depends only on the labels of the data instances and not on the attribute set.
Outside the iteration loop (line 12), the outcome for each created partition is either a single reduct \( RED _{i_{(D)}}(C_r)\) or a set (a family) of reducts \( RED _{i_{(D)}}^{F}(C_r)\). As previously highlighted in Sect. 3, any reduct among the \( RED _{i_{(D)}}^{F}(C_r)\) reducts can be selected to describe the \(T_{ RDD _{(i)}}(C_r)\) information table. Therefore, in the case where Sp-RST generates a single reduct for a specific \(T_{ RDD _{(i)}}(C_r)\) partition, the final output of this attribute selection phase is the set of features defined in \( RED _{i_{(D)}}(C_r)\). These attributes represent the most informative features among the \(C_r\) features and generate a new reduced \(T_{ RDD _{(i)}}\), defined as \(T_{ RDD _{(i)}}(RED)\). This reduced base guarantees nearly the same data quality as the corresponding \(T_{ RDD _{(i)}}(C_r)\), which is based on the full attribute set \(C_r\). In the case where Sp-RST generates multiple reducts, the algorithm randomly selects a single reduct among the generated family of reducts \( RED _{i_{(D)}}^{F}(C_r)\) to describe the corresponding \(T_{ RDD _{(i)}}(C_r)\). This random selection is supported by the RST fundamentals: all the reducts defined in \( RED _{i_{(D)}}^{F}(C_r)\) have the same level of importance, so any reduct in the family can be selected to replace the \(T_{ RDD _{(i)}}(C_r)\) attributes.
At this level, the output of every data block i is \( RED _{i_{(D)}}(C_r)\), which refers to the selected set of features. Nevertheless, since every \(T_{ RDD _{(i)}}\) is described using r distinct attributes and with respect to \(T_{ RDD } = \bigcup _{i=1}^{m} T_{ RDD _{(i)}}(C_r)\), a union operator on the generated selected attributes is needed to represent the original \(T_{ RDD }\). This is defined as \( Reduct _m = \bigcup _{i=1}^{m} RED _{i_{(D)}}(C_r)\) (Algorithm 1, lines 12–14). As previously highlighted, Sp-RST performs its distributed tasks over the N iterations, generating N \( Reduct _m\) sets. Therefore, finally, an intersection operator applied to all the obtained \( Reduct _m\) is required. This is defined as \(Reduct = \bigcap _{n=1}^{N} Reduct _m\). Sp-RST thus reduces the dimensionality of the original data set from \(T_{ RDD }(C)\) to \(T_{ RDD }(Reduct)\) by removing irrelevant and redundant features at each computation level. Sp-RST can also simplify the learned model, speed up the overall learning process, and increase the performance of a subsequent algorithm, e.g., a classification algorithm, as will be discussed in the experimental setup section (Sect. 6). Figure 3 illustrates the global functioning of Sp-RST. In what follows, we elucidate the different Sp-RST elementary distributed tasks.
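This combination scheme can be summarized by the following sketch, which applies the union over the m per-partition reducts of one iteration and then the intersection over the N iterations; reducts are represented here simply as sets of feature indices, which is our own encoding for illustration.

```scala
// Combine the per-partition reducts of one iteration into Reduct_m (union over
// the m feature blocks), then intersect the N per-iteration results into the
// final Reduct. Features are represented by their column indices.
def combineReducts(reductsPerIteration: Seq[Seq[Set[Int]]]): Set[Int] = {
  val reductM: Seq[Set[Int]] = reductsPerIteration.map(_.reduce(_ union _)) // one Reduct_m per iteration
  reductM.reduce(_ intersect _)                                             // Reduct = intersection over N iterations
}

// Example with N = 2 iterations and m = 2 partitions:
// iteration 1 selects {0, 2} and {5}; iteration 2 selects {2, 3} and {5, 6}.
val reduct = combineReducts(Seq(Seq(Set(0, 2), Set(5)), Seq(Set(2, 3), Set(5, 6))))
// reduct == Set(2, 5)
```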
Algorithmic details
As previously highlighted, the elementary Sp-RST distributed tasks will be executed on every \(T_{ RDD _{(i)}}\) partition defined by its \(C_r\) features (\(T_{ RDD _{(i)}}(C_r)\)), except for the first step, Algorithm 1—line 1, which deals with the calculation of the indiscernibility relation for the decision class D: \( IND (D)\). Sp-RST performs seven main distributed jobs to generate the final output, i.e., Reduct.
Sp-RST starts by computing the indiscernibility relation for the decision class \(D = \{d_1, d_2, \ldots , d_W\}\). We define the indiscernibility relation as \( IND (D)\): \( IND (d_i)\), where \(i \in \{1, 2, \ldots , W\}\). Sp-RST calculates \( IND (D)\) for each decision class \(d_i\) by grouping the \(T_{ RDD }\) data items (instances), expressed in the universe \(U = \{x_1, \ldots , x_N\}\), that belong to the same decision class \(d_i\).
To achieve this task, Sp-RST processes a first map transformation operation taking the data in its format (\(id_i\) of \(x_i\), List of the features of \(x_i\), Class \(d_i\) of \(x_i\)) and transforming it into a \(\langle key, value \rangle \) pair: \(\langle \)Class \(d_i\) of \(x_i\), List of \(id_i\) of \(x_i\rangle \). Based on this transformation, the decision class \(d_i\) defines the key of the generated output and the data item identifiers \(id_i\) of \(x_i\) of the \(T_{ RDD }\) define the values. After that, the foldByKey() transformation operation is applied to merge all values of each key in the transformed RDD output. This represents the sought \( IND (D)\): \( IND (d_i)\). The pseudo-code of this distributed job is given in Algorithm 2.
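A possible Spark realization of this first job is sketched below, assuming the (id, features, label) record layout introduced earlier; each record is mapped to a \(\langle \)class, set of identifiers\(\rangle \) pair and foldByKey() merges the per-class sets.

```scala
import org.apache.spark.rdd.RDD

// IND(D): one equivalence class of instance identifiers per decision value.
// Each record (id, features, label) is mapped to <label, Set(id)>, and
// foldByKey() merges all identifier sets sharing the same label.
def indOfDecision(tRdd: RDD[(Long, Vector[String], String)]): RDD[(String, Set[Long])] =
  tRdd.map { case (id, _, label) => (label, Set(id)) }
      .foldByKey(Set.empty[Long])(_ ++ _)
```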
After that, and within a specific partition i, where \(i \in \{1, 2, \ldots , m\}\) and m is the number of partitions, the algorithm generates the \( AllComb _{(C_r)}\) RDD, which holds all the possible combinations of the \(C_r\) set of attributes. This is based on transforming the \(C_r\) RDD into the \( AllComb _{(C_r)}\) RDD using the flatMap() transformation operation combined with the combinations() operation. This is shown in Algorithm 3.
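The following sketch illustrates this second job; it parallelizes over the combination sizes rather than over the raw \(C_r\) RDD, which is one possible arrangement under our own assumptions and not necessarily the exact one used by Sp-RST.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// AllComb_(C_r): every non-empty combination of the r features of one partition.
// The combination sizes 1..r are parallelized and flatMapped into all
// k-element combinations of the block's feature indices.
def allCombinations(sc: SparkContext, featureIdx: Seq[Int]): RDD[Seq[Int]] =
  sc.parallelize(1 to featureIdx.size)
    .flatMap(k => featureIdx.combinations(k))

// For featureIdx = Seq(3, 7) this yields Seq(3), Seq(7) and Seq(3, 7).
```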
In its third distributed job, Sp-RST calculates the indiscernibility relation \( IND ( AllComb _{(C_r)})\) for every created combination, i.e., the indiscernibility relation of every element in the output of Algorithm 3, which we name \( AllComb _{{(C_r)}_i}\). In this task, described in Algorithm 4, the algorithm collects all the identifiers \(id_i\) of the data items \(x_i\) that have identical values on the combination of attributes extracted from \( AllComb _{(C_r)}\). To do so, a first map operation is applied, taking the data in its format (\(id_i\) of \(x_i\), List of the features of \(x_i\), Class \(d_i\) of \(x_i\)) and transforming it into a \(\langle key, value \rangle \) pair: \(\langle ( AllComb _{{(C_r)}_i}\), List of the features of \(x_i)\), List of \(id_i\) of \(x_i\rangle \). Based on this transformation, the combination of features together with its vector of feature values defines the key, and the identifiers \(id_i\) of the data items \(x_i\) define the value. After that, the foldByKey() operation is applied to merge all values of each key in the transformed RDD output, i.e., all the identifiers \(id_i\) of the data items \(x_i\) that share the same combination of features with the same corresponding vector of values \(( AllComb _{{(C_r)}_i}\), List of the features of \(x_i)\). This represents the sought \( IND ( AllComb _{(C_r)})\). Through this third step, Sp-RST prepares the feature sets from which the selection will be made in the coming steps.
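A sketch of this third job is given below. It assumes the combination list produced by the previous step has been collected (or broadcast) to the executors, and keys each record by (combination, projected feature values) before merging the identifiers with foldByKey().

```scala
import org.apache.spark.rdd.RDD

// IND(AllComb_(C_r)): for every attribute combination, group the identifiers of
// the instances that share identical values on exactly those attributes.
// The key is (combination, projected feature values); foldByKey() merges the ids.
def indOfCombinations(
    tRdd: RDD[(Long, Vector[String], String)],
    allComb: Seq[Seq[Int]]): RDD[((Seq[Int], Seq[String]), Set[Long])] =
  tRdd.flatMap { case (id, features, _) =>
        allComb.map(comb => ((comb, comb.map(features)), Set(id)))
      }
      .foldByKey(Set.empty[Long])(_ ++ _)
```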
In the next stage, Sp-RST computes the dependency degrees \(\gamma ( AllComb _{(C_r)})\) of each attribute combination, as described in Algorithm 5. For this task, the distributed job requires three input parameters: the calculated indiscernibility relations \( IND (D)\) and \( IND ( AllComb _{(C_r)})\), and the set of all attribute combinations \( AllComb _{(C_r)}\).
For every element \( AllComb _{{(C_r)}_i}\) in \( AllComb _{(C_r)}\), and using the intersection() transformation, the job first tests whether the intersection of each \( IND (d_i)\) of \( IND (D)\) with the corresponding element \( IND ( AllComb _{{(C_r)}_i})\) of \( IND ( AllComb _{(C_r)})\) contains all the elements of the latter, i.e., whether the equivalence class is entirely included in the decision class. This process refers to the calculation of the lower approximation, as detailed in Sect. 3. We name the length of the resulting intersection LengthIntersect. If the condition is satisfied, a score equal to LengthIntersect is assigned; otherwise, a value of 0 is assigned.
After that, a reduce function is applied over the different \( IND (D)\) elements, together with a sum() function applied to the calculated scores of the elements sharing the same \( IND (d_i)\). This operation is followed by a second reduce function applied over the different \( IND ( AllComb _{(C_r)})\) elements, together with a sum() function applied to the previously calculated results of the elements sharing the same \( AllComb _{{(C_r)}_i}\).
The latter output refers to the dependency degrees \(\gamma ( AllComb _{(C_r)})\). This distributed job generates two outputs, namely the set of dependency degrees \(\gamma ( AllComb _{(C_r)})\) of the attribute combinations \( AllComb _{(C_r)}\) and their associated sizes \( Size _{( AllComb _{(C_r)})}\).
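For clarity, the per-combination logic of this job can be written as plain (non-distributed) Scala as follows; the distributed version performs the same containment test with intersection() followed by the two reduce/sum stages described above.

```scala
// Dependency degree gamma(B) of one attribute combination B: sum the sizes of
// the equivalence classes of IND(B) that are entirely contained in one decision
// class of IND(D), i.e., the cardinality of the positive region POS_B(D).
def dependencyDegree(indB: Seq[Set[Long]], indD: Seq[Set[Long]]): Int =
  indB.map { eqClass =>
    val contained = indD.exists(d => (eqClass intersect d) == eqClass)
    if (contained) eqClass.size else 0   // LengthIntersect if included, else 0
  }.sum

// Example: indB = [{x0}, {x1,x2}, {x3,x4,x5}] and indD = [{x0,x1,x4,x5}, {x2,x3}]
// give 1 + 0 + 0 = 1.
```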
Once all the dependencies are calculated, in Algorithm 6, Sp-RST looks for the maximum dependency value among all the computed \(\gamma ( AllComb _{(C_r)})\) using the max() function applied to the given RDD input, referred to as RDD[\( AllComb _{(C_r)}\), \( Size _{( AllComb _{(C_r)})}\), \(\gamma ( AllComb _{(C_r)})\)]. Specifically, the max() function is applied to the third argument of the given RDD, i.e., \(\gamma ( AllComb _{(C_r)})\).
Let us recall that based on the RST preliminaries (seen in Sect. 3), the maximum dependency refers to not only the dependency of the whole attribute set \((C_r)\) describing the \(T_{ RDD _i}(C_r)\) but also to the dependency of all the possible attribute combinations satisfying the following constraint: \(\gamma ( AllComb _{(C_r)})= \gamma (C_r)\). The maximum dependency MaxDependency reflects the baseline value for the feature selection task.
In a next step, Sp-RST performs a filtering process using the filter() function to keep only the combinations whose dependency degree equals the previously selected baseline value (MaxDependency), i.e., \(\gamma ( AllComb _{(C_r)}) = MaxDependency\). This is described in Algorithm 7. Through these computations, the algorithm removes at each level the unnecessary attributes that may negatively influence the performance of any learning algorithm.
At a final stage, and using the results generated from the previous step as the input of Algorithm 8, Sp-RST first applies the min() operator to find the minimum number of features among all the \( Size _{( AllComb _{(C_r)})}\); specifically, the min() operator is applied to the second argument of the given RDD. Once this value, which we name minNbF, is determined, the algorithm applies a filter() method to keep only the combinations having exactly minNbF features. These combinations satisfy the reduct constraints highlighted in Sect. 3: \( \gamma ( AllComb _{(C_r)}) = \gamma (C_r)\) while there is no \( AllComb _{(C_r)}^{'} \subset AllComb _{(C_r)}\) such that \(\gamma ( AllComb _{(C_r)}^{'}) = \gamma ( AllComb _{(C_r)})\). Every combination that satisfies these constraints is a candidate minimal reduct. The features defining the reduct set describe all concepts in the initial \(T_{ RDD _i}(C_r)\) training data set.
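Algorithms 6–8 can be summarized by the following sketch, which chains the max(), filter(), min() and filter() operations over records of the form (combination, \( Size _{( AllComb _{(C_r)})}\), \(\gamma ( AllComb _{(C_r)})\)); the function name and record encoding are ours and serve only as an illustration.

```scala
import org.apache.spark.rdd.RDD

// Reduct selection (Algorithms 6-8 in one pass): keep the combinations whose
// dependency degree equals MaxDependency, then, among those, the combinations
// with the smallest number of attributes (minNbF). Each survivor is a
// candidate minimal reduct for the current partition.
// Input records are (combination, Size, gamma).
def selectReducts(scored: RDD[(Seq[Int], Int, Int)]): Array[Seq[Int]] = {
  val maxDependency = scored.max()(Ordering.by(_._3))._3      // Algorithm 6
  val best          = scored.filter(_._3 == maxDependency)    // Algorithm 7
  val minNbF        = best.min()(Ordering.by(_._2))._2        // Algorithm 8, min()
  best.filter(_._2 == minNbF).map(_._1).collect()             // Algorithm 8, filter()
}
```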
Sp-RST: a working example
We apply Sp-RST to an example of an information table, \(T_{ RDD }(C)\), presented in Table 1. Assuming that the considered \(T_{ RDD }(C)\) is a big data set, the information table is defined via a universe \(U = \{x_0, x_1, \ldots , x_5\}\), which refers to the set of data instances (items), a conditional feature set C = {Headache, Muscle-pain, Temperature} that includes all the features of the \(T_{ RDD }(C)\) information table and, finally, a decision feature Flu of the given learning problem. Flu refers to the label (or class) of each \(T_{ RDD }(C)\) data item and is defined as \(Flu =\{yes, no\}\). C represents the conditional attribute pool from which the most significant attributes will be selected.
Independently from the set of conditional features C, Sp-RST starts by computing the indiscernibility relation for the decision class Flu. We define the indiscernibility relation as \( IND (Flu)\): \( IND (Flu_i)\). Sp-RST calculates \( IND (Flu)\) for each decision class \(Flu_i\) by grouping the \(T_{ RDD }(C)\) data items (instances) expressed in the universe U that belong to the same decision class \(Flu_i\). Based on the Apache Spark framework and by applying Algorithm 2, line 1, we get the following outputs from the different Apache Spark data splits, which are presented in Tables 2 and 3:
- From Split 1:
  - \(\langle \)yes, \(x_0\rangle \)
  - \(\langle \)yes, \(x_1\rangle \)
  - \(\langle \)no, \(x_2\rangle \)
- From Split 2:
  - \(\langle \)no, \(x_3\rangle \)
  - \(\langle \)yes, \(x_4\rangle \)
  - \(\langle \)yes, \(x_5\rangle \)
Table 2 Toy data set—split 1
Table 3 Toy data set—split 2
After that, and by applying Algorithm 2, line 2, we get the following output which refers to the indiscernibility relation of the class \( IND (Flu)\):
- \(yes, \{x_0, x_1, x_4, x_5\} \)
- \(no, \{x_2, x_3\}\)
In this example, we assume that we have two partitions, \(m = 2\). For the first partition, \(m = 1\), a random number \(r = 2\) is selected to build \(T_{ RDD _{i=1}}(C_r)\). For the second partition, \(m = 2\), a random number \(r = 1\) is selected to build \(T_{ RDD _{i=2}}(C_r)\). Based on these assumptions, the following partitions and splits based on Apache Spark are obtained (Tables 4, 5, 6, 7).
Table 4 Partition \(m = 1\)—split 1
Table 5 Partition \(m = 1\)—split 2
Table 6 Partition \(m = 2\)—split 1
Table 7 Partition \(m = 2\)—split 2
Based on the first partition \(m = 1\), and by applying Algorithm 3, which aims to generate all the \( AllComb _{(C_r)}\) possible combinations of the \(C_r\) set of attributes, the output from both Apache Spark splits is the following:
- Muscle-pain
- Temperature
- Muscle-pain, Temperature
In its third distributed job, Sp-RST calculates the indiscernibility relation \( IND ( AllComb _{(C_r)})\) for every created combination, i.e., the indiscernibility relation of every element in the output of the previous step (Algorithm 3). By applying Algorithm 4 and based on both Apache Spark splits, the output is the following:
- From \(m = 1\)—Split 1:
  - Muscle-pain, \(\{x_0\}, \{x_1, x_2\}\)
  - Temperature, \(\{x_0\}, \{x_1, x_2\}\)
  - Muscle-pain, Temperature, \(\{x_0\}, \{x_1, x_2\} \)
- From \(m = 1\)—Split 2:
  - Muscle-pain, \(\{x_3, x_4, x_5\}\)
  - Temperature, \(\{x_3\}, \{x_4\}, \{x_5\}\)
  - Muscle-pain, Temperature, \(\{x_3\}, \{x_4\}, \{x_5\}\)
In the next stage, and by using the previous output as well as \( IND (Flu)\), Sp-RST computes the dependency degrees \(\gamma ( AllComb _{(C_r)})\) of each attribute combination, as described in Algorithm 5. This distributed job generates two outputs, namely the set of dependency degrees \(\gamma ( AllComb _{(C_r)})\) of the attribute combinations \( AllComb _{(C_r)}\) and their associated sizes \( Size _{( AllComb _{(C_r)})}\). The output from both splits for \(m = 1\) is the following:
Once all the dependencies are calculated, in Algorithm 6, Sp-RST looks for the maximum value of the dependency among all the computed \(\gamma ( AllComb _{(C_r)})\). The maximum dependency reflects the baseline value for the feature selection task. The output is the following:
In a next step, Sp-RST performs a filtering process to keep only the combinations whose dependency degree equals the previously selected baseline value (\(MaxDependency = 4\)), i.e., \(\gamma ( AllComb _{(C_r)}) = MaxDependency = 4\). By applying Algorithm 7, the following output is obtained:
Through these computations, the algorithm removes at each level the unnecessary attributes that may negatively influence the performance of any learning algorithm.
At a final stage, using the results generated from the previous step and applying Algorithm 8, Sp-RST looks for the minimum number of features among all the \( Size _{( AllComb _{(C_r)})}\). Once determined (\(minNbF = 1\)), the algorithm keeps only the combinations having exactly minNbF features. The filtered selected features define the reduct set and describe all concepts in the initial \(T_{ RDD _i}(C_r)\) training data set. The output of Algorithm 8, which presents the Reduct for \(m = 1\), is the following:
Based on these calculations, for \(m = 1\), Sp-RST reduced \(T_{ RDD _{i=1}}(C_{r=2})\) to \( Reduct _{m=1} = \{Temperature\}\).
The same calculations are applied to \(m = 2\), and the output is \( Reduct _{m=2} = \{Headache\}\) (as this partition is composed of a single feature).
At this stage, different reducts are generated from the different m partitions. With respect to Algorithm 1, lines 12–14, a union of the obtained results is required to represent the initial big information table \(T_{ RDD }(C)\), i.e., Table 1. The final output is \(Reduct = \{Headache, Temperature\}\).
In this example, we presented a single iteration of Sp-RST, i.e., \(N = 1\). Therefore, line 16 of Algorithm 1 is not covered in this example.
Sp-RST could reduce the big information table presented in Table 1 from \(T_{ RDD }(C)\) to \(T_{ RDD }(Reduct)\). The output is presented in Table 8.