
Hashing Supported Iterative MapReduce Based Scalable SBE Reduct Computation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10722)

Abstract

Feature selection plays a major role in the preprocessing stage of data mining and helps in model construction by recognizing relevant features. Rough Sets has emerged in recent years as an important paradigm for feature selection, i.e., finding a Reduct of the conditional attributes in a given data set. The two control strategies for Reduct computation are Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE). With the objective of scalable feature selection, several MapReduce based approaches have been proposed in the literature. All of these approaches are SFS based and result in a superset of a Reduct, i.e., one containing redundant attributes. Although SBE approaches yield an exact Reduct, they require a large amount of data movement in the shuffle and sort phase of MapReduce. To overcome this problem and to optimize network bandwidth utilization, a novel hashing supported SBE Reduct algorithm (MRSBER_Hash) is proposed in this work and implemented using the iterative MapReduce framework of Apache Spark. Experiments conducted on large benchmark decision systems have empirically established the relevance of the proposed approach for decision systems with a large number of conditional attributes.

Keywords

Rough Sets · Reduct · Iterative MapReduce · Apache Spark · Scalable feature selection

1 Introduction

The field of Rough Sets [5] was introduced in the 1980s by Prof. Pawlak as a soft computing paradigm for data analysis amidst vagueness and uncertainty. Reduct computation (feature subset selection) is an important application of Rough Sets in data mining. The two primary control strategies for Reduct computation are Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE). SFS approaches, though computationally efficient, have the disadvantage of producing a superset of a Reduct. An SBE algorithm always results in an exact Reduct without any redundancy.

Standalone Reduct computation approaches suffer from scalability issues with large decision systems. For scalable Reduct computation, several MapReduce based distributed/parallel approaches [6, 8, 9] for SFS based Reduct computation have been developed in the literature using frameworks such as Hadoop [10], Twister [4], and Apache Spark [11].

In our literature review we have not found any MapReduce based SBE implementations. In this work we identify the problems in MapReduce based SBE Reduct computation and design and develop an approach, called MRSBER_Hash, to overcome them. The developed approach facilitates exact Reduct computation for very large scale decision systems. The proposed approach can also be utilized as a post-processing optimization step in SFS based MapReduce algorithms for deriving an exact Reduct from the super Reduct obtained.

2 SBE Based Reduct Computation

The basics of Rough Sets and approaches for Reduct computation are given in [6]. Classical Rough Sets are applied to a complete symbolic decision system DT defined as [7]
$$\begin{aligned} DT = (U, C\cup \{d\}, \{V_a, f_a\}_{a \in C \cup \{d\}}) \end{aligned}$$
(1)
where U is the set of objects, d is the decision attribute, C is the set of conditional attributes, and for each \(a \in C \cup \{d\}\), \(V_{a}\) is the domain of a and \(f_{a}: U \rightarrow V_{a}\) is the value mapping for attribute a.
A Reduct is a subset of features that are individually necessary and jointly sufficient to maintain a heuristic dependency measure of the decision system. If M denotes the heuristic dependency measure, a Reduct R is a minimal subset of the conditional attribute set C such that M(R) = M(C). The structure of SBE Reduct computation is given in Algorithm 1.

The SBE algorithm starts by initializing the Reduct R to the full set of conditional attributes and involves |C| iterations. In each iteration, a conditional attribute a is tested for redundancy based on the given dependency measure M. An attribute \('a'\) is said to be redundant if \(M(R-\{a\})\) is equal to M(R). If an attribute is found to be redundant, it is dropped from R; otherwise it is retained. After |C| iterations, R contains an irreducible set of attributes satisfying \(M(R)=M(C)\). Hence the SBE approach always results in an exact Reduct. Sorting the attributes used for the redundancy check from the least individually significant to the most significant helps in retaining more potentially useful attributes in the obtained Reduct. A minimal sketch of this control structure is given below.
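The following is a minimal, non-distributed sketch of the SBE control structure described above, assuming an abstract measure function M over attribute index sets and an attribute ordering supplied by the caller; the exact details of Algorithm 1 in the paper may differ.

```scala
// Minimal sketch of the SBE control strategy, assuming a dependency
// measure M: Set[Int] => Double over conditional attribute indices.
def sbeReduct(orderedAttrs: Seq[Int], M: Set[Int] => Double): Set[Int] = {
  val target = M(orderedAttrs.toSet)       // M(C), the value to be preserved
  var reduct = orderedAttrs.toSet          // start from the full attribute set
  for (a <- orderedAttrs) {                // least to most significant attribute
    if (M(reduct - a) == target) {         // 'a' is redundant w.r.t. M
      reduct = reduct - a                  // drop it; otherwise retain it
    }                                      // (a tolerance may replace == in practice)
  }
  reduct                                   // irreducible set with M(R) = M(C)
}
```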

In the proposed approach, conditional information entropy (CIE) is used as the dependency measure. The CIE of \(B\subseteq C\) with respect to the decision attribute {d} is defined as
$$\begin{aligned} E(\{d\}/B) = -\sum _{g \in U/IND(B)}P(g) \sum _{g' \in g/IND(\{d\})}P(g') \log _2(P(g')) \end{aligned}$$
(2)
where \(P(g)=\frac{|g|}{|U|}\) and \(P(g')=\frac{|g'|}{|g|}\). Here IND(B) denotes the rough set based indiscernibility relation defined as
$$\begin{aligned} IND(B) = \{(x,y)\in U^{2} \mid f_{a}(x)=f_{a}(y), \forall a \in B\} \end{aligned}$$
(3)
IND(B) is an equivalence relation, and the partition of U induced by IND(B) is denoted by U/IND(B), which is a collection of distinct equivalence classes or granules. The equivalence class of \(x \in U\) is denoted by \([x]_{B}\). An equivalence class is said to be a consistent granule if all of its objects belong to the same decision class; otherwise it is said to be inconsistent.
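For illustration, the sketch below partitions a small in-memory decision table into the granules of IND(B) and flags each granule's consistency; the data layout (an array of conditional values plus a decision value per object) is an assumption of this sketch, not taken from the paper.

```scala
// Illustrative sketch: granules of IND(B) and their consistency for an
// in-memory decision table. A row is (conditional values, decision value).
type Row = (Array[Int], Int)

def granules(table: Seq[Row], B: Seq[Int]): Map[Seq[Int], Seq[Row]] =
  table.groupBy { case (vals, _) => B.map(vals(_)) }  // key = granule signature on B

def isConsistent(granule: Seq[Row]): Boolean =
  granule.map(_._2).distinct.size == 1                // single decision class

// Example: U/IND(B) with B = {0, 1}
val dt: Seq[Row] = Seq((Array(1, 0, 2), 0), (Array(1, 0, 1), 0), (Array(2, 1, 1), 1))
granules(dt, Seq(0, 1)).foreach { case (sig, g) =>
  println(s"granule $sig -> consistent = ${isConsistent(g)}")
}
```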

3 Proposed MRSBER_Hash Algorithm

The proposed iterative MapReduce based SBE Reduct algorithm is designed with the objective of preserving scalability. MapReduce based computation of M(B) usually follows a common pattern [8, 9]: computing M(B) requires computing summary information for the granules of the quotient set U/IND(B) and then arriving at M(B) from this summary information. In the MapReduce approach, a \(<key,value>\) pair is formed for each object by setting the key to the granule signature and the value to the information required for computing M. The granule signature is the combination of domain values over B satisfied by all the objects of the granule. In the SBE Reduct algorithm, especially in the initial iterations, |B| is nearly equal to |C|. Owing to the curse of dimensionality, for decision systems with a large number of attributes, the number of granules |U/IND(B)| is of the order of |U|. In decision systems where |U/IND(C)| is close to |U|, the size of the data communicated from mappers to reducers is of the order of the original dataset. This can become a bottleneck and hamper the scalability of a MapReduce based SBE Reduct algorithm.

Hence, the proposed approach is designed with the objective of reducing the amount of data transferred in the shuffle and sort phase of MapReduce. Algorithm MRSBER_Hash is a distributed/parallel approach for SBE Reduct computation using an iterative MapReduce framework. Initially, a ranking of the attributes in decreasing order of CIE is obtained using a single MapReduce job. The procedure for computing the CIE of each attribute using MapReduce is adopted from [8].
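A possible Spark rendering of this per-attribute CIE computation is sketched below; it follows the counting form of Eq. (2) under the same assumed data layout (conditional values array plus decision value per object) and is not the exact implementation of [8].

```scala
import org.apache.spark.rdd.RDD

// Sketch: conditional information entropy E({d}/{a}) of a single attribute a,
// computed from counts of (value of a, decision class) pairs.
def cieForAttribute(data: RDD[(Array[Int], Int)], a: Int, nObjects: Long): Double = {
  val counts = data.map { case (row, d) => ((row(a), d), 1L) }.reduceByKey(_ + _)
  val perGranule = counts.map { case ((v, _), c) => (v, c) }.groupByKey()
  perGranule.map { case (_, cs) =>
    val gSize = cs.sum.toDouble                     // |g|
    val pg = gSize / nObjects                       // P(g)  = |g|  / |U|
    -pg * cs.map { c =>
      val p = c / gSize                             // P(g') = |g'| / |g|
      p * math.log(p) / math.log(2)
    }.sum
  }.sum()
}
// Attributes are then ranked in decreasing order of their CIE before the SBE sweep.
```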

Datasets with a large number of attributes typically contain very few inconsistent granules, and removing them makes the decision system consistent without a significant impact on the resulting Reduct. Hence, in the MRSBER_Hash algorithm another MapReduce job is invoked to extract the objects in inconsistent granules and remove them, forming a consistent decision system (CDS), as sketched below.
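A possible Spark formulation of this CDS construction step is sketched here: objects are keyed by their full conditional signature, signatures associated with more than one decision value are collected, and the corresponding objects are filtered out. The function name and data layout are illustrative assumptions.

```scala
import org.apache.spark.rdd.RDD

// Sketch: drop objects belonging to inconsistent granules of U/IND(C)
// to obtain a consistent decision system (CDS).
def toCDS(data: RDD[(Array[Int], Int)]): RDD[(Array[Int], Int)] = {
  val keyed = data.map { case (row, d) => (row.toSeq, d) }   // signature over all of C
  val inconsistentSigs = keyed
    .aggregateByKey(Set.empty[Int])(_ + _, _ ++ _)           // decision values per granule
    .filter { case (_, ds) => ds.size > 1 }                  // multiple decisions => inconsistent
    .keys
    .collect()                                               // assumed small: few inconsistent granules
    .toSet
  data.filter { case (row, _) => !inconsistentSigs.contains(row.toSeq) }
}
```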

SBE Reduct computation on the CDS is simplified: since the granules of IND(C) are all consistent, the redundancy check of an attribute a does not require the exact computation of \(M(R-\{a\})\) (refer to Algorithm 1) but only requires checking whether any inconsistent granule is present in \(U/IND(R-\{a\})\). This facilitates an optimization in the amount of data transferred as the value portion in MapReduce. The rest of this section explains how a two-stage process is developed for inconsistency verification. The SBE Reduct is obtained by following this two-stage inconsistency verification for each attribute in the rank order of CIE.

Let the given decision system DT = \((U,C\cup \{d\}\), \(\{V_a,f_a\}_{a \in C \cup \{d\}} )\) be horizontally partitioned into \(DT_{1},DT_{2},\ldots ,DT_{n}\), where \(DT_{i}\) = \((U_{i},C\cup \{d\}\), \(\{V_a,f_a\}_{a \in C \cup \{d\}} )\) for 1 \(\le \) i \(\le \) n and \(\{U_{i}\}\) forms a partition of U. The n mappers working in parallel on the \(DT_{i}\)'s produce \(<key,value>\) pairs for each data object of their portions. At the mapper level, a local optimization (as in reduceByKey of Spark's MapReduce) helps in arriving at \(<key,value>\) pairs for the partial granules of \(U_{i}/IND(B)\). Each reduce invocation works with the list of values from all mappers for the same key (granule signature) to arrive at the required computation for the corresponding granule of U/IND(B). In the normal approach to inconsistency verification, the \(<key,value>\) pair generated at the mapper level has the granule signature as key and \(f_{d}(o)\) as value. At the reducer level, a granule is found to be inconsistent if multiple decision values are associated with the same granule signature, as in the baseline sketch below.
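A baseline Spark rendering of this normal approach might look as follows; note that every shuffled key is a full |B|-sized signature, which is precisely the cost the hashing scheme below avoids. The data layout is assumed as before.

```scala
import org.apache.spark.rdd.RDD

// Baseline sketch: inconsistency check with the full granule signature as key.
// reduceByKey combines values map-side (the "local optimization" above) before
// the shuffle, but every shuffled key is still a |B|-sized signature.
def hasInconsistency(data: RDD[(Array[Int], Int)], B: Array[Int]): Boolean = {
  data
    .map { case (row, d) => (B.map(row(_)).toSeq, Set(d)) } // key = granule signature on B
    .reduceByKey(_ ++ _)                                    // decision values per granule
    .filter { case (_, ds) => ds.size > 1 }                 // multiple decisions => inconsistent
    .count() > 0
}
```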

In order to avoid communicating keys as granule signatures, the above process is divided into two stages. In stage-1, a hash function (HF) is employed and the key is set to the hashed value of the granule signature instead of the signature itself. This allows a single number, instead of |B| numbers, to be passed to the reducers. Across all mappers, the memory needed for keys in the shuffle and sort phase is reduced from the order of \(|U|*|B|\) to the order of |U| (in the situation where \(|U/IND(B)|\approx |U|\)). In practice, the hash mapping must be assumed to be many-to-one. As a result, a reduce invocation works with a coalesced granule \(g^{*}\) representing possibly multiple granules whose signatures are mapped to the same hash value. If \(g^{*}\) is found to be consistent, then all the granules coalesced into \(g^{*}\) are also consistent. However, the inconsistency of \(g^{*}\) does not automatically imply the inconsistency of the comprised granules.
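Stage-1 can be sketched as follows: the shuffled key is a single hash of the granule signature, and only the hashes of coalesced granules carrying multiple decision values are kept as candidates for stage-2. For the flat integer signatures assumed here, java.util.Arrays.hashCode is used; the paper's choice, java.util.Arrays.deepHashCode, is the analogous function for nested object arrays.

```scala
import org.apache.spark.rdd.RDD

// Stage-1 sketch: key = hash of the granule signature on B, value = decision value.
// A reduce invocation sees a coalesced granule g*; if g* carries several decision
// values it is *potentially* inconsistent and is handed to stage-2.
def stage1SuspectHashes(data: RDD[(Array[Int], Int)], B: Array[Int]): Array[Int] = {
  data
    .map { case (row, d) =>
      val sig = B.map(row(_))                     // granule signature on B
      (java.util.Arrays.hashCode(sig), Set(d))    // a single number as key
    }
    .reduceByKey(_ ++ _)                          // decision values per coalesced granule
    .filter { case (_, ds) => ds.size > 1 }       // possibly inconsistent
    .keys
    .collect()
}
```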

In stage-2, the keys (hashed values) of the inconsistent coalesced granules resulting from stage-1 are broadcast to the mappers. Each mapper generates \(<key,value>\) pairs only for those objects whose hashed value is present among the inconsistent hash values. Here the key is set to the granule signature and the value to the decision value. As in the normal procedure, the reduce invocation determines the occurrence of inconsistency. The objective of the two-stage process is realized when the number of objects participating in \(<key,value>\) generation in stage-2 is much smaller than |U|. In our exploration of different hash functions, the deepHashCode function available in the java.util.Arrays class was found to be a suitable choice and is employed in our experiments.
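Stage-2 can then be sketched as a broadcast-and-filter pass: only objects whose hashed signature appears among the stage-1 candidates emit a pair keyed by the full granule signature, and genuine inconsistency is decided on those. In the SBE sweep, attribute a is retained if this check reports inconsistency for \(R-\{a\}\) and dropped otherwise. The SparkContext handle and data layout are assumptions of this sketch.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stage-2 sketch: verify inconsistency only for objects falling into the
// coalesced granules flagged by stage-1.
def stage2HasInconsistency(sc: SparkContext,
                           data: RDD[(Array[Int], Int)],
                           B: Array[Int],
                           suspectHashes: Array[Int]): Boolean = {
  val suspects = sc.broadcast(suspectHashes.toSet)
  data
    .map { case (row, d) => (B.map(row(_)), d) }
    .filter { case (sig, _) => suspects.value.contains(java.util.Arrays.hashCode(sig)) }
    .map { case (sig, d) => (sig.toSeq, Set(d)) }   // key = true granule signature
    .reduceByKey(_ ++ _)
    .filter { case (_, ds) => ds.size > 1 }         // a genuinely inconsistent granule
    .count() > 0                                    // could be short-circuited with take(1)
}
```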

3.1 Relevance of Two Stage Hash Based Approach

To illustrate the benefits obtained with the two-stage approach, Table 1 summarizes the size of the key space (cardinality of granules * size of key) in stage-1 and stage-2 of one iteration for the datasets used in our experiments. The size of the key space that would result if the two-stage process were not followed and the granule signature were taken as the key is also reported, in the Normal SBE column of Table 1. The results are significant: the number of keys in the shuffle and sort phase of stage-2 of MRSBER_Hash is very small, and most of the data transfer occurs in stage-1, whose keys are singleton hash numbers instead of granule signatures of order |C| as in Normal SBE.
Table 1. Size of key space

Dataset    Normal SBE    MRSBER_Hash stage-1    MRSBER_Hash stage-2
KDD        57822*40      54575*1                6*40
Gisette    6000*5000     5999*1                 2*5000
Table 2. Description of datasets

Dataset     Objects    Features    Classes    Consistency
Gisette     6000       5001        2          Yes
kddcup99    4898431    41          23         No

4 Experiments and Analysis of Results

The details of the datasets used in the experiments are given in Table 2. Gisette is from the UCI repository [3] and KDDCUP99 [2] is from the UCI KDD Archive. Experiments were conducted on the Baadal [1] cloud computing infrastructure, an initiative of the Ministry of Human Resource Development, Government of India; Baadal is developed and supported by IIT Delhi. A five-node cluster environment was obtained on Baadal, with each node having the following hardware and software configuration: 8 CPU cores and 8 GB of RAM, running Ubuntu 14.04 Desktop amd64, Java 1.7.0_131, Scala 2.10.4, sbt 0.13.8, and Apache Spark 1.6.2. Of these 5 nodes, one was set as the master and the other 4 as slaves.

4.1 Comparative Experiment with SFS MapReduce Approaches

The proposed algorithm (MRSBER_Hash) is compared with several SFS Reduct computation approaches from the literature that are implemented using the iterative MapReduce paradigm [8, 9, 10, 12]. Algorithms PLAR_PR, PLAR_LCE, PLAR_SCE, and PLAR_CCE [12] are iterative MapReduce based SFS Reduct computation algorithms using the dependency measures PR (gamma measure) and various conditional information entropy measures (LCE, SCE, CCE). These algorithms were proposed by Junbo Zhang et al. in 2016, incorporating granular computing based initialization, model parallelism, and data parallelism, and were implemented in Apache Spark.

Algorithm IN_MRQRA_IG, given by Praveen Kumar Singh et al. [9] in 2015, is an iterative MapReduce based distributed algorithm for QRA_IG [6], implemented in Indiana University's Twister environment. Balu et al. in 2016 gave MRIQRA_IG [8] as an improvement over IN_MRQRA_IG, a distributed implementation of the IQRA_IG algorithm using the Twister framework. The PAR algorithm was given by Yong Yang et al. [10] in 2010 using the Hadoop MapReduce framework. The cluster configuration involved in each of these implementations is different; the details are summarized in Table 3. The results reported for these algorithms are as given in the respective publications.
Table 3. Cluster configuration

                 PLAR            IN_MRQRA_IG     MRIQRA_IG    PAR
Cluster size     19              4               6            11
RAM size         At least 8 GB   4 GB            4 GB         *
Cores            At least 8      4               4            *
OS               CentOS 6.5      OpenSuse 12.2   OpenSuse     *
Software         Spark           Twister         Twister      Hadoop

Table 4. Comparison with KDDCUP99 dataset

Algorithm       Computation time (sec)    Reduct length
MRSBER_Hash     184                       25
PLAR-PR         8                         *
PLAR-LCE        8                         *
PLAR-SCE        8                         *
PLAR-CCE        8                         *
MRIQRA_IG       68.84                     31
IN_MRQRA_IG     1947.338                  31
PAR             5050                      *

Note: * not reported in the original publication.

Comparative Experiments with KDDCUP99: The Reduct lengths obtained and the computation times in seconds are provided in Table 4. The results on KDDCUP99 establish the need for a MapReduce based SBE Reduct algorithm. In contrast to the SFS approaches, which give a super Reduct of 31 attributes, MRSBER_Hash resulted in an exact Reduct of 25 attributes. The only way SFS approaches can arrive at an exact Reduct is by applying an SBE pass to the obtained super Reduct. The computation time of MRSBER_Hash is significantly higher than that of MRIQRA_IG and the PLAR based algorithms. The computational efficiency of the PLAR based algorithms is primarily due to their high-configuration cluster of 19 nodes and their ability to exploit model parallelism on supporting infrastructure. By model parallelism the authors mean initiating separate jobs for obtaining the significance of candidate attribute sets in the SFS approach. The proposed algorithm, being SBE based, lacks the advantages of the granular computations in the PLAR approaches and the positive region elimination in the MRIQRA_IG algorithm. It is to be noted that the proposed algorithm achieved much better performance than the SFS approaches IN_MRQRA_IG and PAR, which do not possess these optimizations. As a first attempt at MapReduce based SBE Reduct computation, our proposed algorithm achieved performance comparable to leading SFS approaches.

Comparative Experiments with the Gisette Dataset: The Gisette dataset contrasts with KDDCUP99 in having a much larger number of attributes and a smaller number of objects. The results for the Gisette dataset in [12] are reported only for the first 5 iterations (selection of five attributes into the Reduct set) with varying levels of model parallelism. PLAR_SCE incurred 30806 s without model parallelism, 15293 s with a model parallelism level of 2 (supporting two parallel MapReduce jobs at any instant), and 1856 s with a level of 64. With a much simpler computational infrastructure, MRSBER_Hash completed the Reduct computation and produced a Reduct of size 23 in 4929 s.

Assuming that PLAR_SCE in [12] were run to complete a Reduct computation selecting 23 attributes, the estimated computation time would be 8537 s. Hence, it can be deduced that on datasets with a large number of attributes MRSBER_Hash can obtain better computational performance than the SFS approach. In our opinion, this may be due to the smaller number of MapReduce jobs initiated in the SBE approach. To be specific, an SBE approach requires |C| MapReduce jobs, whereas an SFS approach as followed in the PLAR methods with model parallelism requires \(|C|+(|C|-1)+(|C|-2)+\cdots +(|C|-|R|+1)\) MapReduce jobs, as illustrated below.
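As a worked illustration of this job-count argument, taking \(|C| = 5000\) and \(|R| = 23\) (figures close to those of Gisette reported above):
$$\begin{aligned} \text {SBE: } |C| = 5000 \text { jobs}, \qquad \text {SFS: } \sum _{i=0}^{|R|-1}(|C|-i) = 23\times 5000 - \frac{22\times 23}{2} = 114747 \text { jobs}. \end{aligned}$$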

5 Conclusion

In this work the MRSBER_Hash algorithm is developed as a MapReduce based SBE Reduct computation approach. MRSBER_Hash is designed with the objective of minimizing data transfer in the shuffle and sort phase of MapReduce. In an extensive comparative analysis with leading MapReduce based SFS Reduct computation approaches, mixed results are obtained. MRSBER_Hash achieves better computational performance on datasets with a large number of conditional attributes. In future work, MRSBER_Hash will be further improved to achieve similar or better computational performance than SFS approaches such as MRIQRA_IG. Irrespective of its performance in computation time, MRSBER_Hash provides an approach for exact Reduct computation in large scale decision systems.

References

  1. Baadal: the IITD computing cloud (2011). http://www.cc.iitd.ernet.in
  2. Dataset used for experiments (1999). http://kdd.ics.uci.edu/databases/kddcup99/
  3. UCI machine learning repository (2013). https://archive.ics.uci.edu/ml/datasets
  4. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.C.: Twister: a runtime for iterative MapReduce. In: HPDC, pp. 810–818. ACM (2010)
  5. Pawlak, Z.: Rough sets. Int. J. Parallel Program. 11(5), 341–356 (1982)
  6. Sai Prasad, P.S.V.S., Raghavendra Rao, C.: Extensions to IQuickReduct. In: Sombattheera, C., Agarwal, A., Udgata, S.K., Lavangnananda, K. (eds.) MIWAI 2011. LNCS (LNAI), vol. 7080, pp. 351–362. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25725-4_31
  7. Shen, Q., Jensen, R.: Rough set-based feature selection: a review. In: Rough Computing: Theories, Technologies and Applications, pp. 70–107. IGI Global (2008)
  8. Sai Prasad, P.S.V.S., Bala Subrahmanyam, H., Singh, P.K.: Scalable IQRA_IG algorithm: an iterative MapReduce approach for reduct computation. In: Krishnan, P., Radha Krishna, P., Parida, L. (eds.) ICDCIT 2017. LNCS, vol. 10109, pp. 58–69. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-50472-8_5
  9. Singh, P.K., Sai Prasad, P.S.V.S.: Scalable quick reduct algorithm: iterative MapReduce approach. In: CODS, pp. 25:1–25:2. ACM (2016)
  10. Yang, Y., Chen, Z., Liang, Z., Wang, G.: Attribute reduction for massive data based on rough set theory and MapReduce. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS (LNAI), vol. 6401, pp. 672–678. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16248-0_91
  11. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud. USENIX Association (2010)
  12. Zhang, J., Li, T., Pan, Y.: Parallel large-scale attribute reduction on cloud systems. CoRR, abs/1610.01807 (2016)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Quadratic Insights Pvt. Ltd., Hyderabad, India
  2. School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
