Scalable Implementations of Rough Set Algorithms: A Survey

  • Bing Zhou
  • Hyuk Cho
  • Xin Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10868)

Abstract

With the rapid growth in the volume, variety, and velocity of data across real-life domains, learning from big data has become an increasing challenge. Rough set theory has been successfully applied to knowledge discovery from databases (KDD) for handling imperfect data. Most traditional rough set algorithms were implemented sequentially and run on a single machine, which makes them computationally expensive and inefficient for massive data. Recent computing frameworks, such as MapReduce and Apache Spark, have made it possible to implement parallel rough set algorithms on distributed clusters of commodity computers and to speed up big data analyses. Although a variety of scalable rough set implementations have been developed, (1) most studies compared their work with outdated sequential implementations; (2) certain distributed computing frameworks were used more frequently, while recently developed frameworks were overlooked; and (3) discussion of existing issues and guidance on adopting new computing frameworks is lacking. The main objective of this paper is to survey the current state of the art in scalable implementations of rough set algorithms. It aims to help researchers catch up with recent developments in this field and to offer insights into developing rough set algorithms in up-to-date high-performance computing environments for big data analytics.

Keywords

Rough set, Scalable, Parallel, Distributed, Hadoop, MapReduce, Apache Spark

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, Sam Houston State University, Huntsville, USA
