Progress in Artificial Intelligence, Volume 6, Issue 2, pp 145–158

Exhaustive search algorithms to mine subgroups on Big Data using Apache Spark

Regular Paper

Abstract

Subgroup discovery is a well-known technique for extracting patterns with respect to a variable of interest in the data. However, the explosion in data gathering has hampered the performance of traditional algorithms, which seek interesting relationships between objects in a set with respect to a specific property of interest to the user. Our goal is therefore to propose a set of efficient techniques for mining subgroups on Big Data by means of Apache Spark. To this end, AprioriK-SD-OE and PFP-SD-OE are proposed as fast exhaustive search algorithms. The experimental study includes more than 70 datasets, with search spaces larger than \(10^{15}\) subgroups. The scalability of our proposals is analyzed on datasets with 200 million instances, demonstrating the usefulness of Spark for tackling Big Data.

Keywords

Subgroup discovery, Big Data, Apache Spark

Notes

Acknowledgements

This work was supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds, Project TIN-2014-55252-P.

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. Department of Computer Science and Numerical Analysis, University of Cordoba, Córdoba, Spain
