Exhaustive search algorithms to mine subgroups on Big Data using Apache Spark

Padillo, F.; Luna, J. M.; Ventura, S.

doi:10.1007/s13748-017-0112-x

Regular Paper
Published: 27 January 2017

Volume 6, pages 145–158 (2017)
Progress in Artificial Intelligence

Abstract

Subgroup discovery is a well-known technique for extracting patterns with respect to a variable of interest in the data. However, the explosion in data gathering has hampered the performance of traditional algorithms at discovering interesting relationships between different objects in a set with respect to a specific property that is of interest to the user. In this regard, our goal is to propose a set of efficient techniques to mine subgroups on Big Data by means of Apache Spark. To this end, AprioriK-SD-OE and PFP-SD-OE are proposed as fast exhaustive search algorithms to discover subgroups on Big Data using Apache Spark. The experimental study includes more than 70 datasets and considers search spaces larger than \(10^{15}\) subgroups. The scalability of our proposals is analyzed on datasets with 200 million instances, demonstrating the usefulness of Spark for tackling Big Data.
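
To make the notion of exhaustive subgroup search concrete, the following is a minimal single-machine sketch in Python. It is not AprioriK-SD-OE or PFP-SD-OE, and it does not use Spark. The abstract does not name a quality measure, so weighted relative accuracy (WRAcc), a measure commonly used in subgroup discovery, is assumed here; the toy dataset and the function names `wracc` and `enumerate_subgroups` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy dataset: each row is (attribute dict, target class).
ROWS = [
    ({"age": "young", "income": "low"}, "no"),
    ({"age": "young", "income": "high"}, "yes"),
    ({"age": "old", "income": "high"}, "yes"),
    ({"age": "old", "income": "low"}, "no"),
    ({"age": "old", "income": "high"}, "yes"),
    ({"age": "young", "income": "low"}, "no"),
]


def wracc(rows, description, target):
    """Weighted relative accuracy of a subgroup (assumed quality measure).

    WRAcc = p(cond) * (p(target | cond) - p(target)), where `description`
    is a tuple of (attribute, value) pairs that must all hold.
    """
    n = len(rows)
    covered = [t for attrs, t in rows
               if all(attrs.get(a) == v for a, v in description)]
    if not covered:
        return 0.0
    p_cond = len(covered) / n
    p_target = sum(1 for _, t in rows if t == target) / n
    p_target_given_cond = sum(1 for t in covered if t == target) / len(covered)
    return p_cond * (p_target_given_cond - p_target)


def enumerate_subgroups(rows, target, max_len=2):
    """Score every conjunction of up to `max_len` attribute-value pairs."""
    items = sorted({(a, v) for attrs, _ in rows for a, v in attrs.items()})
    results = []
    for k in range(1, max_len + 1):
        for desc in combinations(items, k):
            # Skip descriptions that constrain the same attribute twice.
            if len({a for a, _ in desc}) < k:
                continue
            results.append((desc, wracc(rows, desc, target)))
    # Best-scoring subgroups first.
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for desc, score in enumerate_subgroups(ROWS, "yes"):
        print(desc, round(score, 4))
```

The number of candidate descriptions grows combinatorially with the number of attribute-value pairs, which is why search spaces of the size mentioned in the abstract call for pruning and distributed computation rather than this brute-force loop.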

Acknowledgements

This work was supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds, Project TIN-2014-55252-P.

Author information

Authors and Affiliations

Department of Computer Science and Numerical Analysis, University of Cordoba, Rabanales Campus, Córdoba, Spain
F. Padillo, J. M. Luna & S. Ventura

Corresponding author

Correspondence to S. Ventura.

About this article

Cite this article

Padillo, F., Luna, J.M. & Ventura, S. Exhaustive search algorithms to mine subgroups on Big Data using Apache Spark. Prog Artif Intell 6, 145–158 (2017). https://doi.org/10.1007/s13748-017-0112-x

Received: 17 December 2016
Accepted: 11 January 2017
Published: 27 January 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s13748-017-0112-x
