SQL-Based KDD with Infobright’s RDBMS: Attributes, Reducts, Trees

Wróblewski, Jakub; Stawicki, Sebastian

doi:10.1007/978-3-319-08729-0_3

Jakub Wróblewski & Sebastian Stawicki

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8537)

Abstract

We present a framework for the KDD process implemented using SQL procedures. It consists of constructing new attributes, finding rough set-based reducts, and inducing decision trees. We focus particularly on attribute reduction, which is especially important for high-dimensional data sets. The main technical contribution of this paper is a complete framework for calculating short reducts using SQL queries on data stored in relational form, without the need for any external tools to generate or modify the queries' syntax. A case study of large real-world data is presented. The paper also recalls other examples of SQL-based data mining implementations. The experimental results are based on Infobright’s analytic RDBMS, whose performance characteristics closely fit the requirements of the presented algorithms.
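The idea of checking candidate reducts with plain SQL aggregation can be illustrated with a minimal sketch. This is not the authors' procedure and does not use Infobright; it assumes SQLite through Python's standard `sqlite3` module, a hypothetical table `t` with condition attributes `a`, `b`, `c` and decision `d`, and hypothetical helper names `positive_region_size` and `greedy_reduct`. An attribute is dropped whenever the number of rows in decision-consistent groups (the positive region) stays the same as for the full attribute set.

```python
import sqlite3


def positive_region_size(conn, table, attrs, decision):
    """Count rows whose attribute-value group maps to a single decision value.

    Table and column names are interpolated into the SQL text, so they are
    assumed to be trusted identifiers (hypothetical names in this sketch).
    """
    if not attrs:
        # With no attributes the whole table forms one group.
        total, kinds = conn.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT {decision}) FROM {table}"
        ).fetchone()
        return total if kinds <= 1 else 0
    cols = ", ".join(attrs)
    query = (
        f"SELECT COALESCE(SUM(n), 0) FROM ("
        f"SELECT COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(DISTINCT {decision}) = 1)"
    )
    return conn.execute(query).fetchone()[0]


def greedy_reduct(conn, table, attrs, decision):
    """Try to drop attributes one by one, keeping the positive region intact."""
    target = positive_region_size(conn, table, attrs, decision)
    kept = list(attrs)
    for attr in attrs:
        trial = [x for x in kept if x != attr]
        if positive_region_size(conn, table, trial, decision) == target:
            kept = trial  # attr is redundant given the remaining attributes
    return kept


# Toy data (made up for illustration): d is 'yes' when a = 1 or b = 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (0, 0, 0, "no"),
        (0, 1, 0, "yes"),
        (1, 0, 1, "yes"),
        (1, 1, 1, "yes"),
        (0, 0, 1, "no"),
        (0, 1, 1, "yes"),
    ],
)
reduct = greedy_reduct(conn, "t", ["a", "b", "c"], "d")
print(reduct)
```

On this toy table the sketch keeps `a` and `b` and drops `c`. Each candidate subset costs one GROUP BY query, which is the kind of aggregation workload an analytic RDBMS is designed for; the order in which attributes are tried is an arbitrary choice here and affects which reduct is found.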

The second author was partly supported by the Polish National Science Centre (NCN) grants DEC-2011/01/B/ST6/03867 and DEC-2012/05/B/ST6/03215, and by the National Centre for Research and Development (NCBiR) grant SP/I/1/77065/10 under the strategic scientific research and experimental development program: “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.

Author information

Authors and Affiliations

Infobright Inc., ul. Krzywickiego 34, lok. 219, 02-078, Warsaw, Poland
Jakub Wróblewski
Institute of Mathematics, University of Warsaw, ul. Banacha 2, 02-097, Warsaw, Poland
Sebastian Stawicki

Editor information

Editors and Affiliations

Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Marzena Kryszkiewicz & Zbigniew W. Raś
Department of Computer Science and Artificial Intelligence, University of Granada, Calle del Periodista Daniel Saucedo Aranda s/n, 18071, Granada, Spain
Chris Cornelis
DISCo, Università di Milano – Bicocca, Viale Sarca 336 – U14, 20126, Milano, Italy
Davide Ciucci
Dpt. de Matemáticas, University of Cádiz, Spain
Jesús Medina-Moreno
School of Computing and Information Systems, University of Tasmania, Australia
Hiroshi Motoda

Cite this paper

Wróblewski, J., Stawicki, S. (2014). SQL-Based KDD with Infobright’s RDBMS: Attributes, Reducts, Trees. In: Kryszkiewicz, M., Cornelis, C., Ciucci, D., Medina-Moreno, J., Motoda, H., Raś, Z.W. (eds) Rough Sets and Intelligent Systems Paradigms. Lecture Notes in Computer Science, vol 8537. Springer, Cham. https://doi.org/10.1007/978-3-319-08729-0_3

DOI: https://doi.org/10.1007/978-3-319-08729-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08728-3
Online ISBN: 978-3-319-08729-0
eBook Packages: Computer Science, Computer Science (R0)