SQL-Based KDD with Infobright’s RDBMS: Attributes, Reducts, Trees

  • Jakub Wróblewski
  • Sebastian Stawicki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8537)

Abstract

We present a framework for the KDD process implemented using SQL procedures, consisting of constructing new attributes, finding rough set-based reducts, and inducing decision trees. We focus particularly on attribute reduction, which is especially important for high-dimensional data sets. The main technical contribution of this paper is a complete framework for calculating short reducts using SQL queries on data stored in relational form, without the need for any external tools that generate or modify the syntax of those queries. A case study on large real-world data is presented. The paper also recalls other examples of SQL-based data mining implementations. The experimental results were obtained using Infobright’s analytic RDBMS, whose performance characteristics fit the requirements of the presented algorithms well.
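To make the idea of SQL-driven attribute reduction concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a consistent decision table, uses SQLite through Python's standard `sqlite3` module as a stand-in for Infobright's RDBMS, and applies plain backward elimination. The table `t`, its columns, and the helpers `is_consistent` and `find_reduct` are hypothetical names introduced only for illustration.

```python
import sqlite3


def is_consistent(conn, table, attrs, decision):
    """Return True if rows that agree on `attrs` never differ on `decision`."""
    # Identifiers are interpolated directly, so they must come from trusted code.
    if attrs:
        cols = ", ".join(attrs)
        # Count groups of indiscernible rows that carry more than one decision.
        query = (
            f"SELECT COUNT(*) FROM ("
            f"SELECT 1 FROM {table} GROUP BY {cols} "
            f"HAVING COUNT(DISTINCT {decision}) > 1) AS conflicts"
        )
    else:
        # With no attributes, all rows fall into a single group.
        query = f"SELECT COUNT(DISTINCT {decision}) > 1 FROM {table}"
    (conflicts,) = conn.execute(query).fetchone()
    return conflicts == 0


def find_reduct(conn, table, attrs, decision):
    """Drop each attribute in turn if the remaining ones still determine the decision."""
    kept = list(attrs)
    for attr in attrs:
        candidate = [a for a in kept if a != attr]
        if is_consistent(conn, table, candidate, decision):
            kept = candidate
    return kept


# Toy decision table: column c duplicates column a, so one of them is redundant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(0, 0, 0, "no"), (0, 1, 0, "yes"), (1, 0, 1, "yes"), (1, 1, 1, "yes")],
)

reduct = find_reduct(conn, "t", ["a", "b", "c"], "d")
print(reduct)  # ['b', 'c']
```

Every consistency check here is a single `GROUP BY ... HAVING` query, so the data never leaves the database. Because consistency is monotone in the attribute set, one elimination pass yields a subset from which no further attribute can be removed; it does not guarantee the shortest possible reduct.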

Keywords

KDD, Rough sets, Reducts, Decision trees, Feature extraction, SQL, High-dimensional data

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jakub Wróblewski (1)
  • Sebastian Stawicki (2)
  1. Infobright Inc., Warsaw, Poland
  2. Institute of Mathematics, University of Warsaw, Warsaw, Poland
