In-Database Feature Selection Using Rough Set Theory

  • Conference paper
Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2016)

Abstract

Beyond their traditional roles, database systems have become increasingly attractive over the last decade as scalable analytical platforms built on extensible SQL. This methodology, termed in-database processing, offers several advantages over traditional data mining approaches. In this work, we bring Variable Precision Rough Sets to the domain of databases as a common framework for unlocking hidden knowledge from data. The derived model is built purely on relational operations and is therefore very efficient. We further demonstrate its applicability to feature selection by introducing two in-database algorithms. Our experiments indicate that the model scales and that its performance is comparable to existing approaches, and superior in real-life applications.
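To make the idea of a rough-set model built on relational operations more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Ziarko's formulation of Variable Precision Rough Sets, in which an equivalence class belongs to the β-positive region when its majority decision misclassifies at most a fraction β of its objects (0 ≤ β < 0.5). SQLite stands in for the database systems listed in Note 2, and the table name toy, the columns a, b, d, and the function name are all hypothetical.

```python
import sqlite3


def beta_positive_region_size(conn, table, cond_attrs, decision, beta):
    """Size of the beta-positive region, computed with GROUP BY only.

    Assumes Ziarko-style beta (admissible error, 0 <= beta < 0.5) and a
    non-empty list of condition attributes. Identifiers are interpolated
    into the SQL text, so they must come from trusted code, not user input.
    """
    cols = ", ".join(cond_attrs)
    query = f"""
        SELECT COALESCE(SUM(n), 0)
        FROM (
            -- n: class size, m: size of the majority decision in the class
            SELECT SUM(cnt) AS n, MAX(cnt) AS m
            FROM (
                SELECT {cols}, {decision}, COUNT(*) AS cnt
                FROM {table}
                GROUP BY {cols}, {decision}
            )
            GROUP BY {cols}
        )
        WHERE (n - m) <= ? * n   -- misclassified objects within tolerance
    """
    return conn.execute(query, (beta,)).fetchone()[0]


# Hypothetical toy decision table: condition attributes a, b; decision d.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (a INTEGER, b INTEGER, d INTEGER)")
rows = [(0, 0, 1)] * 4 + [(0, 0, 0)] + [(0, 1, 0)] * 3 + [(1, 0, 1), (1, 0, 0)]
conn.executemany("INSERT INTO toy VALUES (?, ?, ?)", rows)

total = conn.execute("SELECT COUNT(*) FROM toy").fetchone()[0]
pos = beta_positive_region_size(conn, "toy", ["a", "b"], "d", 0.25)

# Dependency degree |POS| / |U|, a common quality measure in rough-set
# feature selection (its use here is an illustration, not the paper's code).
print(f"|POS| = {pos}, |U| = {total}, dependency = {pos / total:.2f}")
```

Because the whole computation is two nested aggregations, the data never leaves the database; a feature selection loop would only need to re-run the query for different attribute subsets.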

Notes

  1. Let \(\langle U,A,D\rangle\) be inconsistent. Thus, we have \(K,K'\in U/A: K\subseteq_\beta X\) and \(K'\cap X \ne \emptyset\) with \(K' \nsubseteq_\beta X, X\in U/D\). For indispensable attributes \(a\in A\), we consider \(U/(A\setminus \{a\})\) and may get \(K^*\in U/(A\setminus \{a\})\) with \(K^*=K\cup K'\) and \(\vert POS_{A\setminus \{a\}}^{\beta}(D) \vert \ne \vert POS_A^{\beta}(D) \vert\). Hence \(a\) is in the \(\beta\)-core. This case is not covered when \(\langle U,A,D\rangle\) is cleaned up front.

  2. OS: Microsoft Windows 2012 R2 (Standard edition x64); DBs: Microsoft SQL Server 2014 (Developer edition 12.0.2, x64), Oracle 12c (Enterprise edition 12.1.0.2, x64); Misc: JDK 1.8.0.51, latest JDBC, R 3.2.0 (x64), RSESLib 3.0.4, RSR 1.3.0; Memory: 48 GByte; CPU: 32x2.6 GHz Intel Xeon E312xx (Sandy Bridge); HDD: 500 GByte.

  3. KDD99m is a modification of the original KDD99 dataset available in [22]. In contrast to the original, it holds one additional attribute, which results in evenly sized equivalence classes.
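The argument in Note 1 can be traced on a small example. The sketch below is illustrative only: it assumes Ziarko's reading of \(\beta\) as an admissible classification error (0 ≤ β < 0.5), and it assumes that "cleaning" means dropping the objects of inconsistent classes, which the note does not spell out. The table, the attribute names a, b, d, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def beta_positive_region_size(rows, attrs, decision, beta):
    """Count objects in classes of U/attrs that are beta-included in a decision class."""
    classes = defaultdict(Counter)
    for row in rows:
        classes[tuple(row[a] for a in attrs)][row[decision]] += 1
    size = 0
    for decision_counts in classes.values():
        n = sum(decision_counts.values())   # class size
        m = max(decision_counts.values())   # majority decision count
        if (n - m) <= beta * n:             # error of the majority <= beta
            size += n
    return size


def make(a, b, d, copies):
    """Build `copies` identical objects of the hypothetical table."""
    return [{"a": a, "b": b, "d": d} for _ in range(copies)]


table = (
    make(0, 0, 1, 5)                       # K: consistent, all d = 1
    + make(1, 0, 1, 2) + make(1, 0, 0, 2)  # K': inconsistent, error 0.5
    + make(0, 1, 0, 3)                     # a further consistent class
)
beta = 0.2

# With A = {a, b}, K and the third class are in the positive region.
pos_full = beta_positive_region_size(table, ["a", "b"], "d", beta)
# Dropping a merges K and K' into K* (error 2/9 > beta), so K* falls out.
pos_without_a = beta_positive_region_size(table, ["b"], "d", beta)
print(pos_full, pos_without_a)  # sizes differ, so a is in the beta-core

# Assumed cleaning step: remove the objects of the inconsistent class K'.
cleaned = [r for r in table if not (r["a"] == 1 and r["b"] == 0)]
pos_full_cleaned = beta_positive_region_size(cleaned, ["a", "b"], "d", beta)
pos_without_a_cleaned = beta_positive_region_size(cleaned, ["b"], "d", beta)
print(pos_full_cleaned, pos_without_a_cleaned)  # sizes agree after cleaning
```

On the uncleaned table the positive region shrinks when a is removed, so a is indispensable; after the assumed cleaning step the two sizes coincide and a would wrongly appear dispensable, which is the case the note warns about.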

References

  1. Nguyen, H.S.: Approximate boolean reasoning: foundations and applications in data mining. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 334–506. Springer, Heidelberg (2006)

  2. Prasad, S., Fard, A., Gupta, V., Martinez, J., LeFevre, J., Xu, V., Hsu, M., Roy, I.: Large-scale predictive analytics in Vertica: fast data transfer, distributed model creation, and in-database prediction. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1657–1668 (2015)

  3. Fernandez-Baizán, M.C., Menasalvas Ruiz, E., Peña Sánchez, J.M.: Integrating rdms and data mining capabilities using rough sets. In: Proceedings of the 6th International Conference on IPMU, pp. 1439–1445 (1996)

  4. Ohrn, A., Komorowski, J.: ROSETTA - a rough set toolkit for analysis of data. In: Proceedings of the 3rd International Joint Conference on Information Sciences, pp. 403–407 (1997)

  5. Bazan, J., Szczuka, M.S.: The rough set exploration system. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)

  6. Tileston, T.: Have your cake & eat it too! Accelerate data mining combining SAS & Teradata. In: Teradata Partners 2005 “Experience the Possibilities” (2005)

  7. Hellerstein, J.M., Re, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endowment 5(12), 1700–1711 (2012)

  8. Beer, F., Bühler, U.: An in-database rough set toolkit. In: Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR and FGDB, pp. 146–157 (2015)

  9. Pawlak, Z.: Rough Sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)

  10. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Springer, Netherlands (1991)

  11. Ziarko, W.: Variable precision rough set model. J. Comput. Syst. Sci. 46(1), 39–59 (1993)

  12. Kumar, A.: New techniques for data reduction in a database system for knowledge discovery applications. J. Intell. Inf. Syst. 10(1), 31–48 (1998)

  13. Hu, X.T., Lin, T.Y., Han, J.: A new rough sets model based on database systems. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. LNCS, vol. 2639, pp. 114–121. Springer, Heidelberg (2003)

  14. Vaithyanathan, K., Lin, T.Y.: High frequency rough set model based on database systems. In: Annual Meeting of the NAFIPS, pp. 1–6 (2008)

  15. Sun, H.Q., Xiong, Z., Wang, Y.: Research on integrating ordbms and rough set theory. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 169–175. Springer, Heidelberg (2004)

  16. Chan, C.-C.: Learning rules from very large databases using rough multisets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 59–77. Springer, Heidelberg (2004)

  17. Naouali, S., Missaoui, R.: Flexible query answering in data cubes. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 221–232. Springer, Heidelberg (2005)

  18. Slezak, D., Wroblewski, J., Eastwood, V., Synak, P.: Brighthouse: an analytic data warehouse for ad-hoc queries. Proc. VLDB Endowment 1, 1337–1345 (2008)

  19. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. Intelligent Decision Support. Theory and Decision Library, pp. 331–362. Springer, Netherlands (1992)

  20. Shen, Q., Chouchoulas, A.: A modular approach to generating fuzzy rules with reduced attributes for the monitoring of complex systems. Eng. Appl. Artif. Intell. 13(3), 263–278 (2000)

  21. Jensen, R., Shen, Q.: Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley, Hoboken (2008)

  22. Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, January 2016. http://archive.ics.uci.edu/ml

  23. Riza, L.S., Janusz, A., Bergmeir, C., Cornelis, C., Herrera, F., Ślęzak, D., Benítez, J.: Implementing algorithms of rough set theory and fuzzy rough set theory in the R package RoughSets. Inf. Sci. 287, 68–89 (2014)

  24. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5 (2014)

  25. Reiss, A., Stricker, D.: Introducing a new benchmarked dataset for activity monitoring. In: Proceedings of the 16th ISWC, pp. 108–109 (2012)

  26. Cattral, R., Oppacher, F., Deugo, D.: Evolutionary data mining with automatic rule generalization. In: Recent Advances in Computers, Computing and Communications, pp. 296–300 (2002)

  27. Blackard, J.A., Dean, D.J.: Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In: Proceedings of the 2nd Southern Forestry GIS Conference, pp. 189–199 (1998)

  28. NSL-KDD: Data Set for Network-based Intrusion Detection Systems, January 2016. http://nsl.cs.unb.ca/NSL-KDD

Corresponding author

Correspondence to Frank Beer.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Beer, F., Bühler, U. (2016). In-Database Feature Selection Using Rough Set Theory. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-40581-0_32

  • DOI: https://doi.org/10.1007/978-3-319-40581-0_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40580-3

  • Online ISBN: 978-3-319-40581-0

  • eBook Packages: Computer Science, Computer Science (R0)
