In-Database Feature Selection Using Rough Set Theory

Beer, Frank; Bühler, Ulrich

doi:10.1007/978-3-319-40581-0_32

Part of the book series: Communications in Computer and Information Science (CCIS, volume 611)

Included in the following conference series:

International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems

Abstract

Beyond their traditional roles, database systems have become increasingly attractive over the last decade as scalable analytical platforms built on extensible SQL. This approach is termed in-database processing and offers several advantages over traditional mining approaches. In this work we bring Variable Precision Rough Sets to the domain of databases as a common framework for unlocking hidden knowledge from data. Our derived model is built upon pure relational operations and is thus very efficient. We further demonstrate its applicability to feature selection by introducing two in-database algorithms. Our experiments indicate that the model scales and is comparable to existing approaches in terms of performance, but superior when applied to real-life applications.
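
To make the idea of a model "built upon pure relational operations" concrete, the following is a minimal sketch of how the size of a \(\beta\)-positive region could be obtained with nothing but GROUP BY aggregation. It is not the authors' implementation. It assumes Ziarko's convention, in which a class \(K\) is \(\beta\)-included in \(X\) when \(1 - \vert K\cap X\vert / \vert K\vert \le \beta\) with \(0 \le \beta < 0.5\); the function name, the table layout, and the use of SQLite are illustrative assumptions.

```python
import sqlite3


def beta_positive_region_size(conn, table, cond_attrs, decision, beta):
    """Count the objects in the beta-positive region via nested GROUP BY.

    Assumption: Ziarko's convention, i.e. a class K is beta-included in a
    decision class X when 1 - |K ∩ X| / |K| <= beta, with 0 <= beta < 0.5.
    For beta < 0.5 only the majority decision of a class can qualify.
    Identifiers are interpolated into the SQL text, so they must be trusted.
    """
    cols = ", ".join(cond_attrs)
    query = f"""
        SELECT COALESCE(SUM(class_size), 0)
        FROM (
            -- one row per equivalence class of U/cond_attrs
            SELECT {cols}, SUM(cnt) AS class_size, MAX(cnt) AS majority
            FROM (
                -- one row per (equivalence class, decision value)
                SELECT {cols}, {decision}, COUNT(*) AS cnt
                FROM {table}
                GROUP BY {cols}, {decision}
            ) AS per_decision
            GROUP BY {cols}
        ) AS per_class
        WHERE 1.0 - CAST(majority AS REAL) / class_size <= ?
    """
    return conn.execute(query, (beta,)).fetchone()[0]
```

Calling the function with a subset of the condition attributes gives the quantity \(\vert POS_{B}^{\beta}(D)\vert\) for that subset \(B\), which is the kind of value a feature selection procedure would compare against \(\vert POS_{A}^{\beta}(D)\vert\).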

Notes

1. Let \(\langle U,A,D\rangle\) be inconsistent. Thus, there exist \(K,K'\in U/A\) and \(X\in U/D\) with \(K\subseteq_\beta X\), \(K'\cap X \ne \emptyset\), and \(K' \nsubseteq_\beta X\). For indispensable attributes \(a\in A\), we consider \(U/(A\setminus \{a\})\) and may get \(K^*\in U/(A\setminus \{a\})\) with \(K^*=K\cup K'\) and \(\vert POS_{A\setminus \{a\}}^{\beta}(D) \vert \ne \vert POS_A^{\beta}(D) \vert\). Hence \(a\) is in the \(\beta\)-core. This case is not covered when \(\langle U,A,D\rangle\) is cleaned up front.
2. OS: Microsoft Windows 2012 R2 (Standard edition x64); DBs: Microsoft SQL Server 2014 (Developer edition 12.0.2, x64), Oracle 12c (Enterprise edition 12.1.0.2, x64); Misc: JDK 1.8.0.51, latest JDBC, R 3.2.0 (x64), RSESLib 3.0.4, RSR 1.3.0; Memory: 48 GByte; CPU: 32 × 2.6 GHz Intel Xeon E312xx (Sandy Bridge); HDD: 500 GByte.
3. KDD99m is a modification of the original KDD99 dataset available in [22]. In contrast to the original, it holds one additional attribute, which results in evenly sized equivalence classes.
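
The argument in Note 1 can be checked on a toy decision table. The sketch below is illustrative only: the table, the value \(\beta = 0.2\), the function names, and the reading of "cleaned up front" as dropping every object whose class is not \(\beta\)-included in a decision class are all assumptions, and \(\beta\) is taken as Ziarko's admissible classification error. Here \(K\) is the class with \(a=0, b=0\) and \(K'\) the class with \(a=1, b=0\); removing \(a\) merges them into \(K^*\).

```python
from collections import Counter, defaultdict

BETA = 0.2  # assumed admissible classification error, 0 <= beta < 0.5

# Hypothetical decision table: condition attributes a, b; decision d.
ROWS = [
    {"a": 0, "b": 0, "d": 1},  # K: consistent class
    {"a": 0, "b": 0, "d": 1},
    {"a": 1, "b": 0, "d": 1},  # K': inconsistent class
    {"a": 1, "b": 0, "d": 0},
    {"a": 0, "b": 1, "d": 0},
]


def class_counts(rows, cond_attrs, decision):
    """Map each equivalence class of U/cond_attrs to its decision counts."""
    classes = defaultdict(Counter)
    for row in rows:
        classes[tuple(row[a] for a in cond_attrs)][row[decision]] += 1
    return classes


def is_beta_included(counts, beta):
    """True if the class is beta-included in its majority decision class."""
    return 1 - max(counts.values()) / sum(counts.values()) <= beta


def positive_size(rows, cond_attrs, decision, beta):
    """Number of objects in the beta-positive region."""
    classes = class_counts(rows, cond_attrs, decision)
    return sum(sum(c.values()) for c in classes.values()
               if is_beta_included(c, beta))


def beta_core(rows, cond_attrs, decision, beta):
    """Attributes whose removal changes the size of the positive region."""
    full = positive_size(rows, cond_attrs, decision, beta)
    return {a for a in cond_attrs
            if positive_size(rows, [b for b in cond_attrs if b != a],
                             decision, beta) != full}


def clean_up_front(rows, cond_attrs, decision, beta):
    """Assumed cleaning step: keep only objects of beta-included classes."""
    classes = class_counts(rows, cond_attrs, decision)
    return [r for r in rows
            if is_beta_included(classes[tuple(r[a] for a in cond_attrs)], beta)]


core_original = beta_core(ROWS, ["a", "b"], "d", BETA)
core_cleaned = beta_core(clean_up_front(ROWS, ["a", "b"], "d", BETA),
                         ["a", "b"], "d", BETA)
```

On the original table both attributes are indispensable, because dropping \(a\) merges \(K\) and \(K'\) into a class with error 0.25 and shrinks the positive region from 3 objects to 1. After the assumed cleaning step \(K'\) is gone, dropping \(a\) no longer changes the positive region, and \(a\) is missing from the computed core.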

References

[1] Nguyen, H.S.: Approximate boolean reasoning: foundations and applications in data mining. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 334–506. Springer, Heidelberg (2006)
[2] Shreya, P., Fard, A., Gupta, V., Martinez, J., LeFevre, J., Xu, V., Hsu, M., Roy, I.: Large-scale predictive analytics in Vertica: fast data transfer, distributed model creation, and in-database prediction. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1657–1668 (2015)
[3] Fernandez-Baizán, M.C., Menasalvas Ruiz, E., Peña Sánchez, J.M.: Integrating RDMS and data mining capabilities using rough sets. In: Proceedings of the 6th International Conference on IPMU, pp. 1439–1445 (1996)
[4] Ohrn, A., Komorowski, J.: ROSETTA - a rough set toolkit for analysis of data. In: Proceedings of the 3rd International Joint Conference on Information Sciences, pp. 403–407 (1997)
[5] Bazan, J., Szczuka, M.S.: The rough set exploration system. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)
[6] Tileston, T.: Have your cake & eat it too! Accelerate data mining combining SAS & Teradata. In: Teradata Partners 2005 “Experience the Possibilities” (2005)
[7] Hellerstein, J.M., Re, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endowment 5(12), 1700–1711 (2012)
[8] Beer, F., Bühler, U.: An in-database rough set toolkit. In: Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR and FGDB, pp. 146–157 (2015)
[9] Pawlak, Z.: Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)
[10] Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Springer, Netherlands (1991)
[11] Ziarko, W.: Variable precision rough set model. J. Comput. Syst. Sci. 46(1), 39–59 (1993)
[12] Kumar, A.: New techniques for data reduction in a database system for knowledge discovery applications. JIIS 10(1), 31–48 (1998)
[13] Hu, X.T., Lin, T.Y., Han, J.: A new rough sets model based on database systems. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. LNCS, vol. 2639, pp. 114–121. Springer, Heidelberg (2003)
[14] Vaithyanathan, K., Lin, T.Y.: High frequency rough set model based on database systems. In: Annual Meeting of the NAFIPS, pp. 1–6 (2008)
[15] Sun, H.Q., Xiong, Z., Wang, Y.: Research on integrating ORDBMS and rough set theory. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 169–175. Springer, Heidelberg (2004)
[16] Chan, C.-C.: Learning rules from very large databases using rough multisets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 59–77. Springer, Heidelberg (2004)
[17] Naouali, S., Missaoui, R.: Flexible query answering in data cubes. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 221–232. Springer, Heidelberg (2005)
[18] Slezak, D., Wroblewski, J., Eastwood, V., Synak, P.: Brighthouse: an analytic data warehouse for ad-hoc queries. Proc. VLDB Endowment 1, 1337–1345 (2008)
[19] Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Intelligent Decision Support. Theory and Decision Library, pp. 331–362. Springer, Netherlands (1992)
[20] Shen, Q., Chouchoulas, A.: A modular approach to generating fuzzy rules with reduced attributes for the monitoring of complex systems. Eng. Appl. Artif. Intell. 13(3), 263–278 (2000)
[21] Jensen, R., Shen, Q.: Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley, Hoboken (2008)
[22] Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, January 2016. http://archive.ics.uci.edu/ml
[23] Riza, L.S., Janusz, A., Bergmeir, C., Cornelis, C., Herrera, F., Ślęzak, D., Benítez, J.: Implementing algorithms of rough set theory and fuzzy rough set theory in the R package RoughSets. Inf. Sci. 287, 68–89 (2014)
[24] Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5 (2014)
[25] Reiss, A., Stricker, D.: Introducing a new benchmarked dataset for activity monitoring. In: Proceedings of the 16th ISWC, pp. 108–109 (2012)
[26] Cattral, R., Oppacher, F., Deugo, D.: Evolutionary data mining with automatic rule generalization. In: Recent Advances in Computers, Computing and Communications, pp. 296–300 (2002)
[27] Blackard, J.A., Dean, D.J.: Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In: Proceedings of the 2nd Southern Forestry GIS Conference, pp. 189–199 (1998)
[28] NSL-KDD: Data Set for Network-based Intrusion Detection Systems, January 2016. http://nsl.cs.unb.ca/NSL-KDD

Author information

Authors and Affiliations

University of Applied Sciences Fulda, Leipziger Straße 123, 36037, Fulda, Germany
Frank Beer & Ulrich Bühler

Corresponding author

Correspondence to Frank Beer.

Editor information

Editors and Affiliations

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Joao Paulo Carvalho
LIP6, Université Pierre et Marie Curie, Paris, France
Marie-Jeanne Lesot
School of Industrial Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
Uzay Kaymak
IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Susana Vieira
LIP6, Université Pierre et Marie Curie, CNRS, Paris, France
Bernadette Bouchon-Meunier
Iona College, Machine Intelligence Institute, New Rochelle, New York, USA
Ronald R. Yager

About this paper

Cite this paper

Beer, F., Bühler, U. (2016). In-Database Feature Selection Using Rough Set Theory. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-40581-0_32

DOI: https://doi.org/10.1007/978-3-319-40581-0_32
Published: 11 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40580-3
Online ISBN: 978-3-319-40581-0
eBook Packages: Computer Science, Computer Science (R0)
