Abstract
Beyond their traditional role, database systems have become increasingly attractive over the last decade as scalable analytical platforms based on extensible SQL. This methodology is termed in-database processing and provides several advantages over traditional data mining approaches. In this work, we bring Variable Precision Rough Sets to the domain of databases as a common framework to unlock hidden knowledge from data. Our derived model is built purely upon relational operations and is therefore very efficient. We further demonstrate its applicability to feature selection by introducing two in-database algorithms. Our experiments indicate that the model scales and that its performance is comparable to existing approaches, while it proves superior when applied to real-life applications.
Notes
- 1.
Let \(\langle U,A,D\rangle \) be inconsistent. Thus, we have \(K,K'\in U/A\) with \(K\subseteq _\beta X\), \(K'\cap X \ne \emptyset \) and \(K' \nsubseteq _\beta X\) for some \(X\in U/D\). For an indispensable attribute \(a\in A\), we consider \(U/(A\setminus \{a\})\) and may get \(K^*\in U/(A\setminus \{a\})\) with \(K^*=K\cup K'\) and \(\vert POS_{A\setminus \{a\}}^{\beta }(D) \vert \ne \vert POS_A^{\beta }(D) \vert \). Hence \(a\) is in the \(\beta \)-core. This case is not covered when \(\langle U,A,D\rangle \) is cleaned up front.
- 2.
OS: Microsoft Windows 2012 R2 (Standard edition x64); DBs: Microsoft SQL Server 2014 (Developer edition 12.0.2, x64), Oracle 12c (Enterprise edition 12.1.0.2, x64); Misc: JDK 1.8.0.51, latest JDBC, R 3.2.0 (x64), RSESLib 3.0.4, RSR 1.3.0; Memory: 48 GByte; CPU: 32x2.6 GHz Intel Xeon E312xx (Sandy Bridge); HDD: 500 GByte.
- 3.
KDD99m is a modification of the original KDD99 dataset available in [22]. In contrast to the original, it holds one additional attribute, which results in evenly sized equivalence classes.
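The situation described in Note 1 can be illustrated with a small worked example. The following is a minimal sketch, not the authors' implementation: it assumes Ziarko's error-based convention, where \(K\subseteq _\beta X\) holds if \(1-\vert K\cap X\vert /\vert K\vert \le \beta \) with \(0\le \beta <0.5\), and it uses an in-memory SQLite table as a stand-in for a real database system. The table name `t`, the columns `a`, `b`, `d` and the helper `beta_positive_size` are hypothetical and chosen only for illustration.

```python
import sqlite3


def beta_positive_size(conn, attrs, beta):
    """Return |POS^beta_attrs(d)| for table t (hypothetical helper).

    Assumes the error-based convention: an equivalence class K counts as
    beta-included in a decision class if 1 - majority/|K| <= beta.
    `attrs` must be a non-empty list of trusted column names.
    """
    cols = ", ".join(attrs)
    query = f"""
        SELECT COALESCE(SUM(class_size), 0)
        FROM (
            SELECT SUM(cnt) AS class_size, MAX(cnt) AS majority
            FROM (
                SELECT {cols}, d, COUNT(*) AS cnt
                FROM t
                GROUP BY {cols}, d
            ) AS per_decision
            GROUP BY {cols}
        ) AS per_class
        WHERE 1.0 - 1.0 * majority / class_size <= ?
    """
    return conn.execute(query, (beta,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, d INTEGER)")

# K  = objects with (a=0, b=0): 4 of 5 have d=1, so the error is 0.2.
# K' = objects with (a=1, b=0): 1 of 2 has d=1, so the error is 0.5.
rows = [(0, 0, 1)] * 4 + [(0, 0, 0)] + [(1, 0, 1), (1, 0, 0)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

beta = 0.25
full = beta_positive_size(conn, ["a", "b"], beta)
without_a = beta_positive_size(conn, ["b"], beta)
without_b = beta_positive_size(conn, ["a"], beta)

print(full)       # 5: only K lies in the beta-positive region
print(without_a)  # 0: K and K' merge into K*, error 2/7 exceeds beta
print(without_b)  # 5: dropping b leaves the partition unchanged
```

In this toy table, removing `a` merges \(K\) and \(K'\) into \(K^*\) and changes the size of the \(\beta \)-positive region from 5 to 0, so `a` would belong to the \(\beta \)-core, whereas removing `b` has no effect. The counting is expressed with `GROUP BY` and aggregates only, which is in the spirit of a purely relational formulation, but the actual queries used in the paper may differ.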
References
Nguyen, H.S.: Approximate boolean reasoning: foundations and applications in data mining. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 334–506. Springer, Heidelberg (2006)
Prasad, S., Fard, A., Gupta, V., Martinez, J., LeFevre, J., Xu, V., Hsu, M., Roy, I.: Large-scale predictive analytics in Vertica: fast data transfer, distributed model creation, and in-database prediction. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1657–1668 (2015)
Fernandez-Baizán, M.C., Menasalvas Ruiz, E., Peña Sánchez, J.M.: Integrating RDMS and data mining capabilities using rough sets. In: Proceedings of the 6th International Conference on IPMU, pp. 1439–1445 (1996)
Ohrn, A., Komorowski, J.: ROSETTA - a rough set toolkit for analysis of data. In: Proceedings of the 3rd International Joint Conference on Information Sciences, pp. 403–407 (1997)
Bazan, J., Szczuka, M.S.: The rough set exploration system. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)
Tileston, T.: Have your cake & eat it too! Accelerate data mining combining SAS & Teradata. In: Teradata Partners 2005 “Experience the Possibilities” (2005)
Hellerstein, J.M., Re, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endowment 5(12), 1700–1711 (2012)
Beer, F., Bühler, U.: An in-database rough set toolkit. In: Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR and FGDB, pp. 146–157 (2015)
Pawlak, Z.: Rough Sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)
Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Springer, Netherlands (1991)
Ziarko, W.: Variable precision rough set model. J. Comput. Syst. Sci. 46(1), 39–59 (1993)
Kumar, A.: New techniques for data reduction in a database system for knowledge discovery applications. JIIS 10(1), 31–48 (1998)
Hu, X.T., Lin, T.Y., Han, J.: A new rough sets model based on database systems. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. LNCS, vol. 2639, pp. 114–121. Springer, Heidelberg (2003)
Vaithyanathan, K., Lin, T.Y.: High frequency rough set model based on database systems. In: Annual Meeting of the NAFIPS, pp. 1–6 (2008)
Sun, H.Q., Xiong, Z., Wang, Y.: Research on integrating ORDBMS and rough set theory. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 169–175. Springer, Heidelberg (2004)
Chan, C.-C.: Learning rules from very large databases using rough multisets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 59–77. Springer, Heidelberg (2004)
Naouali, S., Missaoui, R.: Flexible query answering in data cubes. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 221–232. Springer, Heidelberg (2005)
Slezak, D., Wroblewski, J., Eastwood, V., Synak, P.: Brighthouse: an analytic data warehouse for ad-hoc queries. Proc. VLDB Endowment 1, 1337–1345 (2008)
Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. Intelligent Decision Support. Theory and Decision Library, pp. 331–362. Springer, Netherlands (1992)
Shen, Q., Chouchoulas, A.: A modular approach to generating fuzzy rules with reduced attributes for the monitoring of complex systems. Eng. Appl. Artif. Intell. 13(3), 263–278 (2000)
Jensen, R., Shen, Q.: Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley, Hoboken (2008)
Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, January 2016. http://archive.ics.uci.edu/ml
Riza, L.S., Janusz, A., Bergmeir, C., Cornelis, C., Herrera, F., Ślęzak, D., Benítez, J.: Implementing algorithms of rough set theory and fuzzy rough set theory in the R package RoughSets. Inf. Sci. 287, 68–89 (2014)
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5 (2014)
Reiss, A., Stricker, D.: Introducing a new benchmarked dataset for activity monitoring. In: Proceedings of the 16th ISWC, pp. 108–109 (2012)
Cattral, R., Oppacher, F., Deugo, D.: Evolutionary data mining with automatic rule generalization. In: Recent Advances in Computers, Computing and Communications, pp. 296–300 (2002)
Blackard, J.A., Dean, D.J.: Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In: Proceedings of the 2nd Southern Forestry GIS Conference, pp. 189–199 (1998)
NSL-KDD: Data Set for Network-based Intrusion Detection Systems, January 2016. http://nsl.cs.unb.ca/NSL-KDD
© 2016 Springer International Publishing Switzerland
Cite this paper
Beer, F., Bühler, U. (2016). In-Database Feature Selection Using Rough Set Theory. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-40581-0_32
DOI: https://doi.org/10.1007/978-3-319-40581-0_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40580-3
Online ISBN: 978-3-319-40581-0
eBook Packages: Computer Science, Computer Science (R0)