A clustering-based feature selection method for automatically generated relational attributes
Although data mining methods require a flat mining table as input, in many real-world applications analysts are interested in finding patterns in a relational database. To this end, methods and software have recently been developed that automatically add attributes (or features) to a target table of a relational database, summarizing information from all the other tables. When attributes are constructed automatically in this way, selecting the important ones is particularly difficult because a large number of the attributes are highly correlated. In this setting, attribute selection techniques such as the Least Absolute Shrinkage and Selection Operator (lasso), the elastic net, and other machine learning methods tend to underperform. In this paper, we introduce a novel attribute selection procedure in which, after an initial screening step, we cluster the attributes into groups and apply the group lasso to select first the true attribute groups and then the true attributes within them. The procedure is particularly suited to high-dimensional data sets in which the attributes are highly correlated. We test our procedure on several simulated data sets and on a real-world data set from a marketing database. The results show that the proposed procedure achieves higher predictive performance while selecting a much smaller set of attributes than other state-of-the-art methods.
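The three stages named in the abstract (screening, clustering of attributes, group lasso) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the screening rule or the clustering algorithm, so the sketch assumes a marginal-correlation screen and a greedy correlation-threshold grouping, and it stops at group selection without the final within-group step. All function names, the toy data, and the values of `keep`, `threshold`, and `lam` are hypothetical. The group lasso is solved by plain proximal gradient descent with the usual square-root-of-group-size weights.

```python
import math
import random

def standardize(col):
    """Center a column and scale it to unit variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
    return [(v - mean) / sd for v in col]

def corr(a, b):
    """Pearson correlation of two standardized columns."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def screen(cols, y, keep):
    """Stage 1 (assumed rule): keep the `keep` attributes most correlated with y."""
    ranked = sorted(range(len(cols)), key=lambda j: -abs(corr(cols[j], y)))
    return sorted(ranked[:keep])

def cluster(cols, idx, threshold):
    """Stage 2 (assumed rule): an attribute joins the first group whose first
    member it correlates with at or above `threshold`, else starts a new group."""
    groups = []
    for j in idx:
        for g in groups:
            if abs(corr(cols[j], cols[g[0]])) >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def group_lasso(cols, y, groups, lam, iters=300):
    """Stage 3: proximal gradient descent on
    (1/2n)||y - X b||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    n = len(y)
    beta = {j: 0.0 for g in groups for j in g}
    # Safe step size: the largest eigenvalue of X'X/n is at most the
    # number of standardized columns (its trace).
    step = 1.0 / len(beta)
    for _ in range(iters):
        resid = [y[i] - sum(beta[j] * cols[j][i] for j in beta) for i in range(n)]
        # Gradient step on the squared-error term.
        z = {j: beta[j] + step * sum(cols[j][i] * resid[i] for i in range(n)) / n
             for j in beta}
        # Group soft-thresholding: shrinks each group, possibly to exactly zero.
        for g in groups:
            norm = math.sqrt(sum(z[j] ** 2 for j in g))
            thresh = step * lam * math.sqrt(len(g))
            shrink = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
            for j in g:
                beta[j] = shrink * z[j]
    return beta

def demo(seed=0, n=300):
    """Toy data: attributes 0-2 copy a relevant latent factor, 3-5 copy an
    irrelevant one, 6-9 are pure noise."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    raw = [[v + 0.1 * rng.gauss(0, 1) for v in z] for z in (z1, z1, z1, z2, z2, z2)]
    raw += [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
    y = standardize([2 * a + 0.5 * rng.gauss(0, 1) for a in z1])
    cols = [standardize(c) for c in raw]
    kept = screen(cols, y, keep=6)
    groups = cluster(cols, kept, threshold=0.8)
    beta = group_lasso(cols, y, groups, lam=0.3)
    selected = [g for g in groups if any(beta[j] != 0.0 for j in g)]
    return kept, groups, selected

kept, groups, selected = demo()
print("groups after screening and clustering:", groups)
print("groups kept by the group lasso:", selected)
```

Because the three copies of the relevant factor are nearly collinear, they fall into a single group, and the penalty keeps or drops them together. Selecting at the group level sidesteps the instability that coordinate-wise penalties show when attributes are highly correlated.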
Keywords: Relational attribute generation, Feature selection, Lasso, Elastic net, Clustering