
Factor Preselection and Multiple Measures of Dependence

  • Nina Büchel
  • Kay. F. Hildebrand
  • Ulrich Müller-Funk
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Factor selection or factor reduction is carried out to reduce the complexity of a data analysis problem (classification, regression) or to improve the fit of a model (via parameter estimation). In data mining there is a particular need for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Insurance companies, for example, face data sets that contain hundreds of attributes or factors per object. With a large number of factors, the selection procedure requires a suitable process model. Such a process model becomes essential once data analysis is to be (semi-)automated.

We suggest an approach that proceeds in two phases. In the first phase, we cluster attributes that are highly correlated in order to identify factor combinations that, statistically speaking, are near duplicates. In the second phase, we choose factors from each cluster that are highly associated with a target variable. The implementation requires some form of non-linear canonical correlation analysis. We define a correlation measure for two blocks of factors that is employed as a measure of similarity within the clustering process. Such measures, in turn, are based on multiple indices of dependence. Few indices have been introduced in the literature, cf. Wolff (Stochastica 4(3):175–188, 1980). All of them, however, are hard to interpret if the number of dimensions considerably exceeds two. For that reason, we propose signed measures that can be interpreted in the usual way.
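The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: as a simplifying assumption it replaces the block-wise correlation measure and the non-linear canonical correlation analysis with plain pairwise Spearman rank correlations, and it uses average-linkage hierarchical clustering. The function names `spearman_matrix` and `preselect_factors` and the `threshold` parameter are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_matrix(X):
    """Pairwise Spearman rank correlation between the columns of X."""
    ranks = np.apply_along_axis(rankdata, 0, X)
    return np.corrcoef(ranks, rowvar=False)


def preselect_factors(X, y, threshold=0.3):
    """Two-phase preselection sketch (assumption: pairwise Spearman
    correlation stands in for the paper's block-wise measure).

    Phase 1: cluster columns of X whose dissimilarity 1 - |rho| is small.
    Phase 2: from each cluster keep the column most associated with y.
    Returns the sorted list of selected column indices and cluster labels.
    """
    # Phase 1: near-duplicate factors end up in the same cluster.
    rho = spearman_matrix(X)
    dist = 1.0 - np.abs(rho)
    dist = np.clip((dist + dist.T) / 2.0, 0.0, None)  # symmetric, non-negative
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")

    # Phase 2: rank-based association of every factor with the target.
    y_rank = rankdata(y)
    relevance = np.array([
        abs(np.corrcoef(rankdata(X[:, j]), y_rank)[0, 1])
        for j in range(X.shape[1])
    ])
    selected = []
    for cluster_id in np.unique(labels):
        members = np.flatnonzero(labels == cluster_id)
        selected.append(int(members[np.argmax(relevance[members])]))
    return sorted(selected), labels
```

Calling `preselect_factors(X, y)` on a numeric matrix `X` (one column per factor) and a target vector `y` returns one representative factor per cluster. The distance cut-off `threshold` decides how strongly correlated two factors must be to count as near duplicates and would need tuning for real data.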

References

  1. Becker, J., & Schütte, R. (1996). Handelsinformationssysteme. Landsberg/Lech: Verl. Moderne Industrie.
  2. Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
  3. Kiesl, H. (2003). Ordinale Streuungsmasse: Theoretische Fundierung und statistische Anwendung. PhD thesis, Universität Bamberg.
  4. Renyi, A. (1958). On measures of dependence. Acta Mathematica Hungarica, 9, 441–451.
  5. Rüschendorf, L. (1976). Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, 4, 912–923.
  6. Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139, 3921–3927.
  7. Schmid, F., Blumentritt, T., Gaißer, S., Ruppert, M., & Schmidt, R. (2010). Copula-based measures of multivariate association. In F. Durante, W. Härdle, P. Jaworski, & T. Rychlik (Eds.), Workshop on copula theory and its applications, Warsaw. Berlin Heidelberg: Springer-Verlag.
  8. Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Stuttgart: Teubner Verlag.
  9. Wolff, E. F. (1980). N-dimensional measures of dependence. Stochastica, 4(3), 175–188.

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Nina Büchel (1)
  • Kay. F. Hildebrand (1)
  • Ulrich Müller-Funk (1)
  1. European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany
