
Towards Robust Performance Guarantees for Models Learned from High-Dimensional Data

Chapter

Part of the Studies in Big Data book series (SBD, volume 9)

Abstract

Models learned from high-dimensional spaces, where the number of features can exceed the number of observations, are susceptible to overfitting, since subspaces of interest for the learning task can be selected by chance. In these spaces, model performance is commonly highly variable and depends on the target error estimators, the regularities of the data, and the properties of the model. Highly variable performance is a common problem in the analysis of omics data, healthcare data, and collaborative filtering data, as well as of datasets composed of features extracted from unstructured data or mapped from multi-dimensional databases. In these contexts, assessing the statistical significance of the performance guarantees of models learned from high-dimensional spaces is critical to validate and weigh the increasingly available scientific statements derived from the behavior of these models. This chapter therefore surveys the challenges and opportunities of evaluating models learned in big data settings from the less-studied angle of big dimensionality. In particular, we propose a methodology to bound and compare the performance of multiple models. First, a set of prominent challenges is synthesized. Second, a set of principles is proposed to answer the identified challenges. These principles provide a roadmap with decisions to: i) select adequate statistical tests, loss functions and sampling schemes; ii) infer performance guarantees across multiple settings, including varying data regularities and learning parameterizations; and iii) guarantee their applicability to different types of models, including classification and descriptive models. To our knowledge, this work is the first attempt to provide a robust and flexible assessment of distinct types of models that is sensitive to both the dimensionality and the size of the data.
Empirical evidence supports the relevance of these principles: they offer a coherent setting to bound and compare the performance of models learned in high-dimensional spaces, and to study and refine the behavior of these models.
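The methodology outlined above combines a sampling scheme, a loss function, a paired statistical test, and a distribution-free bound on the true error. A minimal sketch of that pipeline is given below; the synthetic data, the nearest-centroid classifiers, the Wilcoxon signed-rank test, and the Hoeffding-style bound are illustrative assumptions, not the chapter's exact protocol.

```python
# Illustrative sketch (not the chapter's exact protocol): compare two
# classifiers on a high-dimensional dataset via paired cross-validation
# errors, then attach a distribution-free upper bound on the true error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic high-dimensional data: n = 60 observations, p = 500 features.
n, p = 60, 500
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def centroid_error(X_tr, y_tr, X_te, y_te):
    """0/1 loss of a nearest-centroid rule (an assumed, simple model)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_te - c0) ** 2).sum(axis=1)
    d1 = ((X_te - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred != y_te).mean())

def topk_centroid_error(X_tr, y_tr, X_te, y_te, k=20):
    """Same rule after selecting k features *inside* the training fold,
    so that feature selection does not bias the error estimate."""
    t = np.abs(X_tr[y_tr == 0].mean(0) - X_tr[y_tr == 1].mean(0))
    idx = np.argsort(t)[-k:]
    return centroid_error(X_tr[:, idx], y_tr, X_te[:, idx], y_te)

# Sampling scheme: 10-fold cross-validation, paired per-fold losses.
folds = np.array_split(rng.permutation(n), 10)
err_a, err_b = [], []
for te in folds:
    tr = np.setdiff1d(np.arange(n), te)
    err_a.append(centroid_error(X[tr], y[tr], X[te], y[te]))
    err_b.append(topk_centroid_error(X[tr], y[tr], X[te], y[te]))

# Statistical test: non-parametric paired test on the per-fold losses
# ("zsplit" keeps ties, so identical folds do not break the test).
stat, pval = stats.wilcoxon(err_a, err_b, zero_method="zsplit")

# Performance guarantee: with probability >= 1 - delta, the true error
# exceeds the empirical estimate by at most eps (Hoeffding inequality).
delta = 0.05
eps = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
print(f"mean errors: A={np.mean(err_a):.3f}, B={np.mean(err_b):.3f}")
print(f"Wilcoxon p-value: {pval:.3f}")
print(f"upper bound on A's true error: {np.mean(err_a) + eps:.3f}")
```

Swapping the loss function, the resampling scheme, or the test changes which guarantee is obtained; the chapter's principles are precisely about making those choices explicit and adequate for the data at hand.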

Keywords

  • high-dimensional data
  • performance guarantees
  • statistical significance of learning models
  • error estimators
  • classification
  • biclustering




Author information


Correspondence to Rui Henriques.


Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Henriques, R., Madeira, S.C. (2015). Towards Robust Performance Guarantees for Models Learned from High-Dimensional Data. In: Hassanien, A., Azar, A., Snasael, V., Kacprzyk, J., Abawajy, J. (eds) Big Data in Complex Systems. Studies in Big Data, vol 9. Springer, Cham. https://doi.org/10.1007/978-3-319-11056-1_3


  • DOI: https://doi.org/10.1007/978-3-319-11056-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11055-4

  • Online ISBN: 978-3-319-11056-1

  • eBook Packages: Engineering (R0)