Big Data Classification: Aspects on Many Features and Many Observations

  • Conference paper
Analysis of Large and Complex Data

Abstract

In this paper we discuss the performance of classical classification methods on Big Data. We distinguish two cases: many features and many observations. For the many-features case we look at projection methods, distance-based methods, and feature selection. For the many-observations case we mainly consider subsampling. The examples in this paper show that standard classification methods should not be blindly applied to Big Data.

Notes

  1. Thanks to T. Glasmachers for suggesting this definition.

  2. This part of the paper was supported by the Mercator Research Center Ruhr, grant Pr-2013-0015, see http://www.largescalesvm.de/.

  3. This simulation was carried out using the R-packages BatchJobs (Bischl et al. 2015) and mlr on the SLURM cluster of the Statistics Department of TU Dortmund University.

  4. This example is inspired by Fan et al. (2011).

  5. We used the R-library libSVM, see http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  6. Data sets taken from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

References

  • Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119–137.

  • Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989–1010.

  • Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J., & Weihs, C. (2015). BatchJobs and BatchExperiments: Abstraction mechanisms for using R in batch environments. Journal of Statistical Software, 64(11). doi:10.18637/jss.v064.i11

  • Boulesteix, A. L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3, 1–33.

  • Fan, J., Fan, Y., & Wu, Y. (2011). High-dimensional classification. In T. T. Cai, & X. Shen (Eds.), High-dimensional data analysis (pp. 3–37). New Jersey: World Scientific.

  • Graf, H. P., Cosatto, E., Bottou, L., Durdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528.

  • Kiiveri, H. T. (2008). A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations. BMC Bioinformatics, 9, 195. doi:10.1186/1471-2105-9-195

  • Meyer, O., Bischl, B., & Weihs, C. (2013). Support vector machines on large data sets: Simple parallel approaches. In M. Spiliopoulou, L. Schmidt-Thieme, & R. Janning (Eds.), Data analysis, machine learning, and knowledge discovery (pp. 87–95). Berlin: Springer.

  • R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/

Corresponding author

Correspondence to Claus Weihs.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Weihs, C., Horn, D., Bischl, B. (2016). Big Data Classification: Aspects on Many Features and Many Observations. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_10
