Abstract
In this paper we discuss the performance of classical classification methods on Big Data. We distinguish two cases: many features and many observations. For the many-features case we look at projection methods, distance-based methods, and feature selection. For the many-observations case we mainly consider subsampling. The examples in this paper show that standard classification methods should not be blindly applied to Big Data.
Notes
1. Thanks to T. Glasmachers for suggesting this definition.
2. This part of the paper was supported by the Mercator Research Center Ruhr, grant Pr-2013-0015, see http://www.largescalesvm.de/.
3. This simulation was carried out using the R-packages BatchJobs (Bischl et al. 2015) and mlr on the SLURM cluster of the Statistics Department of TU Dortmund University.
4. This example is inspired by Fan et al. (2011).
5. We used the R-library libSVM, see http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
6. Data sets taken from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
References
Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119–137.
Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989–1010.
Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J., & Weihs, C. (2015). BatchJobs and BatchExperiments: Abstraction mechanisms for using R in batch environments. Journal of Statistical Software, 64(11). doi:10.18637/jss.v064.i11
Boulesteix, A. L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3, 1–33.
Fan, J., Fan, Y., & Wu, Y. (2011). High-dimensional classification. In T. T. Cai, & X. Shen (Eds.), High-dimensional data analysis (pp. 3–37). New Jersey: World Scientific.
Graf, H. P., Cosatto, E., Bottou, L., Durdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528.
Kiiveri, H. T. (2008). A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations. BMC Bioinformatics, 9, 195. doi:10.1186/1471-2105-9-195
Meyer, O., Bischl, B., & Weihs, C. (2013). Support vector machines on large data sets: Simple parallel approaches. In M. Spiliopoulou, L. Schmidt-Thieme, & R. Janning (Eds.), Data analysis, machine learning, and knowledge discovery (pp. 87–95). Berlin: Springer.
R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Weihs, C., Horn, D., Bischl, B. (2016). Big Data Classification: Aspects on Many Features and Many Observations. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_10
DOI: https://doi.org/10.1007/978-3-319-25226-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)