Big Data Classification: Aspects on Many Features and Many Observations

Weihs, Claus; Horn, Daniel; Bischl, Bernd

doi:10.1007/978-3-319-25226-1_10

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

In this paper we discuss the performance of classical classification methods on Big Data. We distinguish two cases: many features and many observations. For the many-features case we look at projection methods, distance-based methods, and feature selection. For the many-observations case we mainly consider subsampling. The examples in this paper show that standard classification methods should not be applied blindly to Big Data.

Notes

1. Thanks to T. Glasmachers for suggesting this definition.
2. This part of the paper was supported by the Mercator Research Center Ruhr, grant Pr-2013-0015, see http://www.largescalesvm.de/.
3. This simulation was carried out using the R-packages BatchJobs (Bischl et al. 2015) and mlr on the SLURM cluster of the Statistics Department of TU Dortmund University.
4. This example is inspired by Fan et al. (2011).
5. We used the R-library libSVM, see http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
6. Data sets taken from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

References

Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119–137.
Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989–1010.
Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J., & Weihs, C. (2015). BatchJobs and BatchExperiments: Abstraction mechanisms for using R in batch environments. Journal of Statistical Software, 64(11). doi:10.18637/jss.v064.i11
Boulesteix, A. L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3, 1–33.
Fan, J., Fan, Y., & Wu, Y. (2011). High-dimensional classification. In T. T. Cai, & X. Shen (Eds.), High-dimensional data analysis (pp. 3–37). New Jersey: World Scientific.
Graf, H.P., Cosatto, E., Bottou, L., Durdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528.
Kiiveri, H.T. (2008). A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations. BMC Bioinformatics, 9, 195. doi:10.1186/1471-2105-9-195
Meyer, O., Bischl, B., & Weihs, C. (2013). Support vector machines on large data sets: Simple parallel approaches. In M. Spiliopoulou, L. Schmidt-Thieme, & R. Janning (Eds.), Data analysis, machine learning, and knowledge discovery (pp. 87–95). Berlin: Springer.
R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/

Author information

Authors and Affiliations

Chair of Computational Statistics, Faculty of Statistics, TU Dortmund, Dortmund, Germany
Claus Weihs
Department of Statistics, TU Dortmund University, Dortmund, Germany
Daniel Horn & Bernd Bischl

Corresponding author

Correspondence to Claus Weihs.

Editor information

Editors and Affiliations

Jacobs University Bremen, Bremen, Germany
Adalbert F.X. Wilhelm
Institute of Medical Systems Biology, Universität Ulm, Ulm, Baden-Württemberg, Germany
Hans A. Kestler

Cite this paper

Weihs, C., Horn, D., Bischl, B. (2016). Big Data Classification: Aspects on Many Features and Many Observations. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_10

DOI: https://doi.org/10.1007/978-3-319-25226-1_10
Published: 04 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
