Abstract
A basic step in any data-mining or machine-learning task is deciding which model to use, given the problem and the data at hand. In this paper we investigate when non-linear classifiers outperform linear classifiers by means of a large-scale experiment. We benchmark linear and non-linear versions of three types of classifiers (support vector machines, neural networks, and decision trees), and analyze the results to determine on what types of datasets the non-linear version performs better. To the best of our knowledge, this work is the first principled, large-scale attempt to test the common assumption that non-linear classifiers excel only when large amounts of data are available.
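The kind of pairwise comparison the abstract describes can be sketched with scikit-learn (which the paper cites as its tooling). The sketch below compares a linear-kernel SVM against an RBF-kernel SVM on one dataset via cross-validation; the dataset, hyperparameters, and cross-validation setup here are illustrative assumptions, not the paper's actual benchmark protocol.

```python
# Sketch of one linear vs. non-linear comparison, assuming scikit-learn.
# The digits dataset and default hyperparameters are placeholders; the
# paper's benchmark spans many OpenML datasets and tuned configurations.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Same preprocessing for both models, so only the kernel differs.
linear = make_pipeline(StandardScaler(), SVC(kernel="linear"))
nonlinear = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

acc_linear = cross_val_score(linear, X, y, cv=5).mean()
acc_nonlinear = cross_val_score(nonlinear, X, y, cv=5).mean()

print(f"linear SVM:     {acc_linear:.3f}")
print(f"RBF-kernel SVM: {acc_nonlinear:.3f}")
```

Repeating this per classifier family over many datasets, and relating the accuracy gap to dataset properties (e.g. size), is the shape of the study, though the exact experimental design is detailed in the full paper.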
Notes
- 1.
In this study, we do not compare (still quite interpretable) decision trees against (more powerful, yet less interpretable) random forests in order to limit ourselves purely to a comparison of linear vs. non-linear models.
Acknowledgement
This work has been partly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721. The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Strang, B., Putten, P.v.d., Rijn, J.N.v., Hutter, F. (2018). Don’t Rule Out Simple Models Prematurely: A Large Scale Benchmark Comparing Linear and Non-linear Classifiers in OpenML. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science(), vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01767-5
Online ISBN: 978-3-030-01768-2
eBook Packages: Computer Science (R0)