The Open International Soccer Database for machine learning
How well can machine learning predict the outcome of a soccer game, given the most commonly and freely available match data? To help answer this question and to facilitate machine learning research in soccer, we have developed the Open International Soccer Database. Version v1.0 of the Database contains essential information from 216,743 league soccer matches from 52 leagues in 35 countries. The earliest entries in the Database are from the year 2000, which is when football leagues generally adopted the “three points for a win” rule. To demonstrate the use of the Database for machine learning research, we organized the 2017 Soccer Prediction Challenge. One of the goals of the Challenge was to estimate where the limits of predictability lie, given the type of match data contained in the Database. Another goal of the Challenge was to pose a real-world machine learning problem with a fixed time line and a genuine prediction task: to develop a predictive model from the Database and then to predict the outcome of the 206 future soccer matches taking place from 31 March 2017 to the end of the regular season. The Open International Soccer Database is released as an open science project, providing a valuable resource for soccer analysts and a unique benchmark for advanced machine learning methods. Here, we describe the Database and the 2017 Soccer Prediction Challenge and its results.
Keywords: Open International Soccer Database; 2017 Soccer Prediction Challenge; Open science; Soccer analytics
After we released the Challenge data sets, we received valuable feedback from the participants regarding the evaluation of the predicted outcomes. In particular, we wish to thank team ACC and team FK for their constructive comments. We also thank the three anonymous reviewers for their valuable comments. JD is partially supported by the KU Leuven Research Fund (C14/17/070, C22/15/015, C32/17/036), FWO-Vlaanderen (SBO-150033) and Interreg V A project NANO4Sports.