Incorporating domain knowledge in machine learning for soccer outcome prediction
The task of the 2017 Soccer Prediction Challenge was to use machine learning to predict the outcome of future soccer matches based on a data set describing the outcomes of 216,743 past matches. One goal of the Challenge was to gauge where the limits of predictability lie with this type of commonly available data. Another goal was to pose a real-world machine learning challenge with a fixed timeline, involving the prediction of real future events. Here, we present two novel ideas for integrating soccer domain knowledge into the modeling process. Based on these ideas, we developed two new feature engineering methods for match outcome prediction, which we refer to as recency feature extraction and rating feature learning. Using these methods, we constructed two learning sets from the Challenge data. The top-ranking model of the 2017 Soccer Prediction Challenge was our k-nearest neighbor model trained on the rating feature learning set. In further experiments, we slightly improved on this performance with an ensemble of extreme gradient boosted trees (XGBoost). Our study suggests that a key factor in soccer match outcome prediction is the successful incorporation of domain knowledge into the machine learning modeling process.
Keywords: 2017 Soccer Prediction Challenge; Feature engineering; k-NN; Knowledge representation; Open International Soccer Database; Rating feature learning; Recency feature extraction; Soccer analytics; XGBoost
We thank the three anonymous reviewers for their detailed comments, which helped us greatly to improve this manuscript.