Machine Learning

, Volume 108, Issue 1, pp 29–47 | Cite as

Learning to predict soccer results from relational data with gradient boosted trees

  • Ondřej HubáčekEmail author
  • Gustav Šourek
  • Filip Železný
Part of the following topical collections:
  1. Special Issue on Machine Learning for Soccer


We describe our winning solution to the 2017’s Soccer Prediction Challenge organized in conjunction with the MLJ’s special issue on Machine Learning for Soccer. The goal of the challenge was to predict outcomes of future matches within a selected time-frame from different leagues over the world. A dataset of over 200,000 past match outcomes was provided to the contestants. We experimented with both relational and feature-based methods to learn predictive models from the provided data. We employed relevant latent variables computable from the data, namely so called pi-ratings and also a rating based on the PageRank method. A method based on manually constructed features and the gradient boosted tree algorithm performed best on both the validation set and the challenge test set. We also discuss the validity of the assumption that probability predictions on the three ordinal match outcomes should be monotone, underlying the RPS measure of prediction quality.


Prediction challenge Relational data Soccer Gradient boosted trees Relational dependency networks Sports Forecasting 



The authors acknowledge support by project no. 17-26999S granted by the Czech Science Foundation. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”. We thank the anonymous reviewers for their constructive comments.

Author Contributions

OH processed the data, developed the concept, ran and evaluated the experiments and participated in the writing of the manuscript, GŠ provided consultations through each stage of the work and participated in the writing of the manuscript, FŽ supervised the work and wrote the final manuscript.


  1. Baio, G., & Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37(2), 253–264.MathSciNetCrossRefGoogle Scholar
  2. Chen ,T. & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp 785–794). ACM.Google Scholar
  3. Constantinou, A. C., & Fenton, N. E. (2013). Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries. Journal of Quantitative Analysis in Sports, 9(1), 37–50.CrossRefGoogle Scholar
  4. Constantinou, A. C., Fenton, N. E., & Neil, M. (2012a). pi-football: A Bayesian network model for forecasting association football match outcomes. Knowledge-Based Systems, 36, 322–339.CrossRefGoogle Scholar
  5. Constantinou, A. C., Fenton, N. E., et al. (2012b). Solving the problem of inadequate scoring rules for assessing probabilistic football forecast models. Journal of Quantitative Analysis in Sports, 8(1), 1559-0410.CrossRefGoogle Scholar
  6. Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8(6), 985–987.CrossRefGoogle Scholar
  7. Forrest, D., Goddard, J., & Simmons, R. (2005). Odds-setters as forecasters: The case of English football. International Journal of Forecasting, 21(3), 551–564.CrossRefGoogle Scholar
  8. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.MathSciNetCrossRefzbMATHGoogle Scholar
  9. Goddard, J. (2005). Regression models for forecasting goals and match results in association football. International Journal of Forecasting, 21(2), 331–340.CrossRefGoogle Scholar
  10. Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result prediction in association football. International Journal of Forecasting, 26(3), 460–470.CrossRefGoogle Scholar
  11. Koopman, S. J., & Lit, R. (2015). A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(1), 167–186.MathSciNetCrossRefGoogle Scholar
  12. Lago-Ballesteros, J., & Lago-Peñas, C. (2010). Performance in team sports: Identifying the keys to success in soccer. Journal of Human Kinetics, 25, 85–91.CrossRefGoogle Scholar
  13. Lahvička, J. (2015). Using Monte Carlo simulation to calculate match importance: The case of English Premier League. Journal of Sports Economics, 16(4), 390–409.CrossRefGoogle Scholar
  14. Lasek, J., Szlávik, Z., & Bhulai, S. (2013). The predictive power of ranking systems in association football. International Journal of Applied Pattern Recognition, 1(1), 27–46.CrossRefGoogle Scholar
  15. Lazova, V. & Basnarkov, L. (2015). PageRank approach to ranking national football teams. arXiv preprint arXiv:1503.01331.
  16. McHale, I., & Scarf, P. (2007). Modelling soccer matches using bivariate discrete distributions with general dependence structure. Statistica Neerlandica, 61(4), 432–445.MathSciNetCrossRefzbMATHGoogle Scholar
  17. Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2010). Boosting relational dependency networks. In Online Proceedings of the international conference on inductive logic programming, 2010 (pp. 1–8).Google Scholar
  18. Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2012). Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning, 86(1), 25–56.MathSciNetCrossRefzbMATHGoogle Scholar
  19. Oberstone, J., et al. (2009). Differentiating the top English Premier League football clubs from the rest of the pack: Identifying the keys to success. Journal of Quantitative Analysis in Sports, 5(3), 10.MathSciNetCrossRefGoogle Scholar
  20. Odom, P. & Natarajan, S. (2016). Actively interacting with experts: A probabilistic logic approach. In Joint European conference on machine learning and knowledge discovery in databases (pp. 527–542). Springer.Google Scholar
  21. Pollard, R., & Pollard, G. (2005). Home advantage in soccer: A review of its existence and causes. International Journal of Soccer and Science, 3(1), 28–44.MathSciNetGoogle Scholar
  22. Štrumbelj, E. (2014). On determining probability forecasts from betting odds. International Journal of Forecasting, 30(4), 934–943.CrossRefGoogle Scholar
  23. Van Haaren, J. & Davis, J. (2015). Predicting the final league tables of domestic football leagues. In Proceedings of the 5th international conference on mathematics in sport (pp. 202–207).Google Scholar
  24. Van Haaren, J. & Van den Broeck, G. (2015). Relational learning for football-related predictions. In Latest advances in inductive logic programming, world scientific (pp. 237–244).Google Scholar

Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. 1.Czech Technical University in PraguePragueCzech Republic

Personalised recommendations