How Data Availability Affects the Ability to Learn Good xG Models

  • Conference paper
Machine Learning and Data Mining for Sports Analytics (MLSA 2020)

Abstract

Motivated by the fact that some shots are better than others, the expected goals (xG) metric attempts to quantify the quality of goal-scoring opportunities in soccer. The metric is becoming increasingly popular, making its way to TV analysts’ desks. Yet, a vastly underexplored topic in the context of xG is how these models are affected by the data on which they are trained. In this paper, we explore several data-related questions that may affect the performance of an xG model. We show that the amount of data needed to train an accurate xG model depends on the complexity of the learner and the number of features, with up to 5 seasons of data needed to train a complex gradient boosted trees model. Although the style of play changes over time and varies between leagues, we do not find that using only recent data or league-specific models improves the accuracy significantly. Hence, if limited data is available, training models on less recent data or on data from different leagues is a viable solution. Mixing data from multiple data sources, however, should be avoided.

Notes

  1. Since penalties and free-kicks are relatively easy to predict, our xG models might seem less accurate than other models which include these penalty and free-kick shots.

  2. This version of the Brier score is only valid for binary classification. The original definition by Brier is applicable to multi-category classification as well.
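
As a minimal sketch of the distinction drawn in Note 2, and assuming the binary version referred to is the usual mean squared difference between the predicted goal probability and the 0/1 outcome, the two definitions can be compared in Python. The function names and the four example shots are hypothetical and chosen only for illustration.

```python
def brier_binary(probs, outcomes):
    """Binary form: mean squared difference between the predicted
    goal probability and the observed 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_multiclass(prob_rows, labels):
    """Brier's original form: squared error summed over all classes,
    then averaged over samples. Each row holds one probability per class."""
    total = 0.0
    for row, label in zip(prob_rows, labels):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(row)
        )
    return total / len(prob_rows)


# Hypothetical xG predictions for four shots and whether each was a goal.
xg = [0.1, 0.8, 0.3, 0.05]
goal = [0, 1, 0, 0]

binary = brier_binary(xg, goal)

# The same shots written as two-class rows: [P(no goal), P(goal)].
two_class = brier_multiclass([[1 - p, p] for p in xg], goal)

print(binary, two_class)
```

For a two-class problem the multi-category form is exactly twice the binary form, because the squared error on the "no goal" class equals the squared error on the "goal" class. The two forms therefore rank models identically on binary outcomes, but the raw values are not directly comparable.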

References

  1. Brier, G.W.: Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78(1), 1–3 (1950)

  2. Caley, M.: Premier league projections and new expected goals (2015). Accessed 27 May 2020. https://cartilagefreecaptain.sbnation.com/2015/10/19/9295905/premier-league-projections-and-new-expected-goals

  3. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, ACM, New York, NY, USA, pp. 785–794 (2016). https://doi.org/10.1145/2939672.2939785

  4. Decroos, T., Bransen, L., Van Haaren, J., Davis, J.: Actions speak louder than goals: Valuing player actions in soccer. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1851–1861 (2019)

  5. Decroos, T., Davis, J.: Interpretable prediction of goals in soccer. In: Proceedings of the AAAI-20 Workshop on Artificial Intelligence in Team Sports, December 2019

  6. Fairchild, A., Pelechrinis, K., Kokkodis, M.: Spatial analysis of shots in MLS: a model for expected goals and fractal dimensionality. J. Sports Anal. 4(3), 165–174 (2018)

  7. Gelade, G.: Which team formations produce the most expected goals? July 2017. http://business-analytic.co.uk/blog/which-team-formations-produce-the-most-expected-goals/

  8. Green, S.: Assessing the performance of premier league goalscorers, April 2012. https://www.optasportspro.com/news-analysis/assessing-the-performance-of-premier-league-goalscorers/

  9. Ijtsma, S.: A close look at my new expected goals model, August 2015. http://www.11tegen11.com/2015/08/14/a-close-look-at-my-new-expected-goals-model/

  10. Kullowatz, M.: Expected goals 3.0 methodology, April 2015. https://www.americansocceranalysis.com/home/2015/4/14/expected-goals-methodology

  11. Manfredi, G.: Expected goals & player analysis, May 2019. https://www.kaggle.com/gabrielmanfredi/expected-goals-player-analysis

  12. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  13. Statsbomb: Danish football analysis, April 2019. https://divisionsforeningen.dk/wp-content/uploads/2019/04/Superliga_Analysis.pdf

  14. Webb, G.I., Ting, K.M.: On the application of ROC analysis to predict classification performance under varying class distributions. Mach. Learn. 58(1), 25–32 (2005)

Acknowledgements

This research received funding from the KU Leuven Research Fund (C14/17/070) and the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

Corresponding author

Correspondence to Pieter Robberechts.

A Hyperparameters

Logistic Regression + Basic Features

figure a

XGBoost + Basic Features

figure b

Logistic Regression + Advanced Features

figure c

XGBoost + Advanced Features

figure d

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Robberechts, P., Davis, J. (2020). How Data Availability Affects the Ability to Learn Good xG Models. In: Brefeld, U., Davis, J., Van Haaren, J., Zimmermann, A. (eds) Machine Learning and Data Mining for Sports Analytics. MLSA 2020. Communications in Computer and Information Science, vol 1324. Springer, Cham. https://doi.org/10.1007/978-3-030-64912-8_2

  • DOI: https://doi.org/10.1007/978-3-030-64912-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64911-1

  • Online ISBN: 978-3-030-64912-8

  • eBook Packages: Computer Science (R0)
