Modeling outcomes of soccer matches

We compare various extensions of the Bradley-Terry model and a hierarchical Poisson log-linear model in terms of their performance in predicting the outcome of soccer matches (win, draw, or loss). The parameters of the Bradley-Terry extensions are estimated by maximizing the log-likelihood, or an appropriately penalized version of it, while the posterior densities of the parameters of the hierarchical Poisson log-linear model are approximated using integrated nested Laplace approximations. The prediction performance of the various modeling approaches is assessed using a novel, context-specific framework for temporal validation that is found to deliver accurate estimates of the test error. The direct modeling of outcomes via the various Bradley-Terry extensions and the modeling of match scores using the hierarchical Poisson log-linear model demonstrate similar behavior in terms of predictive performance.


Introduction
The current paper stems from our participation in the 2017 Machine Learning Journal (Springer) challenge on predicting outcomes of soccer matches from a range of leagues around the world (MLS challenge, in short). Details of the challenge and the data can be found in Berrar et al. (2017).
We consider two distinct modeling approaches for the task. The first approach focuses on modeling the probabilities of win, draw, or loss, using various extensions of Bradley-Terry models (Bradley and Terry, 1952). The second approach focuses on directly modeling the number of goals scored by each team in each match using a hierarchical Poisson log-linear model, building on the modeling frameworks in Maher (1982), Dixon and Coles (1997), Karlis and Ntzoufras (2003) and Baio and Blangiardo (2010).
The performance of the various modeling approaches in predicting the outcomes of matches is assessed using a novel, context-specific framework for temporal validation that is found to deliver accurate estimates of the prediction error. The direct modeling of the outcomes using the various Bradley-Terry extensions and the modeling of match scores using the hierarchical Poisson log-linear model deliver similar performance in terms of predicting the outcome.
The paper is structured as follows: Section 2 briefly introduces the data, presents the necessary datacleaning operations undertaken, and describes the various features that were extracted. Section 3 presents the various Bradley-Terry models and extensions we consider for the challenge and describes the associated estimation procedures. Section 4 focuses on the hierarchical Poisson log-linear model and the Integrated Nested Laplace Approximations (INLA; Rue et al. 2009) of the posterior densities for the model parameters.  Section 5 introduces the validation framework and the models are compared in terms of their predictive performance in Section 6. Section 7 concludes with discussion and future directions.
2 Pre-processing and feature extraction

Data exploration
The data contain matches from 52 leagues, covering 35 countries, for a varying number of seasons for each league. Nearly all leagues have data since 2008, with a few having data extending as far back as 2000. There are no cross-country leagues (e.g. UEFA Champions League) or teams associated with different countries. The only way that teams move between leagues is within each country by either promotion or relegation. Figure 1 shows the number of available matches for each country in the data set. England dominates the data in terms of matches recorded, with the available matches coming from 5 distinct leagues. The other highly-represented countries are Scotland with data from 4 distinct leagues, and European countries, such as Spain, Germany, Italy and France, most probably because they also have a high UEFA coefficient (Wikipedia, 2018). Figure 2 shows the number of matches per number of goals scored by the home (dark grey) and away (light grey) teams. Home teams appear to score more goals than away teams, with home teams having consistently higher frequencies for two or more goals and away teams having higher frequencies for no goal and one goal. Overall, home teams scored 304, 918 goals over the whole data set, whereas away teams scored 228, 293 goals. In Section 1 of the Supplementary Material, the trend shown in Figure 2 is also found to be present within each country, pointing towards the existence of a home advantage.

Data cleaning
Upon closer inspection of the original sequence of matches for the MLS challenge, we found and corrected the following three anomalies in the data. The complete set of matches from the 2015-2016 season of the Venezuelan league was duplicated in the data. We kept only one instance of these matches. Furthermore, 26 matches from the 2013-2014 season of the Norwegian league were assigned the year 2014 in the date field instead of 2013. The dates for these matches were modified accordingly. Finally, one match in the 2013-2014 season of the Moroccan league (Raja Casablanca vs Maghrib de Fes) was assigned the month February in the date field instead of August. The date for this match was corrected, accordingly.

Feature extraction
The features that were extracted can be categorized into team-specific, match-specific and/or season-specific. Match-specific features were derived from the information available on each match. Season-specific features have the same value for all matches and teams in a season of a particular league, and differ only across seasons for the same league and across leagues. Table 1 gives short names, descriptions, and ranges for the features that were extracted. Table 2 gives an example of what values the features take for an artificial data set with observations on the first 3 matches of a season for team A playing all matches at home. The team-specific features are listed only for Team A to allow for easy tracking of their evolution.
The features we extracted are proxies for a range of aspects of the game, and their choice was based on common sense and our understanding of what is important in soccer, and previous literature. Home (feature 1 in Table 2) can be used for including a home advantage in the models; newly promoted (feature 2 in Table 2) is used to account for the fact that a newly promoted team is typically weaker than the competition; days since previous match (feature 3 in Table 2) carries information regarding fatigue of the players and the team, overall; form (feature 4 in Table 2) is a proxy for whether a team is doing better or worse during a particular period in time compared to its general strength; matches played (feature 5 in Table 2) determines how far into the season a game occurs; points tally, goal difference, and points per match (features 6, 7 and 10 in Table 2) are measures of how well a team is doing so far in the season; goals scored per match and goals conceded per match (features 8 and 9 in Table 2) are measures of a team's attacking and defensive ability, respectively; previous season points tally and previous season goal difference (features 11 and 12 in Table 2) are measures of how well a team performed in the previous season, which can be a useful indicator of how well a team will perform in the early stages of a season when other features such as points tally do not carry much information; finally, team rankings (feature 13 in Table 2) refers to a variety of measures that rank teams based on their performance in previous matches, as detailed in Section 2 of the Supplementary Material.
In order to avoid missing data in the features we extracted, we made the following conventions. The value of form for the first match of the season for each team was drawn from a Uniform distribution in (0, 1). The form for the second and third match were a third of the points in the first match, and a sixth of the total points in the first two matches, respectively. Days since previous match was left unspecified for the very first match of the team in the data. If the team was playing its first season then we treated it as being newly promoted. The previous season points tally was set to 15 for newly promoted teams and to 65 for newly relegated teams, and the previous season goal difference was set to −35 for newly promoted teams and 35 for newly relegated teams. These values were set in an ad-hoc manner prior to estimation and validation, based on our sense and experience of what is a small or large value for the corresponding features. In principle, the choice of these values could be made more formally by minimizing a criterion of predictive quality, but we did not pursue this as it would complicate the estimation-prediction workflow described later in the paper and increase computational effort significantly without any guarantee of improving the predictive quality of the models.

Bradley-Terry models and extensions
The Bradley-Terry model (Bradley and Terry, 1952) is commonly used to model paired comparisons, which often arise in competitive sport. For a binary win/loss outcome, let where n is the number of teams present in the data. The Bradley-Terry model assumes that where π i = exp(λ i ), and λ i is understood as the "strength" of team i. In the original Bradley-Terry formulation, λ i does not vary with time.
For the purposes of the MLS challenge prediction task, we consider extensions of the original Bradley-Terry formulation where we allow λ i to depend on a p-vector of time-dependent features x it for team i at time t as λ it = f (x it ) for some function f (·). Bradley-Terry models can also be equivalently written as linking the log-odds of a team winning to the difference in strength of the two teams competing. Some of the extensions below directly specify that difference.

BL: Baseline
The simplest specification of all assumes that where h it = 1 if team i is playing at home at time t, and h it = 0 otherwise. The only parameter to estimate with this specification is β, which can be understood as the difference in strength when the team plays at home. We use this model to establish a baseline to improve upon for the prediction task.

CS: Constant strengths
This specification corresponds to the standard Bradley-Terry model with a home-field advantage, under which The above specification involves n + 1 parameters, where n is the number of teams. The parameter α i represents the time-invariant strength of the ith team.

LF: Linear with features
Suppose now that we are given a vector of features x it associated with team i at time t. A simple way to model the team strengths λ it is to assume that they are a linear combination of the features. Hence, in this model we have where x itk is the kth element of the feature vector x it . Note that the coefficients in the linear combination are shared between all teams, and so the number of parameters to estimate is p, where p is the dimension of the feature vector. This specification is similar to the one implemented in the R package BradleyTerry (Firth, 2005), but without the team specific random effects.

TVC: Time-varying coefficients
Some of the features we consider, like points tally season (feature 6 in Table 1) vary during the season. Ignoring any special circumstances such as teams being punished, the points accumulated by a team is a non-decreasing function of the number of matches the team has played.
It is natural to assume that the contribution of points accumulated to the strength of a team is different at the beginning of the season than it is at the end. In order to account for such effects, the parameters for the corresponding features can be allowed to vary with the matches played. Specifically, the team strengths can be modeled as where m it denotes the number of matches that team i has played within the current season at time t and V denotes the set of coefficients that are allowed to vary with the matches played. The functions γ k (m it ) can be modeled non-parametrically, but in the spirit of keeping the complexity low we instead set γ k (m it ) = α k + β k m it . With this specification for γ k (m it ), TVC is equivalent to LF with the inclusion of an extra set of features {m it x itk } k∈V .

AFD: Additive feature differences with time interactions
For the LF specification, the log-odds of team i beating team j is Hence, the LF specification assumes that the difference in strength between the two teams is a linear combination of differences between the features of the teams. We can relax the assumption of linearity, and include non-linear time interactions, by instead assuming that each difference in features contributes to the difference in strengths through an arbitrary bivariate smooth function g k that depends on the feature difference and the number of matches played. We then arrive at the AFD specification, which can be written as where for simplicity we take the number of matches played to be the number of matches played by the home team.

Handling draws
The extra outcome of a draw in a soccer match can be accommodated within the Bradley-Terry formulation in two ways. The first is to treat win, loss and draw as multinomial ordered outcomes, in effect assuming that "win" "draw" "loss", where denotes strong transitive preference. Then, the ordered outcomes can be modeled using cumulative link models (Agresti, 2015) with the various strength specifications. Specifically, let 1 , if team i and j draw at time t , 0 , if team j beats team i at time t . and assume that y ijt has p(y ijt ≤ y) = e δy+λit e δy+λit + e λjt , where −∞ < δ 0 ≤ δ 1 < δ 2 = ∞, and δ 0 , δ 1 are parameters to be estimated from the data. Cattelan et al. (2013) and Király and Qian (2017) use of this approach for modeling soccer outcomes. Another possibility for handling draws is to use the Davidson (1970) extension of the Bradley-Terry model, under which where δ is a parameter to be estimated from the data.

Likelihood-based approaches
The parameters of the Bradley-Terry model extensions presented above can be estimated by maximizing the log-likelihood of the multinomial distribution.
The log-likelihood about the parameter vector θ is where I A takes the value 1 if A holds and 0 otherwise, and M is the set of triplets {i, j, t} corresponding to the matches whose outcomes have been observed. For estimating the functions involved in the AFD specification, we represent each f k using thin plate splines (Wahba, 1990), and enforce smoothness constraints on the estimate of f k by maximizing a penalized log-likelihood of the form where P is a penalty matrix and k is a tuning parameter. For penalized estimation we only consider ordinal models through the R package mgcv (Wood, 2006), and select k by optimizing the Generalized Cross Validation criterion (Golub et al., 1979). Details on the fitting procedure for specifications like AFD and the implementation of thin plate spline regression in mgcv can be found in Wood (2003). The parameters of the Davidson extensions of the Bradley-Terry model are estimated by using the BFGS optimization algorithm (Byrd et al., 1995) to minimize − (θ).

Identifiability
In the CS model, the team strengths are identifiable only up to an additive constant, because λ i − λ j = (λ i + d) − (λ j + d) for any d ∈ . This unidentifiability can be dealt with by setting the strength of an arbitrarily chosen team to zero. The CS model was fitted league-by-league with one identifiability constraint per league.
The parameters δ 0 and δ 1 in (6) are identifiable only if the specification used for λ i − λ j does not involve an intercept parameter. An alternative is to include an intercept parameter in λ i − λ j and fix δ 0 at a value. The estimated probabilities are invariant to these alternatives, and we use the latter simply because this is the default in the mgcv package.

Other data-specific considerations
The parameters in the LF, TVC, and AFD specifications (which involve features) are shared across the leagues and matches in the data. For computational efficiency we restrict the fitting procedures to use the 20, 000 most recent matches, or less if less is available, at the time of the first match that a prediction needs to be made. The CS specification requires estimating the strength parameters directly. For computational efficiency, we estimate the strength parameters independently for each league within each country, and only consider matches that took place in the past calendar year from the date of the first match that a prediction needs to be made.

Modeling scores 4.1 Model structure
Every league consists of a number of teams T , playing against each other twice in a season (once at home and once away). We indicate the number of goals scored by the home and the away team in the gth match of the season (g = 1, . . . , G) as y g1 and y g2 , respectively.
The observed goal counts y g1 and y g2 are assumed to be realizations of conditionally independent random variables Y g1 and Y g2 , respectively, with The parameters θ g1 and θ g2 represent the scoring intensity in the g-th match for the home and away team, respectively.
We assume that θ g1 and θ g2 are specified through the regression structures The indices h g and a g determine the home and away team for match g respectively, with h g , a g ∈ {1, . . . , T }. The parameters β 1 , . . . , β p represent the effects corresponding to the observed match-and team-specific features z gj1 , . . . , z gjp , respectively, collected in a G × 2p matrix Z. The other effects in the linear predictor η gj reflect assumptions of exchangeability across the teams involved in the matches. Specifically, α t and ξ t represent the latent attacking and defensive ability of team t and are assumed to be distributed as We used vague log-Gamma priors on the precision parameters τ α = 1/σ 2 α and τ ξ = 1/σ 2 ξ . In order to account for the time dynamics across the different seasons, we also include the latent interactions γ ts and δ ts between the team-specific attacking and defensive strengths and the season s ∈ {1, . . . , S}, which were modeled using autoregressive specifications with For the specification of prior distributions for the hyperparameters ρ γ , ρ δ , σ we used the default settings of the R-INLA package (Lindgren and Rue, 2015, version 17.6.20), which we also use to fit the model (see Subsection 4.2). Specifically, R-INLA sets vague Normal priors (centred at 0 with large variance) on suitable transformations (e.g. log) of the hyperparameters with unbounded range.

Estimation
The hierarchical Poisson log-linear model (HPL) of Subsection 4.1 was fitted using INLA (Rue et al., 2009). Specifically, INLA avoids time-consuming MCMC simulations by numerically approximating the posterior densities for the parameters of latent Gaussian models, which constitute a wide class of hierarchical models of the form where Y i is the random variable corresponding to the observed response y i , φ is a set of parameters (which may have a large dimension) and ψ is a set of hyperparameters. The basic principle is to approximate the posterior densities for ψ and φ using a series of nested Normal approximations. The algorithm uses numerical optimization to find the mode of the posterior, while the marginal posterior distributions are computed using numerical integration over the hyperparameters. The posterior densities for the parameters of the HPL model are computed on the available data for each league.
To predict the outcome of a future match, we simulated 1000 samples from the joint approximated predictive distribution of the number of goalsỸ 1 ,Ỹ 2 , scored in the future match by the home and away teams respectively, given featuresz j = (z j1 , . . . ,z j2 ) . Sampling was done using the inla.posterior.sample method of the R-INLA package. The predictive distribution has a probability mass function of the form where the vector ν collects all model parameters. We then compute the relative frequencies of the events Y 1 >Ỹ 2 ,Ỹ 1 =Ỹ 2 , andỸ 1 <Ỹ 2 , which correspond to home win, draw, and loss respectively.

MLS challenge
The MLS challenge consists of predicting the outcomes (win, draw, loss) of 206 soccer matches from 52 leagues that take place between 31st March 2017 and 10th April 2017. The prediction performance of each submission was assessed in terms of the average ranked probability score (see Subsection 5.2) over those matches. To predict the outcomes of these matches, the challenge participants have access to over 200,000 matches up to and including the 21st March 2017, which can be used to train a classifier.
In order to guide the choice of the model that is best suited to make the final predictions, we designed a validation framework that emulates the requirements of the MLS Challenge. We evaluated the models in 3.

Validation criteria
The main predictive criterion we used in the validation framework is the ranked probability score, which is also the criterion that was used to determine the outcome of the challenge. Classification accuracy was also computed.

Ranked probability score
Let R be the number of possible outcomes (e.g. R = 3 in soccer) and p be the R-vector of predicted probabilities with j-th component p j ∈ [0, 1] and p 1 + . . . + p R = 1. Suppose that the observed outcomes are encoded in an R-vector a with j-th component a j ∈ {0, 1} and a 1 + . . . + a r = 1. The ranked probability score is defined as The ranked probability score was introduced by Epstein (1969) (see also, Gneiting and Raftery, 2007, for a general review of scoring rules) and is a strictly proper probabilistic scoring rule, in the sense that the true odds minimize its expected value (Murphy, 1969).

Classification accuracy
Classification accuracy measures how often the classifier makes the correct prediction, i.e. how many times the outcome with the maximum estimated probability of occurence actually occurs. Table 3: Illustration of the calculation of the ranked probability score and classification accuracy on artificial data.
Observed outcome Predicted probabilities Predicted outcome RPS Accuracy Table 3 illustrates the calculations leading to the ranked probability score and classification accuracy for several combinations of p and a. The left-most group of three columns gives the observed outcomes, the next group gives the predicted outcome probabilities, and the third gives the predicted outcomes using maximum probability allocation. The two columns in the right give the ranked probability scores and classification accuracies. As shown, a ranked probability score of zero indicates a perfect prediction (minimum error) and a ranked probability score of one indicates a completely wrong prediction (maximum error).
The ranked probability score and classification accuracy for a particular experiment in the validation framework are computed by averaging over their respective values over the matches in the prediction set. The uncertainty in the estimates from each experiment is quantified using leave-one-match out jackknife (Efron, 1982), as detailed in step 9 of Algorithm 1.

Meta-analysis
The proposed validation framework consists of K = 17 experiments, one for each calendar year in the data. Each experiment results in pairs of observations (s i ,σ i 2 ), where s i is the ranked probability score or classification accuracy from the ith experiment, andσ i 2 is the associated jackknife estimate of its variance (i = 1, . . . , K).
We synthesized the results of the experiments using meta-analysis (DerSimonian and Laird, 1986). Specifically, we make the working assumptions that the summary variancesσ i 2 are estimated well-enough to be considered as known, and that s 1 , . . . , s K are realizations of random variables S 1 , . . . , S K , respectively, which are independent conditionally on independent random effects U 1 , ..., U K , with The parameter α is understood here as the overall ranked probability score or classification accuracy, after accounting for the heterogeneity between the experiments.
The maximum likelihood estimate of the overall ranked probability or classification accuracy is then the weighted averageα fit the model 6: {ōg : g ∈ B} ← predict({xg : g ∈ B}, f (·)) get predictions 7: {vg : g ∈ B} ← criterion({og,ōg : g ∈ B}) 8: where w i = (σ i 2 +τ 2 ) −1 andτ 2 is the maximum likelihood estimate of τ 2 . The estimated standard error for the estimator of the overall scoreα can be computed using the square root of the inverse Fisher information about α, which ends up being ( The assumptions of the random-effects meta-analysis model (independence, normality and fixed variances) are all subject to direct criticism for the validation framework depicted in Figure 3 and the criteria we consider; for example, the training and validation sets defined in the sequence of experiments in Figure 3 are overlapping and ordered in time, so the summaries resulting from the experiment are generally correlated. We proceed under the assumption that these departures are not severe enough to influence inference and conclusions about α.

Implementation
Algorithm 1 is an implementation of the validation framework in pseudo-code. Each model is expected to have a training method which trains the model on data, and a prediction method which returns predicted outcome probabilities for the prediction set. We refer to these methods as train and predict in the pseudo-code.

Results
In this section we compare the predictive performance of the various models we implemented as measured by the validation framework described in Section 5. Table 4 gives the details of each model in terms of features used, the handling of draws (ordinal and Davidson, as in Subsection 3.2), the distribution whose parameters are modeled, and the estimation procedure that has been used.
The sets of features that were used in the LF, TVC, AFD and HPL specifications in Table 4 resulted from ad-hoc experimentation with different combinations of features in the LF specification. All instances of feature 13 refer to the least squares ordinal rank (see Subsection 2.5 of the supplementary material). The features used in the HPL specification in (7) have been chosen prior to fitting to be home and newly promoted (features 1 and 2 in Table 1), the difference in form and points tally (features 4 and 6 in Table 1) between the two teams competing in match g, and season and quarter (features 15 and 16 in Table 1) for the season that match g takes place. Table 4: Description of each model in Section 3 and Section 4 in terms of features used, the handling of draws, the distribution whose parameters are modeled, and the estimation procedure that was used. The suffix (t) indicates features with coefficients varying with matches played (feature 5 in Table 1 (4) Ordinal 1, 6(t), 7(t), 12(t), 13 Multinomial ML AFD (5) Davidson 1, 6(t), 7(t), 12(t), 13 Multinomial MPL HPL (7) 1, 2, 4, 6, 15, 16 Poisson INLA ( †) TVC (4) Ordinal 1, 2, 3, 4, 6(t), 7(t), 11(t) Multinomial ML For each of the models in Table 4, Table 5 presents the ranked probability score and classification accuracy as estimated from the validation framework in Algorithm 1, and as calculated for the matches in the test set for the challenge.
The results in Table 5 are indicative of the good properties of the validation framework of Section 5 in accurately estimating the performance of the classifier on unseen data. Specifically, and excluding the baseline model, the sample correlation between overall ranked probability score and the average ranked probability score from the matches on the test set is 0.973. The classification accuracy seems to be underestimated by the validation framework.
The TVC model that is indicated by † in Table 5 is the model we used to compute the probabilities for our submission to the MLS challenge. Figure 4 shows the estimated time-varying coefficients for the TVC model. The remaining parameter estimates are 0.0410 for the coefficient of form, −0.0001 for the coefficient of days since previous match, and 0.0386 for the coefficient of newly promoted. Of all the features included, only goal difference and point tally last season had coefficients for which we found evidence of difference from zero when accounting for all other parameters in the model (the p-values from individual Wald tests are both less than 0.001).
After the completion of the MLS challenge we explored the potential of new models and achieved even smaller ranked probability scores than the one obtained from the TVC model. In particular, the best performing model is the HPL model in Subsection 4.1 (starred in Table 5), followed by the AFD model which achieves a marginally worse ranked probability score. It should be noted here that the LF models are two simpler models that achieve performance that is close to that of HPL and AFD, without the inclusion of random effects, time-varying coefficients, or any non-parametric specifications.
The direct comparison between the ordinal and Davidson extensions of Bradley-Terry type models indicates that the differences tend to be small, with the Davidson extensions appearing to perform better.
We also tested the performance of HPL in terms of predicting actual scores of matches using the validation framework, comparing to a baseline method that always predicts the average goals scored by home and away teams respectively in the training data it receives. Using root mean square error as an evaluation metric, HPL achieved a score of 1.0011 with estimated standard error 0.0077 compared to the baseline which achieved a score of 1.0331 with estimated standard error 0.0083.  Table 5, which is the model we used to compute the probabilities for our submission to the MLS challenge. Table 5: Ranked probability score and classification accuracy for the models in Table 4, as estimated from the validation framework of Section 5 (standard errors are in parentheses) and from the matches in the test set of the challenge. The model indicated by † is the one we used to compute the probabilities for the submission to the MLS challenge, while the one indicated by * is the one that achieves the lowest estimated ranked probability score. Amongst the Bradley-Terry specifications, the best performing one is AFD, which models strength differences through a semi-parametric specification involving general smooth bivariate functions of features and season time. Similar but lower predictive performance was achieved by the Bradley-Terry specification that models team strength in terms of linear functions of season time. Overall, the inclusion of features delivered better predictive performance than the simpler Bradley-Terry specifications. In effect, information is gained by relaxing the assumption that each team has constant strength over the season and across feature values. The fact that the models with time varying components performed best within the Bradley-Terry class of models indicates that enriching models with time-varying specifications can deliver substantial improvements in the prediction of soccer outcomes.
All models considered in this paper have been evaluated using a novel, context-specific validation frame-work that accounts for the temporal dimension in the data and tests the methods under gradually increasing information for the training. The resulting experiments are then pooled together using meta-analysis in order to account for the differences in the uncertainty of the validation criterion values by weighing them accordingly. The meta analysis model we employed operates under the working assumption of independence between the estimated validation criterion values from each experiment. This is at best a crude assumption in cases like the above where data for training may be shared between experiments. Furthermore, the validation framework was designed to explicitly estimate the performance of each method only for a pre-specified window of time in each league, which we have set close to the window where the MLS challenge submissions were being evaluated. As a result, the conclusions we present are not generalizable beyond the specific time window that was considered. Despite these shortcomings, the results in Table 5 show that the validation framework delivered accurate estimates of the actual predictive performance of each method, as the estimated average predictive performances and the actual performances on the test set (containing matches between 31st March and 10th April, 2017) were very close.
The main focus of this paper is to provide a workflow for predicting soccer outcomes, and to propose various alternative models for the task. Additional feature engineering and selection, and alternative fitting strategies can potentially increase performance and are worth pursuing. For example, ensemble methods aimed at improving predictive accuracy like calibration, boosting, bagging, or model averaging (for an overview, see Dietterich, 2000) could be utilized to boost the performance of the classifiers that were trained in this paper.
A challenging aspect of modeling soccer outcomes is devising ways to borrow information across different leagues. The two best performing models (HPL and AFD) are extremes in this respect; HPL is trained on each league separately while AFD is trained on all leagues simultaneously, ignoring the league that teams belong to. Further improvements in predictive quality can potentially be achieved by using a hierarchical model that takes into account which league teams belong to but also allows for sharing of information between leagues.

Supplementary material
The supplementary material document contains two sections. Section 1 provides plots of the number of matches per number of goals scored by the home and away teams, by country, for a variety of arbitrarily chosen countries. These plots provide evidence of a home advantage. Section 2 details approaches for obtaining team rankings (feature 13 in Table 2) based on the outcomes of the matches they played so far.

Acknowledgements and authors' contributions
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
The authors are grateful to Petros Dellaportas, David Firth, István Papp, Ricardo Silva, and Zhaozhi Qian for helpful discussions during the challenge. Alkeos Tsokos, Santhosh Narayanan and Ioannis Kosmidis have defined the various Bradley-Terry specifications, and devised and implemented the corresponding estimation procedures and the validation framework. Gianluca Baio developed the hierarchical Poisson log-linear model and the associated posterior inference procedures. Mihai Cucuringu did extensive work on feature extraction using ranking algorithms. Gavin Whitaker carried out core data wrangling tasks and, along with Franz Király, worked on the initial data exploration and helped with the design of the estimation-prediction pipeline for the validation experiments. Franz Király also contributed to the organisation of the team meetings and communication during the challenge. All authors have discussed and provided feedback on all aspects of the challenge, manuscript preparation and relevant data analyses.

Supplementary material for "Modeling outcomes of soccer matches"
Alkeos Tsokos 1 , Santhosh Narayanan 2 , Ioannis Kosmidis 2,3 , Gianluca Baio 1 , Mihai Cucuringu 3,4 , Gavin Whitaker 1 , and Franz Király 1 Home advantage Figure 1 displays plots of the number of matches per number of goals scored by the home (dark grey) and away teams (light grey), by country, for a variety of arbitrarily chosen countries. The plots demonstrate that the home advantage that appears in terms of goals scored when looking at the data for all countries as a whole, holds when looking at individual countries as well.

Extracting features via ranking from noisy pairwise comparisons
This section details our approach to deriving features based on the ranking of teams obtained from aggregated historical matches between pairs of teams. The main purpose of this section is to provide a high level overview of a number of algorithms for ranking from pairwise comparison data, including several state-ofthe-art approaches for this task, which are summarized in Table 1. Broadly speaking, there are two choices to make: one is the particular ranking algorithm used, and the other is the input matrix whose individual entries serve as a proxy for the rank offset between pairs of teams. In our numerical experiments, we chose to consider five different algorithms, and two different types of input matrices, detailed in the next paragraph. These choices are by no means exhaustive, especially in the latter direction, however they serve as a good basis towards exploring this approach. We summarize in Table 2 the numerical results obtained from each of the algorithms considered in this section, when the resulting features were used on their own in the LF Bradley-Terry formulation (see Section 3.1 in the main text). All algorithms take as input a skew-symmetric matrix M , whose entries are interpreted as a proxy for the pairwise rank offset between pairs of teams. To each ranking algorithm, we append the suffix "-card" when the entries in the pairwise comparison matrix M are based on the average goal differential computed over the considered historical window, which in our case was chosen to be the previous three seasons and the current season (if a pair of teams did play a game this season), using equal weights. For example, if a pair of teams played a total of four matches in the previous three seasons, and one match in the current season thus far, then M card ij holds the aggregated goal differential over the five matches. We append the suffix "-ord" if only ordinal information is available, i.e. if the input to the ranking algorithms is the matrix M ord ij = sign(M ord ij ), that captures, for a pair of teams, which team scored more goals on aggregate over the previous direct matches. We denoted the resulting time dependent comparison matrices by M card and M ord , respectively, and remark that there are numerous other options for building such matrices. For example, instead of relying on the goal differentials over the previous three seasons and the current season, one could • pool data only from the previous season and the current one, • aggregate historical data from the past k seasons, where the weights of the matches decay harmonically with time, • take into account only counts of the number of wins and losses, as opposed to the actual goal differentials.
Due to time considerations, we have not explored all of these possibilities in our simulations. We build on the recent work of Cucuringu (2016), which considers the classical problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis, including analysis of sports data. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years in areas such as computer vision and graphics, sensor network localization and structural biology. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure analogous to the approximation algorithms of the popular MAX-CUT problem. As an example, one of the noise models we considered in Cucuringu (2016) is an Erdős-Rényi Outliers model, abbreviated by ERO(n, p, η), where the available measurements are given by the following mixture where r i denotes the unknown ground truth rank of team i, n the number of teams, p the probability that a pair of teams play a match against each other, and η is the noise level. The remainder of this section details our synchronization-based ranking algorithm, as well as a number of other state-of-the art methods from the literature. We briefly summarize the Serial-Rank algorithm recently introduced in Fogel et al. (2014), which performs spectral ranking via seriation, and was shown to compare favorably to other classical ranking methods. We also discuss the Rank-Centrality algorithm proposed by Negahban et al. (2012), in the context of rank aggregation. Finally, we consider two other approaches for obtaining a global ranking based on Singular Value Decomposition (SVD) and the popular method of Least Squares (LS). For ease of reference, we summarize in Table 1   2.1 Sync-Rank: Robust ranking via eigenvector and SDP synchronization Cucuringu (2016) considered the problem of ranking with noisy incomplete information and made an explicit connection with the angular synchronization problem, for which spectral and SDP relaxations already exist in the literature with provable guarantees. This approach leads to a computationally efficient (as is the case for the spectral relaxation), non-iterative algorithm that is model independent and relies exclusively on the available data.

The group synchronization problem
Finding group elements from noisy measurements of their ratios is known as the group synchronization problem. It can be applied in settings where the underlying problem exhibits a group structure and one has observed noisy measurements of ratios of group elements. For example, the synchronization problem over the special orthogonal group SO(d) consists of estimating a set of n unknown d × d matrices R 1 , . . . , R n ∈ SO(d) from noisy measurements R ij of a subset of their pairwise ratios R −1 i R j . The least squares solution to synchronization aims to minimize the sum of squared deviations minimize R1,...,Rn∈SO(d) where || · || F denotes the Frobenius norm, and w ij are non-negative weights representing the confidence in the noisy pairwise measurements R ij . Singer (2011) proposed spectral and semidefinite programming (SDP) relaxations for solving an instance of the above synchronization problem in the context of angular synchronization, over the group SO(2) of planar rotations, where the goal is to estimate n unknown angles θ 1 , . . . , θ n ∈ [0, 2π), given m noisy measurements Θ ij of their offsets The challenges stem from the amount of noise in the offset measurements, and from the fact that m n 2 , i.e. only a very small subset of all possible pairwise offsets are measured. In general, one may consider other groups G (such as SO(d), O(d)) for which there are available noisy measurements g ij of ratios between the group elements g ij = g i g −1 j , g i , g j ∈ G. The set E of pairs (i, j) for which a ratio of group elements is available can be realized as the edge set of a graph G = (V, E), |V | = n, |E| = m, with vertices corresponding to the group elements g 1 , . . . , g n , and edges to the available pairwise measurements g ij = g i g −1 j . For the case of angular synchronization (when G = SO(2)), we start by building the n × n sparse Hermitian matrix H = (H ij ) whose elements are either zero or points that lie on the unit circle in the complex plane In order to preserve the angle offsets as best as possible, one aims to solve the optimization problem maximize θ1,...,θn∈[0,2π) n i,j=1 which is incremented by +1 whenever an assignment of angles θ i and θ j perfectly satisfies the given edge constraint Θ ij = θ i − θ j mod 2π (i.e. for a good edge). The contribution of an incorrect assignment (i.e. of a bad edge) is uniformly distributed on the unit circle in the complex plane.
Since (3) is a non-convex and computationally difficult optimization problem (Zhang and Huang, 2006), an alternative is to consider the spectral relaxation maximize z1,...,zn∈C; by replacing the individual constraints z i = e ιθi having unit magnitude by the much weaker single constraint n i=1 |z i | 2 = n. The resulting maximization problem in (4) amounts to maximizing a quadratic form whose solution is known to be given by the top eigenvector of the Hermitian matrix H, which has an orthonormal basis over C n , with real eigenvalues λ 1 ≥ λ 2 ≥ . . . ≥ λ n and corresponding eigenvectors v 1 , v 2 , . . . , v n . In other words, the spectral relaxation of the non-convex optimization problem in (3) is given by maximize ||z|| 2 =n z * Hz, which can be solved via a simple eigenvector computation, by setting z = v 1 , where v 1 is the top eigenvector of H, satisfying Hv 1 = λ 1 v 1 , with ||v 1 || 2 = n, corresponding to the largest eigenvalue λ 1 . As a final step, prior to extracting the final estimated angles, we normalize H by using the diagonal matrix D, whose diagonal elements are given by D ii = N j=1 |H ij |, and define which is similar to the Hermitian matrix D −1/2 HD −1/2 , since Thus, H has n real eigenvalues λ H 1 > λ H 2 ≥ · · · ≥ λ H n with corresponding n orthogonal (complex valued) We define the estimated rotation anglesθ 1 , ...,θ n using the top We remark that the estimation of the rotation angles θ 1 , . . . , θ n is unique up to an additive phase since e iα v H 1 is also an eigenvector of H for any α ∈ R. This motivates the final post-processing step in our proposed algorithm (Algorithm 1), in which we remove the best circular permutation, as depicted in Figure 2.

Ranking via angular synchronization
Let us denote the true ranking of the n teams by r 1 < r 2 < . . . < r n , and assume without loss of generality that r i = i, i.e. the rank of the i th team is i. In the absence of noise, the ranks can be imagined to lie on a one-dimensional line, sorted from 1 to n, with the pairwise rank comparisons given, in the noiseless case, by C ij = r i − r j (for cardinal measurements) or C ij = sign(r i − r j ) (for ordinal measurements). In the angular embedding space, we consider the ranks of the teams mapped to the unit circle, say fixing r 1 to have a zero angle with the x-axis, and the last team r n corresponding to an angle equal to π. In other words, we imagine the n team wrapped around a fraction of the circle, interpret the available rank-offset measurements as angle-offsets in the angular space, and thus arrive at the setup of the angular synchronization problem previously described.
The modulus used to wrap the teams around the circle plays an important role in the recovery process. Choosing to map the teams across the entire circle would cause ambiguity at the end points, since the very highly ranked teams would be positioned very close to (or perhaps even mixed with) the very poorly ranked teams. To avoid this issue, we simply choose to map the n teams to the upper half of the unit circle [0, π]. This mapping is, in essence, our approach to making the problem of dealing with a non-compact group, such as the real line, amenable to a group synchronization approach. This way, the line is "compactified" by simply mapping it to the unit circle (or part of it), making the approach amenable to synchronization methods and less sensitive to outliers.
As previously discussed, since the solution to the angular synchronization problem is computed up to a global shift, one needs to perform an additional post-processing step to accurately extract the ordering of the teams that best matches the given data. For example, as shown in the right plot of Figure 2, Chelsea, Tottenham and Manchester City would lie at the bottom of the ranking (right after Sunderland), while in fact they were the highest ranked teams. To this end, we mod out the best circular permutation of the initial rankings obtained from synchronization, that minimizes the number of upsets in the given data. To measure the accuracy of each candidate circular permutation σ, we first compute the pairwise rank offsets associated with the induced ranking via where ⊗ denotes the outer product of two vectors x ⊗ y = xy T , • denotes the Hadamard product of two matrices (entrywise product), and A is the adjacency matrix of the measurement graph G. We summarize in Algorithm 1 the main steps of the Sync-Rank algorithm. Finally, we remark that, throughout the numerical experiments, we only relied on the spectral relaxation, and did not experiment with the semidefinite-programming relaxation. Our current implementation in CVX is computationally costly for large data sets, however the SDP program could be solved efficiently via a Burer and Monteiro (2003) approach, whose surprisingly good empirical performance has only recently been understood theoretically (Boumal et al., 2016). Fogel et al. (2014) proposed a seriation algorithm for ranking a set of teams given noisy incomplete pairwise comparisons. Their approach starts by assigning similar rankings to teams that compare similarly with all other teams. The intuition is that teams that beat the same teams and are beaten by the same teams should have a similar ranking in the final solution. They do so by constructing a similarity matrix from the available pairwise comparisons, relying on existing seriation methods to reorder the similarity matrix and thus recover the final rankings.

Serial rank and generalized linear models
They make an explicit connection with another related classical ordering problem, namely seriation, where one is given a similarity matrix between a set of n items under the assumption that the items have an underlying ordering on the line, such that the similarity between items decreases with their distance. A spectral algorithm that exactly solves the noiseless seriation problem was proposed by Atkins et al. (1998), derived from the observation that, for a given similarity matrix computed from such serial variables, the ordering induced by the second eigenvector of the associated Laplacian matrix (i.e. the Fiedler vector) matches that of the variables. Fogel et al. (2014) adapted the above seriation procedure to the ranking problem, and Algorithm 1 Summary of the Synchronization-Ranking (Sync-Rank) Algorithm. The gist of the approach can be described as: (1) make the ansatz that the teams are embedded on the upper half of the unit circle, (2) map the resulting goal differentials between pairs of teams to an angle offset in [0, π), (3) solve the resulting angular synchronization problem (via the spectral relaxation) that amounts to finding an assignment of angles that best match the given angle offsets (4) mod out the best circular permutation by choosing the ordering that minimizes the number of upsets.
2: r ← Output as a final solution the ranking induced by the circular permutation σ. The recovered solution at some random rotation, motivating the step that computes the best circular permutation of the recovered rankings, chosen to minimize the number of upsets with respect to the initially given pairwise measurements.
proposed an efficient polynomial-time algorithm with provable recovery and robustness guarantees, which under certain conditions, is able to perfectly recover the underlying true ranking, even when a fraction of the comparisons are either corrupted by noise or completely missing.
In the case of ordinal measurements (only win-lose information), the proposed similarity measure counts the number of matching comparisons. For a given skew symmetric matrix C of size n × n of pairwise comparisons C ij = {−1, 0, 1} (denoting lose, tie or a win), with C ij = −C ji , given by the following model Assuming the diagonal of C is set to C ii = 1, ∀i = 1, 2, . . . , n, the similarity matrix takes the form where C ik C jk = 1 whenever i and j have the same signs, and C ik C jk = −1 whenever they have opposite signs. In other words, the similarity S match ij counts the number of matching comparisons between i and j with a third reference team k. Written in a compact form, the final similarity matrix is given by The final ranking is the one induced by the Fiedler vector of S. The main steps of Serial-Rank algorithm are summarized in Algorithm 2. Fogel et al. (2014) also consider a generalized linear model setting, where one assumes that the paired comparisons are generated according to a generalized linear model, are independent, and team i defeated team j with probability P ij = H(ν i − ν j ), where ν ∈ R n is a vector denoting the strength, rank, or skill level of the n teams. They propose the following similarity matrix where m i,k = 1 if i and j played in a match, and 0 otherwise. The matrix Q of corresponding empirical probabilities is given by the following mixture Here, m ij denotes the number of times teams i and j played against each others, and C s i,j ∈ {−1, 1} is the result of match s. We denote by SER-GLM the Serial-Rank algorithm based on the above GLM model given by (8).
Algorithm 2 Serial-Rank: spectral ranking via seriation (Fogel et al., 2014) Input: A set of pairwise comparisons C ij ∈ {−1, 0, 1} or [-1,1] Output: Final ranking r Compute a similarity matrix as shown in (7). Compute the associated graph Laplacian matrix where D is a diagonal matrix D = diag (S1), i.e. D ii = n j=1 G i,j is the degree of node i in the measurement graph G. Compute the Fiedler vector of S (eigenvector corresponding to the smallest nonzero eigenvalue of L S ). Output the ranking induced by sorting the Fiedler vector of S, with the global ordering (increasing or decreasing order) chosen such that the number of upsets is minimized.

The rank-centrality algorithm
The third ranking algorithm we consider is Rank-Centrality, introduced by Negahban et al. (2012), and is an iterative algorithm proposed for the rank aggregation problem of integrating ranking information from multiple ranking systems, by estimating scores for the items from the stationary distribution of a certain random walk on the graph of items, where each edge encodes the outcome of pairwise comparisons.
For a pair of teams i and j, we let Y (l) ij be equal to 1 if team j beats team i, and 0 otherwise, during the l th match between the two teams, for l = 1, . . . , k.
assumes that P(Y (l) ij ) = wj wi+wj , where w represent the underlying vector of positive real weights associated to each player.
Motivated by the Bradley-Terry model (see Section 3 in the main text), Negahban et al. (2012) start by estimating the fraction of times team i has defeated team j, and denote this by as long as teams i and j competed in at least one match, and 0 otherwise, where y ijl = 1 if team i beat team j in the l th encounter between the two teams, and 0 otherwise. As a next step, they build the symmetric matrix A ij = a ij a ij + a ji , which converges to πi πj +πi , as k → ∞, where π i represents the 'strength' of team i. To define a valid transition probability matrix, one scales all the edge weights by 1/d max and considers the resulting random walk where d max denotes the maximum out-degree of a node, such that each row of P sums to 1. We recover the final rankings by sorting the entries in the corresponding stationary distribution, given by the top left eigenvector of P .

Ranking via singular value decomposition
The fourth ranking algorithm we rely on is based on the traditional Singular Value Decomposition (SVD), and was considered in Cucuringu (2016). What makes SVD applicable in this setting is the observation that, in the case of cardinal measurements C ij = r i − r j , the noiseless matrix of rank offsets C is a skew-symmetric matrix of even rank 2 since R = re T − er T , where e denotes the all-ones column vector. For a noisy problem, C is a random perturbation of a rank-2 matrix, which motivates us to consider its top two singular vectors, order their entries by their size, extract the resulting rankings, and choose between the first and second singular vector based on whichever one minimizes the number of upsets. Note that since the singular vectors are obtained up to a global sign, we choose the ordering which minimizes the number of upsets. For the Erdős-Rényi Outliers ERO(n, p, η) model (1), the following decomposition could render the SVD-Rank method amenable to a theoretical analysis using tools from the matrix perturbation and random matrix theory literature on rank-2 deformations of random matrices. The expected value of the entries of C is given by thus EC is a rank-2 skew-symmetric matrix EC = (1 − η)p(re T − er T ).
The decomposition of the given data matrix C into where R = C − EC is a random skew-symmetric matrix whose elements have zero mean makes this approach amenable to a robustness analysis using tools from random matrix theory, in particular low-rank perturbations of large random matrices.

Ranking via least squares
Finally, we also recover rankings via a more traditional least-squares approach. Assuming the number of edges in G is given by m = |E(G)|, we denote by B the edge-vertex incidence matrix of size m × n whose entries are given by and i < j 0 if (i, j) / ∈ E(G) , and by y the vector of length m × 1 which contains the pairwise rank measurements y(e) = C ij , for all edges e = (i, j) ∈ E(G). The least-squares solution to the ranking problem is obtained by solving minimize x∈R n ||Bx − y|| 2 2 , where L is an array of size m × n, m is the number of edges, L(e, i) = 1 and L(e, j) = −1 whenever edge number e connects nodes i and j, and b(e) = W (i, j) holds the rank offset.