The Open International Soccer Database comprises 216,743 entries, each describing the most commonly available and consistently reported data about the outcome of a league soccer match: the goals scored by each team, the teams involved, and the league, season, and date on which the match was played. The beauty of this type of soccer data is that it is readily available for most soccer leagues worldwide, including lower leagues. Thus, an important research question is to determine the limits of predictability for this type of data. To find this out, we invited the machine learning community to develop predictive models based on version v1.0 of the Database.
The 2017 Soccer Prediction Challenge was part of the special issue on Machine Learning for Soccer in the Machine Learning journal. The Challenge description was published together with the call for papers for this special issue on 17 January 2017 (see supplementary material on the Challenge website). Figure 2 shows the overall time frame of the Challenge. The participants contacted us by email to express their interest in the Challenge and then received a web link to download the data.
In data mining competitions, the outcomes of the prediction set are commonly known to the competition organizers. We wanted to create a real-world prediction challenge where the outcomes referred to real, future events that, at the submission deadline, could not be known by anyone. To organize such a “real” prediction problem, we structured the Challenge around two key dates: 22/03/2017 and 30/03/2017 (Fig. 2). The final version of the Challenge learning set and the prediction set were made available on 22/03/2017. The learning set consists of data from 216,743 matches played on or before 22/03/2017, and the 206 prediction set matches were played after 30/03/2017. The participants’ task was to produce their final model in the time window between the two dates and submit their predictions for the prediction set by midnight CET on 30/03/2017. This particular time frame and deadline were chosen because in many leagues, regular play was suspended due to the World Cup 2018 qualifier games. Thus, there was a time window of about one week in which participants could develop their final models and apply them to the prediction set. Some lower leagues in various countries did not suspend play during this period; therefore, no games from these leagues were used in the prediction set.
The final Challenge learning set is identical to v1.0 of the Open International Soccer Database presented in this article. In the remainder of this text we will use the term (final) Challenge learning set instead of Database.
Challenge data sets
We released the Challenge learning set to the participants in two instalments. The first instalment comprised data of 205,182 games (the most recent entries were matches played on 20/11/2016) and was released together with the public announcement of the Challenge. The participants could use this initial version of the learning set to gain an understanding of the data, try out various models, see what works and what does not, and so on. Then, on 22 March 2017, eight days before the submission deadline, the participants received the updated, final version of the learning set, together with the prediction set. The final learning set contains the results of 216,743 matches (the most recent entries were matches played on 22/03/2017); it is identical to the Database presented above. Updating the learning set from its initial version was necessary so that, for each team in the prediction set, the participants would have that team’s match time series right up to the last match played before its prediction set match.
The prediction set covers two seasons, 2016/17 and 2017/18, because some of the leagues involved started in 2016 and others in 2017. In total, there are 206 games for which the participants were asked to make a prediction. The prediction set has the same fields as the learning set, plus additional “x-fields.” The meaning of these fields is as follows:
1. xW, xD, xL: Predicted home win, draw, and away win (loss), expressed as a real number from the unit interval [0, 1], such that \(xW+xD+xL=1\). These are the fields that refer to the mandatory task of the prediction Challenge. For example, the prediction \(xW=0.7\), \(xD=0.2\), and \(xL=0.1\) means that the model “thinks” that the probability of a home win is 0.7, the probability of a draw is 0.2, and the probability of an away win is 0.1.

2. xHS, xAS: Predicted goals scored by the home and away team, respectively, expressed as a non-negative real number. This was an optional task of the Challenge, which did not count towards the ranking of the submitted predictions.

3. xGD: Predicted goal difference, expressed as a real number. This was another optional task of the Challenge. Note that the goal difference may not necessarily be the difference between xHS and xAS because a model might compute the goal difference without explicitly calculating the actual number of goals scored by each team.

4. xID: A unique identifier of the match in the prediction set.
The field Sea in the prediction set is set to Run for all matches, indicating that the season (either 2016/17 or 2017/18) was still in progress at the time the match entry was made. Table 6 illustrates the structure of the prediction set as it was provided to the Challenge participants. The table shows the first ten matches in the prediction set, with the mandatory prediction columns highlighted. The default for the unknown values of HS, AS, xW, xD, xL, xHS, xAS, and xGD was chosen arbitrarily and set to \(-1\); a hypothetical example of such an entry is sketched below.
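For illustration, the following minimal Python sketch shows what a single prediction set entry might look like before and after a participant fills in the x-fields. The field names Lge, Date, HT, and AT, as well as the league code, date, and team names, are assumptions made only for this example; they are not taken from the actual prediction set.

# Hypothetical prediction set entry as distributed (values of -1 mark unknowns);
# the league code, date, and team names are invented for illustration.
row = {
    "Sea": "Run", "Lge": "GER1", "Date": "01/04/2017",
    "HT": "Team A", "AT": "Team B",
    "HS": -1, "AS": -1,               # actual goals, unknown at submission time
    "xW": -1, "xD": -1, "xL": -1,     # mandatory: predicted outcome probabilities
    "xHS": -1, "xAS": -1, "xGD": -1,  # optional: predicted goals and goal difference
    "xID": 1,                         # unique match identifier
}

# A submission fills in the x-fields; the three probabilities must sum to 1.
row.update({"xW": 0.55, "xD": 0.25, "xL": 0.20, "xHS": 1.6, "xAS": 1.1, "xGD": 0.5})
assert abs(row["xW"] + row["xD"] + row["xL"] - 1.0) < 1e-9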
Table 6 Excerpt of the prediction set showing the first ten matches (grouped by league)

In order to facilitate a realistic and hard prediction challenge, the matches in the prediction set had to be carefully selected. First and foremost, we required that at the time the submissions were due (midnight, 30/03/2017 CET), the actual outcomes could not be known to anyone (including us, the organizers). Thus, only matches played after the submission deadline could be used for the prediction set. Second, since several of the leagues appearing in the learning set were not in progress at the submission deadline, we could not include games from these leagues. Third, as explained in Sect. 1, matches from leagues that did not suspend regular league play in the period from 22/03/2017 to 30/03/2017 could not be included. For example, a full match day was played in the ENG3 league on 25/03/2017 and 26/03/2017. Fourth, each team in the prediction set had to appear only once; otherwise, the participants would have to predict the outcomes of two or more matches involving the same team. Thus, only 28 of the 52 leagues from the learning set could be used to select a total of 206 matches for the prediction set.
Note that we originally planned to include 223 matches in the prediction set. However, during the period in which the prediction set matches were being played, it turned out that some matches could not take place or were rescheduled. Hence, we contacted all Challenge participants and informed them that these matches had to be excluded due to unforeseeable circumstances. Also, because of rescheduling, some actual match dates changed slightly. Thus, the dates on which the prediction matches were actually played ranged from 31/03/2017 to 11/04/2017.
Table 7 shows the basic statistics of the actual outcomes and scores of the prediction matches.
Table 7 Summary statistics of the prediction set matches grouped by league

The prediction set with the outcomes for the 206 matches is provided as supplementary material at the Challenge website (Berrar et al. 2017a). The data reflect the actual (observed) outcomes of the games.
Version v1.0 of the Open International Soccer Database and the learning set of the 2017 Soccer Prediction Challenge are identical. The prediction set is unique to the Challenge and not covered by v1.0 of the Database. However, as we will continue to add matches to the Database, the matches of the Challenge prediction set will be subsumed in future versions of the Database.
Performance evaluation
The task of the 2017 Soccer Prediction Challenge was to construct a model that predicts the outcomes of future soccer games based on data describing past games. We were interested in comparing the predicted probabilities for home win, draw, and away win (loss) with the actual outcomes. The commonly used Brier score (Brier 1950), however, is not appropriate in this case because it measures only the squared deviation between the predicted and observed outcome probabilities and ignores the ordering of the three outcomes. For example, suppose that the actual outcome is a win of the home team, which is encoded as the vector (1, 0, 0). A model \(M_1\) predicts (0.6, 0.3, 0.1), whereas another model \(M_2\) predicts (0.6, 0.1, 0.3). The Brier score is the same for both models, but clearly, \(M_1\) made the better prediction because it assigned a higher probability to a draw than to a loss, that is, its probability mass is shifted towards the actual outcome (a home win).
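A quick numerical check of this argument, using the multiclass Brier score in its sum-of-squared-differences form (a minimal sketch, not the Challenge evaluation code):

# Multiclass Brier score: sum of squared differences between predicted
# and observed outcome probabilities.
def brier(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual))

actual = (1, 0, 0)                     # observed: home win
print(brier((0.6, 0.3, 0.1), actual))  # M1: ≈0.26
print(brier((0.6, 0.1, 0.3), actual))  # M2: ≈0.26, identical to M1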
To account for the intrinsic order in the three outcomes (win, draw, and loss), we used the ranked probability score (RPS) (Epstein 1969; Constantinou et al. 2012), which is defined in Eq. (1),
$$\begin{aligned} \mathrm {RPS} = \frac{1}{r-1} \sum _{i=1}^{r-1} \left( \sum _{j=1}^i (p_j - a_j)\right) ^2, \end{aligned}$$
(1)
where r refers to the number of possible outcomes (here, \(r = 3\) for home win, draw, and loss). Let \(\mathbf p = (p_1, p_2, p_3)\) denote the vector of predicted probabilities for win (\(p_1\)), draw (\(p_2\)), and loss (\(p_3\)), with \(p_1 + p_2 + p_3 = 1\). Let \(\mathbf a = (a_1, a_2, a_3)\) denote the vector of the real, observed outcomes for win, draw, and loss, with \(a_1 + a_2 + a_3 = 1\). For example, if the real outcome is a win for the home team, then \(\mathbf a = (1, 0, 0)\). A rather good prediction would be \(\mathbf p = (0.8, 0.15, 0.05)\). The smaller the RPS, the better the prediction.
The RPS value is always within the unit interval [0, 1]. An RPS of 0 indicates perfect prediction, whereas an RPS of 1 expresses a completely wrong prediction. For example, assume that the actual, observed outcome of a soccer match was a win by the home team, coded as \(A = (1, 0, 0)\). Let’s further assume two predictions for that match: (1) a “crisp” draw prediction, B, encoded as \(B = (0, 1, 0)\), and (2) a probabilistic prediction, C, with a home win trend, encoded as \(C = (0.75, 0.20, 0.05)\). Then, by applying Eq. (1), we obtain a ranked probability score of \(\mathrm {RPS} = 0.500\) for prediction B and \(\mathrm {RPS} = 0.0325\) for prediction C. So, according to the RPS, the prediction C is better than B, which is also intuitively plausible. Or consider the prediction \(D = (0.10, 0.80, 0.10)\), which leads to \(\mathrm {RPS} = 0.410\). This prediction is better than B but not as good as C.
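As an illustration, a minimal Python implementation of Eq. (1) (a sketch, not the official evaluation script) reproduces the scores for the predictions B, C, and D above:

def rps(pred, actual):
    # Ranked probability score, Eq. (1): mean squared difference between the
    # cumulative predicted and cumulative observed outcome probabilities.
    r = len(pred)
    cum_p, cum_a, total = 0.0, 0.0, 0.0
    for j in range(r - 1):
        cum_p += pred[j]
        cum_a += actual[j]
        total += (cum_p - cum_a) ** 2
    return total / (r - 1)

A = (1, 0, 0)                       # observed: home win
print(rps((0, 1, 0), A))            # B: 0.5
print(rps((0.75, 0.20, 0.05), A))   # C: ≈0.0325
print(rps((0.10, 0.80, 0.10), A))   # D: ≈0.41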
The goal of the Challenge was to minimize the average ranked probability score over all \(n = 206\) matches in the Challenge prediction set,
$$\begin{aligned} \mathrm {RPS}_{\mathrm {avg}} = \frac{1}{n}\sum _{i=1}^n \mathrm {RPS}_i. \end{aligned}$$
(2)
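Using the rps function sketched above, Eq. (2) is simply the mean RPS over all submitted predictions; the two matches below are invented placeholders, not actual Challenge data:

# Eq. (2): average RPS over the n matches of the prediction set.
predictions = [(0.7, 0.2, 0.1), (0.3, 0.4, 0.3)]  # model output (xW, xD, xL)
outcomes = [(1, 0, 0), (0, 0, 1)]                 # observed results (win, draw, loss)
rps_avg = sum(rps(p, a) for p, a in zip(predictions, outcomes)) / len(outcomes)
print(rps_avg)                                    # ≈0.17 for this toy example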