Soccer is the biggest global sport and a fast-growing, multi-billion dollar industry. Advanced data analytics are being more frequently employed on both the club and national levels to improve performance, equipment, marketing, scouting, etc. Soccer therefore offers interesting challenges for the machine learning community. This special issue solicited articles on all aspects of data analysis and machine learning for soccer.

As part of the special issue, we posed the 2017 Soccer Prediction Challenge that revolved around predicting the outcomes of future soccer matches. This is an interesting task for the general public, researchers, clubs, media, news and advertising companies, and professional odds setters. Soccer outcome prediction has been the subject of research since at least the 1960s (Reep and Benjamin 1968; Hill 1974; Maher 1982; Dixon and Coles 1997; Angelini and Angelis 2017). Various statistical techniques have been used for outcome prediction, including Poisson models (Karlis and Ntzoufras 2003), Bayesian models (Baio and Blangiardo 2010; Rue and Salvesen 2000), rating systems (Hvattum and Arntzen 2010), and more recently also machine learning methods, such as kernel-based relational learning (Van Haaren and Van den Broeck 2011). O’Donoghue et al. (2004) used machine learning and statistical methods to predict the results of the 2002 FIFA World Cup but achieved the best prediction with a simulation on a commercial game console.

The fundamental research question of the 2017 Soccer Prediction Challenge was the following: “To what extent is it possible to predict the outcome of a soccer match, given commonly available match data?” The competition’s task was to use machine learning to predict the outcome of future soccer matches. To do so, the participants received v1.0 of the International Open Soccer Database, which has been under development since 2001. The database contains the match reports of 216,743 past matches from 52 soccer leagues in 35 countries covering the years 2000–2017. Each match report specifies the name of the home and away team, respectively, the goals scored by each team, the date on which the match was played, as well as the corresponding soccer league and season. Such match reports represent the most commonly available data about soccer matches around the world. Thus, models learned from this data can be applied to future soccer matches without requiring special arrangements with commercial entities that collect and sell more sophisticated data about soccer matches.
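As an illustration of how little information such a match report carries, the fields listed above can be sketched as a small record type. This is a minimal sketch, not the schema of the Database: the class name, the field names, the outcome labels, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MatchReport:
    """One match report; all field names are illustrative assumptions."""
    season: str
    league: str
    played_on: date
    home_team: str
    away_team: str
    home_goals: int
    away_goals: int

    @property
    def outcome(self) -> str:
        """Derive the three-way result from the goals scored."""
        if self.home_goals > self.away_goals:
            return "home_win"
        if self.home_goals < self.away_goals:
            return "away_win"
        return "draw"


# Hypothetical example values, not taken from the Database.
example = MatchReport(
    season="16-17",
    league="LEAGUE1",
    played_on=date(2017, 3, 4),
    home_team="Team A",
    away_team="Team B",
    home_goals=2,
    away_goals=1,
)
```

Everything a model can use (form, strength, home advantage) has to be derived from sequences of such records, since the report itself holds no information beyond teams, goals, date, league, and season.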

Using only the provided data, the Challenge participants had to develop machine learning models in order to predict the outcome of 206 future matches that took place after the submission deadline. Thus, when the participants submitted their predictions, the outcomes for these matches were not known to anyone. The goal was to minimize the average ranked probability score (\(\mathrm{RPS}_\mathrm{avg}\)) (Epstein 1969) of the predictions. These conditions highlight two goals of the challenge, which were (1) to gauge the limits of predictability with these commonly available data, and (2) to pose a real-world machine learning challenge with a fixed timeline involving the prediction of real future events. The last point is a key factor that distinguishes the 2017 Soccer Prediction Challenge from other data mining challenges.
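For readers unfamiliar with the ranked probability score, the following sketch uses its standard definition for \(r\) ordered outcome categories, \(\mathrm{RPS} = \frac{1}{r-1}\sum_{i=1}^{r-1}\left(\sum_{j=1}^{i}(p_j - a_j)\right)^2\), where \(p_j\) is the predicted probability of category \(j\) and \(a_j\) is 1 for the observed category and 0 otherwise. The ordering (home win, draw, away win) and the function names are assumptions made for this illustration.

```python
from itertools import accumulate
from typing import Sequence


def ranked_probability_score(probs: Sequence[float], outcome: Sequence[int]) -> float:
    """RPS of one prediction; both vectors share the same ordered categories."""
    if len(probs) != len(outcome) or len(probs) < 2:
        raise ValueError("probs and outcome need the same length (at least 2)")
    cum_p = list(accumulate(probs))
    cum_a = list(accumulate(outcome))
    # The last cumulative difference is always 1 - 1 = 0, so it is skipped.
    squared = [(p - a) ** 2 for p, a in zip(cum_p[:-1], cum_a[:-1])]
    return sum(squared) / (len(probs) - 1)


def average_rps(predictions: Sequence[Sequence[float]],
                outcomes: Sequence[Sequence[int]]) -> float:
    """Mean RPS over a set of matches; lower values indicate better forecasts."""
    scores = [ranked_probability_score(p, a) for p, a in zip(predictions, outcomes)]
    return sum(scores) / len(scores)
```

Because the score is computed on cumulative probabilities, it is sensitive to the ordering of the categories: when the home team wins, a forecast that put its mass on a draw is penalized less than one that put the same mass on an away win.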

Table 1 summarizes the results. Usually, data mining competitions prohibit the organizers from participating. Because this competition involved predicting the outcomes of real future events that were unknown to us, too, we adhered to the same rules and submitted our predictions as Team DBL. Nonetheless, we considered our predictions to be out-of-competition.

We congratulate the winners of the 2017 Soccer Prediction Challenge:

  1. First place: Team OH (Hubáček et al. 2018).

  2. Second place: Team ACC (Constantinou 2018).

  3. Third place: Team FK (Tsokos et al. 2018).

The Database, the 2017 Soccer Prediction Challenge and its results are described in Dubitzky et al.’s article entitled “The Open International Soccer Database for Machine Learning” (Dubitzky et al. 2018). All materials related to the Database and Challenge are publicly available under the CC0 1.0 Universal license through the Open Science Framework project sites.

Table 1 Summary of the results for the 2017 Soccer Prediction Challenge

This special issue features selected papers of the top-performing teams that participated in the Challenge. In total, the special issue received ten submissions from participating teams, and four of these submissions were accepted. Seven further submissions reporting on machine learning methods for soccer were unrelated to the Challenge. Of these seven papers, one was accepted.

This special issue consists of six papers that are briefly discussed as follows. The article “Learning to predict soccer results from relational data with gradient boosted trees” by Hubáček et al. describes the winning approach for the Challenge. Their model is based on manually engineered features and extreme gradient boosted trees.

In “Dolores: A model that predicts football match outcomes from all over the world”, Constantinou presents a model for soccer outcome prediction based on hybrid Bayesian networks and dynamic performance rating that placed second in the Challenge. A comparison with bookmakers’ odds revealed that Dolores could also increase profitability in terms of return on investment, albeit only marginally.

The article “Modeling outcomes of soccer matches” by Tsokos et al. compares various extensions of Bradley–Terry models and a hierarchical log-linear Poisson model for the prediction of soccer outcomes. Their best model achieved third place in the Challenge.

The article titled “Incorporating domain knowledge in machine learning for soccer outcome prediction” by Berrar, Lopes, and Dubitzky presents two new feature engineering methods for match outcome prediction: recency feature extraction and rating feature learning. With the latter method, we constructed a learning set and trained a k-nearest neighbor model, which achieved the best performance among all models submitted to the Challenge. We conclude that the key challenge in soccer prediction lies in domain knowledge integration.

The article “Probabilistic movement models and zones of control” by Brefeld et al. is a submitted paper not directly related to the Challenge. The authors present a probabilistic, data-driven movement model to estimate positions, directions, and velocities of players at observed timestamps. Using their model, the authors derive zones of control, also known as dominant regions. If the ball falls into the zone of control of a player, then this player is most likely to gain control over the ball; consequently, the more space a team controls, the more dominant the team is. A comparison with existing movement models suggests that this model leads to a more realistic estimation of zones of control. This model might give useful insights into game tactics and team performance, not only for soccer but also other, similar team sports.
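To make the notion of a zone of control concrete, the sketch below uses a deliberately simple nearest-player rule: each grid cell of the pitch is assigned to the team whose closest player is nearest. This is not the probabilistic movement model of Brefeld et al.; it ignores direction and velocity entirely. The pitch dimensions, grid resolution, and function name are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def space_control_share(players: Dict[str, List[Point]],
                        length: float = 105.0, width: float = 68.0,
                        nx: int = 42, ny: int = 28) -> Dict[str, float]:
    """Fraction of grid cells whose nearest player belongs to each team."""
    counts = {team: 0 for team in players}
    for i in range(nx):
        for j in range(ny):
            # Centre of the current grid cell in pitch coordinates (metres).
            cx = (i + 0.5) * length / nx
            cy = (j + 0.5) * width / ny
            # Pick the team with the smallest distance from any of its players.
            nearest_team = min(
                players,
                key=lambda team: min(
                    math.hypot(cx - px, cy - py) for px, py in players[team]
                ),
            )
            counts[nearest_team] += 1
    total = nx * ny
    return {team: n / total for team, n in counts.items()}


# Hypothetical positions: one player per team, mirrored about the halfway line.
positions = {"home": [(26.25, 34.0)], "away": [(78.75, 34.0)]}
share = space_control_share(positions)
```

A distance-only rule like this treats a sprinting player and a standing player identically, which is the kind of simplification a movement model that accounts for direction and velocity is meant to avoid.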

Soccer provides many fascinating challenges for machine learning, and we hope that this special issue will spur further research. Particularly interesting new data and challenges are the following:

  • Event streams This type of data annotates specific events that occur in a soccer match (Opta Sports 2018; Wyscout 2018; STATS’ SportVU 2018). The precise number of events, each event’s definition, and the information available about each event vary according to the provider. Typically, there are around 40 different types of events, and each event lists the type of the event, the players who are involved, a timestamp, the start location of the event, and the end location of the event (if applicable). Sometimes additional information may be available, for example, whether a shot was taken with the head or the foot. Typical events include passes, clearances, fouls, shots, cards, and substitutions.

  • Optical tracking A variety of companies, such as ChyronHego (2018), Stats LLC (2018), SciSports (2018), and Second Spectrum (2018) record the locations of the players and the ball at a high frequency using optical tracking systems during matches.

  • Player monitoring Players are often outfitted with sensor systems (STATSports 2018; Catapult 2018) including accelerometers, gyroscopes, heart rate monitors, and GPS during training sessions and games (if permitted). Furthermore, the data generated by these devices may be augmented with additional data, such as fatigue ratings (e.g., the rating of perceived exertion) or general wellness scores (e.g., muscle soreness). Additionally, clubs record and store information from physical tests (e.g., flexibility, speed, and maximum rate of oxygen consumption during exercise).

These types of data are of interest to a variety of different parties. Clubs and national teams are continually trying to exploit these types of data to improve performance, equipment, marketing, scouting, etc. Fans may be interested in analyzing and debating the performances of their favorite teams. Bettors and oddsmakers are interested in how these data can be exploited to turn a profit. This has led to an explosion of interest in data science and analytics, specifically for the following tasks:

  • Evaluating actions One of the most prominent new metrics is known as expected goals (Eastwood 2015; Eggels 2016; Lucey et al. 2015; Ijtsma 2015; Caley 2015). The objective is to quantify the quality of a shot by training a statistical model that predicts the probability of scoring based on the features of the shot (e.g., location and angle to the goal). More recently, there have been attempts to move beyond simply evaluating shots by assigning values to other actions on the pitch, such as passes or even individual movements, based on event streams and/or optical tracking data (Decroos et al. 2018; Spearman 2018; Gyarmati and Stanojevic 2016; Bransen and Van Haaren 2018; Pappalardo et al. 2018). By evaluating all actions, it is possible to derive rankings or overall ratings of players.

  • Identifying tactics and strategy One line of work looks at trajectory data produced by optical tracking to try to understand tactics, such as how play is built up from the back (Knauf et al. 2016), analyzing how effective a team is at creating scoring opportunities (Fernando et al. 2015), or using data-driven ghosting methods to understand how a team should have addressed certain situations (Le et al. 2017). Researchers have also analyzed tactics from event stream data to find commonly occurring sequences of events that lead to attempts on goal (Van Haaren et al. 2015) or identify whether an attempt is likely in the near future (Decroos et al. 2017). Substantial attention has been devoted to understanding passing behaviors, particularly in terms of finding different types of recurrent passing patterns (Gyarmati et al. 2014; Gyarmati and Anguera 2015; Bekkers and Dabadghao 2017). Other tasks include predicting if a pass will succeed (Spearman et al. 2017), classifying different types of passes (Chawla et al. 2017), and predicting whom a player may pass to (Vercruyssen et al. 2016). Finally, researchers have built occupancy maps based on ball movements (Lucey et al. 2013) and attempted to recognize team formations (Bialkowski et al. 2014).

  • Monitoring players’ health Currently, professional soccer clubs monitor training and match load of all players. From a sports science perspective, both the external load and internal load are of interest. Intuitively, the external load captures the level of activity (e.g., amount, intensity, etc.) performed by players, and it is often measured by having players wear sensors (e.g., GPS and accelerometer). The internal load captures the body’s physiological response to the activity, and it is measured by having the players report the rating of perceived exertion. Researchers have explored using machine learning techniques to investigate the relations between these two loads as well as perceived wellness (Rossi et al. 2017; Vandewiele et al. 2017; Jaspers et al. 2018a, b), which could help optimize training routines. Another promising but very challenging task is to build models to assess a player’s risk of a non-contact injury based on physical and testing data collected from the players (Kampakis 2016; Rossi et al. 2018).
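The expected-goals idea mentioned under “Evaluating actions” can be sketched as a logistic model over two shot features, the distance to the goal and the angle subtended by the goalposts. This is a minimal sketch under stated assumptions, not any of the cited models: the coordinate convention, the function names, and especially the coefficients are hypothetical placeholders that have not been fitted to data.

```python
import math
from typing import Tuple

GOAL_WIDTH_M = 7.32  # distance between the posts of a regulation goal, in metres


def shot_features(x: float, y: float) -> Tuple[float, float]:
    """Distance to the goal centre and goal-opening angle (radians).

    Assumed convention: x > 0 is the distance from the goal line and
    y is the lateral offset from the centre of the goal, both in metres.
    """
    distance = math.hypot(x, y)
    half = GOAL_WIDTH_M / 2
    # Angle between the lines from the shot location to the two posts.
    angle = math.atan2(y + half, x) - math.atan2(y - half, x)
    return distance, angle


def expected_goal(x: float, y: float,
                  intercept: float = -1.0,
                  w_distance: float = -0.1,
                  w_angle: float = 1.5) -> float:
    """Scoring probability from a logistic model.

    The default coefficients are illustrative placeholders only; in practice
    they would be estimated from a large set of labelled shots.
    """
    distance, angle = shot_features(x, y)
    z = intercept + w_distance * distance + w_angle * angle
    return 1.0 / (1.0 + math.exp(-z))
```

With coefficients of these signs, a central shot from close range receives a higher value than one from long range or from a tight angle. Real models typically add further features available in event streams, such as the body part used.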

We would like to thank everyone who was involved in this special issue, particularly all contributing authors and the reviewers. We also thank the editorial and publishing staff at Springer for their support. Special thanks also go to Peter A. Flach, Editor-in-Chief of Machine Learning, and Dragos D. Margineantu, Editor for Special Issues.