Data Science: Present and Future Analysis and visualization of team performances of football games

It is well known that the number of goals in a football game is well approximated by a Poisson distribution. Therefore, the intensity parameters of the Poisson distributions followed by the numbers of goals gained and lost by a team can be regarded as indices of the team's offensive and defensive performance. Teams belonging to a professional football league play many games throughout a season, and their performances should not be assumed to be constant over the season. A Poisson regression model with varying coefficients is proposed to analyze and visualize the time-varying performance indices of football teams. The model is also applicable to estimating performances up to the middle of a season, and the estimated performances can be used to predict the outcome probabilities of future games. These methods are demonstrated with data from the Japanese professional football league.


Introduction
It is well known that the number of goals scored by a team in a football game is well approximated by a Poisson distribution (e.g., Chu, 2003). On this basis, many statistical models have been discussed which assume that the number of goals gained is generated from a Poisson distribution with a particular intensity parameter (Izumi & Konaka, 2016; Saraiva et al., 2016; Koopman & Lit, 2015). The intensity parameter of the Poisson distribution can be considered as the index of team performance. Izumi and Konaka (2016) discussed the two-stage championship held in the 2015 and 2016 seasons in the Japan Professional Football League (J.League) Division 1 and proposed several probability models that determine the number of goals. Saraiva et al. (2016) proposed a model in which the performances of teams and the effect of home advantage determine the number of goals, and applied it to the professional football leagues in England and Brazil. However, both studies assumed that teams' performances never vary through a season. Koopman and Lit (2015) proposed a model in which the intensity parameter of the next game is determined stochastically from that of the previous game, as in a Markov model. They treated time as discrete.
Author contact: Shizue Izumi (shizue-izumi@biwako.shiga-u.ac.jp), Tsuneshi Obata (t-obata@oita-u.ac.jp)
The varying coefficient models proposed by Hastie and Tibshirani (1993) have been applied in fields such as medical science and social science. For example, Izumi et al. (2015, 2017) proposed a method to visualize the transition of the effects of binary covariates for count data and demonstrated it with the Peace Declaration of Hiroshima and Nagasaki. As seen in this literature, varying coefficient models can be applied to estimate effects that depend on time or spatial position. Izumi and Obata (2018) discussed introducing the varying coefficient model into the analysis of football data. Building on their discussion, this paper considers the continuous time-varying transition of team performance by introducing the varying coefficient model. The model can be used to smooth count data, taking the observation time as the x-axis and the observed number of goals gained or lost as the y-axis, so as to extract a continuous performance curve. We propose visualizing the estimated transition of team performance obtained from semi-parametric regression with linear spline basis functions.
We can also use our model for the prediction of game results. In the middle of the season, our model can estimate the performance transition function up to that time. Based on this estimated performance function, both teams' offensive and defensive performances in the next game can be predicted. The predicted performances enable us to evaluate the probabilities of the number of goals and of the game outcome. Obata and Izumi (2018) considered using the estimated performances to predict game outcomes. This paper proposes a method to predict the outcome probabilities of football games. The paper is organized as follows. Section 2 reviews the varying coefficient model. In Sect. 3, we propose the model to estimate the time transition of team performances through a season and show an application to the data of J.League Division 2 in the 2018 season. In Sect. 4, our model is applied to estimate the performances up to the middle of the season, and we propose a method for predicting the probabilities of the outcomes of the succeeding game using these estimated performances. The overall appropriateness of the predictions is also discussed. Section 5 concludes this paper.

Varying coefficient model
The observations $y(t_1), \ldots, y(t_n)$ can be considered as observed values of a time-varying random variable $Y(t)$ following a Poisson distribution $\mathrm{Po}(\lambda(t))$, where the intensity parameter $\lambda(t)$ varies with time $t$. If the responses depend on $p$ time-varying covariates $a_1(t), \ldots, a_p(t)$, the Poisson regression model can be written as

$$\log \lambda(t) = \sum_{j=1}^{p} \beta_j(t)\, a_j(t), \tag{1}$$

where the $\beta_j(t)$ are time-varying effects, called varying coefficients. The varying coefficient model was originally proposed by Hastie and Tibshirani (1993). If the number of observation times $n$ is sufficiently large, linear spline functions with $r$ $(< n - 2)$ knots $\kappa_1, \ldots, \kappa_r$ can be used as the varying coefficients $\beta_j(t)$. A linear spline function is a continuous polyline, and its basis is represented as

$$B(t) = \bigl(1,\ t,\ (t - \kappa_1)_+,\ \ldots,\ (t - \kappa_r)_+\bigr)^{\top}, \tag{2}$$

where $(x)_+ = \max(0, x)$.
The positions of the knots are set so that each interval includes a sufficient number of observations. Izumi et al. (2015) propose arranging the knots at quantiles of the data so that each interval contains the same number of observations. The varying coefficients $\beta_j(t)$ are represented as

$$\beta_j(t) = B(t)^{\top} b_j, \tag{3}$$

where $b_j$ is the vector of regression coefficients.
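The basis construction and the equal-count knot placement described above can be sketched in Python as follows. This is a minimal illustration, assuming the truncated-linear form of the basis, $(1, t, (t-\kappa_1)_+, \ldots, (t-\kappa_r)_+)$, which is the usual basis for a continuous polyline; the function names and the coefficient values in the usage lines are hypothetical and do not come from the paper.

```python
from statistics import quantiles


def quantile_knots(times, r):
    """Place r knots at the q/(r+1) quantiles of the observation times,
    so that each interval holds roughly the same number of observations."""
    if r == 0:
        return []
    # quantiles(..., n=r+1) returns the r cut points between r+1 equal groups
    return quantiles(times, n=r + 1, method="inclusive")


def linear_spline_basis(t, knots):
    """Basis vector (1, t, (t - k_1)_+, ..., (t - k_r)_+) at time t."""
    return [1.0, float(t)] + [max(0.0, t - k) for k in knots]


def varying_coefficient(t, knots, b):
    """Evaluate beta(t) = B(t)^T b, a continuous piecewise-linear function."""
    return sum(x * c for x, c in zip(linear_spline_basis(t, knots), b))


# Illustrative usage with made-up observation times (days) and coefficients
times = [7 * i for i in range(11)]          # 0, 7, ..., 70
knots = quantile_knots(times, 1)            # one knot at the median time
beta_40 = varying_coefficient(40, knots, [0.5, 0.01, -0.02])
```

Because every basis function is continuous in $t$, any coefficient vector yields a continuous polyline whose slope can change only at the knots.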

Model for estimation
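If the number of goals gained by team k, $Y_O^{(k)}$, is assumed to follow a Poisson distribution, there exists an intensity parameter $\lambda_O^{(k)}$ such that

$$Y_O^{(k)} \sim \mathrm{Po}\bigl(\lambda_O^{(k)}\bigr).$$

These parameters may differ from team to team, so the magnitude of $\lambda_O^{(k)}$ can be interpreted as the index of the offensive performance of team k. In the same manner,

$$Y_D^{(k)} \sim \mathrm{Po}\bigl(\lambda_D^{(k)}\bigr),$$

where $Y_D^{(k)}$ denotes the number of goals lost by team k. Note that a small value of $\lambda_D^{(k)}$ indicates high defensive performance; nevertheless, we call $\lambda_D^{(k)}$ the index of defensive performance. Hereafter, we omit the team superscript $(k)$ unless this would cause confusion.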
In most professional football leagues, teams play many games over a long season. It is natural that the offensive and defensive performances of a team vary throughout a season. Therefore, it can be supposed that there exist time-varying intensity parameters $\lambda_O(t)$ and $\lambda_D(t)$ such that

$$Y_O(t) \sim \mathrm{Po}\bigl(\lambda_O(t)\bigr), \qquad Y_D(t) \sim \mathrm{Po}\bigl(\lambda_D(t)\bigr),$$

where $Y_O(t)$ and $Y_D(t)$ denote the numbers of goals gained and lost by a team in the game held at time $t$, respectively. In this paper, we regard these time-varying offensive and defensive indices as the transitions of team performances.
To estimate these indices from the observed numbers of goals gained and lost by a team, Poisson regression models with varying coefficients are suitable. We consider the simplest model, in which we set $p = 1$ and $a_1(t) = 1$ in Eqs. (1) and (3); i.e., we fit the observed numbers of goals gained and lost to the following model:

$$Y_u(t) \sim \mathrm{Po}\bigl(\lambda_u(t)\bigr), \qquad \log \lambda_u(t) = \beta_u(t) = B(t)^{\top} b_u \qquad (u = O, D), \tag{4}$$

where $B(t)$ denotes the vector of linear spline basis functions of Eq. (2). Let $n$ be the number of games that each team plays in a season. Then the actual numbers of goals gained and lost, $y_u(t_1), y_u(t_2), \ldots, y_u(t_n)$, in the games held at $t_1 < t_2 < \cdots < t_n$ are observed. By fitting these observed data to model (4), we obtain the estimates $\hat{b}_u$ of the coefficients $b_u$ and the estimates of the team performance indices

$$\hat{\lambda}_u(t) = \exp\bigl[B(t)^{\top} \hat{b}_u\bigr] \quad (u = O, D), \qquad \hat{\lambda}_T(t) = \hat{\lambda}_O(t) - \hat{\lambda}_D(t).$$

By drawing the graphs of the estimates $\hat{\lambda}_O(t)$, $-\hat{\lambda}_D(t)$, and $\hat{\lambda}_T(t)$, the transitions of team performances can be visualized. These three curves run along the observed numbers of goals gained, the observed numbers of goals lost (with the sign reversed), and the observed goal differentials, respectively.
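A minimal sketch of this fitting step is given below. It assumes the truncated-linear spline basis and maximizes the Poisson likelihood with Newton's method in plain Python; the function names (`basis`, `solve`, `fit_poisson`, `intensity`), the single knot at day 66.5, and the toy goal counts are illustrative assumptions, not the J.League data or the software used in the paper.

```python
import math


def basis(t, knots):
    """Linear spline basis (1, t, (t - k_1)_+, ..., (t - k_r)_+)."""
    return [1.0, float(t)] + [max(0.0, t - k) for k in knots]


def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_poisson(times, goals, knots, n_iter=50, tol=1e-10):
    """Maximum-likelihood estimate of b in log lambda(t) = B(t)^T b."""
    X = [basis(t, knots) for t in times]
    d = len(X[0])
    # Start from the constant model: intercept = log(mean goals), slopes = 0
    b = [math.log(max(sum(goals) / len(goals), 1e-8))] + [0.0] * (d - 1)
    for _ in range(n_iter):
        mu = [math.exp(sum(x * c for x, c in zip(row, b))) for row in X]
        # Score vector X^T (y - mu) and Fisher information X^T diag(mu) X
        grad = [sum(r[j] * (y - m) for r, y, m in zip(X, goals, mu))
                for j in range(d)]
        info = [[sum(r[j] * r[k] * m for r, m in zip(X, mu))
                 for k in range(d)] for j in range(d)]
        step = solve(info, grad)
        b = [c + s for c, s in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    return b


def intensity(t, knots, b):
    """Estimated performance index lambda_hat(t) = exp(B(t)^T b_hat)."""
    return math.exp(sum(x * c for x, c in zip(basis(t, knots), b)))


# Toy data (made up): 20 weekly games and the goals gained in each
times = [7 * i for i in range(20)]
gained = [0, 1, 2, 1, 0, 3, 1, 2, 2, 1, 0, 1, 4, 2, 1, 3, 2, 2, 1, 3]
b_hat = fit_poisson(times, gained, knots=[66.5])
curve = [intensity(t, [66.5], b_hat) for t in times]
```

Applying the same function to the goals lost gives the defensive curve, and the two fitted curves can then be plotted against the observed counts. A statistical package with a Poisson generalized linear model would serve equally well; the hand-written Newton iteration is used here only to keep the sketch self-contained.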

Example
Let us see an example. We apply the model described above to the data of the 2018 season of J.League Division 2 (J2). J2 consists of 22 teams. The teams played a double round-robin, so each team played 42 games between Feb. 25 (day 0) and Nov. 17 (day 265). We use the number of days since the season's opening day as the unit of time. Table 1 summarizes the data of the games played by Oita Trinita in the 2018 season. The columns have the following meanings: Date: the date on which the game was held, Sec.: the section number, Days: the number of days since the opening day of the season, Gained: the number of goals gained, Lost: the number of goals lost, Opponent: the name of the opposing team.
We apply the model described in the previous subsection to these data. The number of games $n$ is 42. All games are numbered by section (the "Sec." column in Table 1). Ordinarily, sections are numbered in the order in which the games are scheduled. However, games may be held out of section order because of inclement weather or other reasons. The "Days" column corresponds to the series of observation times, $t_1 = 0, t_2 = 7, \ldots, t_{42} = 265$. The observation times are ordered according to the actual passage of time, even if games are held out of section order. The "Gained" column corresponds to the observed numbers of goals gained, $y_O(t_1), \ldots, y_O(t_{42})$. The "Lost" column corresponds to the observed numbers of goals lost, that is, $y_D(t_1) = 2, y_D(t_2) = 2, \ldots, y_D(t_{42}) = 1$.
As for Eq. (2), we use three knots, $\kappa_q = \frac{q}{4} \cdot 265$ $(q = 1, 2, 3)$, so that each interval contains around ten games. Figure 1 shows the estimated transition of the performances of Oita Trinita. The top panel illustrates the estimated offensive index function $\hat{\lambda}_O(t)$ (solid line) with the three knots (vertical bars). Solid dots express the observed numbers of goals gained. In the same way, the central panel illustrates the estimated defensive index function $\hat{\lambda}_D(t)$ and the observed numbers of goals lost. These are plotted after multiplying by minus one.

Prediction of outcome probabilities
In the previous section, we obtained the transitions of team performances. In this section, we propose a method to obtain the outcome probabilities of the succeeding game using the proposed model.
Consider predicting, in advance, the outcome (which team will win?) of a game between teams k and l to be held at time $t^*$.
Suppose that team k has finished $i^{(k)}$ games, held at $t_1 < \cdots < t_{i^{(k)}}$, so far. By fitting the Poisson regression model (4) to the observed numbers of goals gained and lost, $y_u^{(k)}(t_1), \ldots, y_u^{(k)}(t_{i^{(k)}})$, the transitions of performances $\lambda_u^{(k)}(t)$ of the team up to time $t_{i^{(k)}}$ can be estimated $(u = O, D)$. Similarly, for team l, the transitions of performances $\lambda_u^{(l)}(t)$ can be estimated $(u = O, D)$. As we will see later, the allocation of the knots should be considered carefully.
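The number of goals gained by team k in the next game can be supposed to follow a Poisson distribution with a certain intensity $\lambda^{(k)}$. If the trends in the performances of team k continue, this parameter $\lambda^{(k)}$ will depend on the intensity of goals gained by team k at time $t^*$, $\lambda_O^{(k)}(t^*)$. However, if the defensive performance of the opponent, team l, is high, team k may not be able to show its full offensive performance. Conversely, if team l has low defensive performance, team k may be able to score more than expected. Thus, the intensity of goals lost by team l at time $t^*$, $\lambda_D^{(l)}(t^*)$, also influences $\lambda^{(k)}$. Consequently, it is natural to suppose that the value of $\lambda^{(k)}$ can be represented as some kind of average of $\lambda_O^{(k)}(t^*)$ and $\lambda_D^{(l)}(t^*)$. We therefore assume that the number of goals gained by team k in the next game, $Y^{(k)}$ (which equals the number of goals lost by team l), follows the Poisson distribution whose intensity parameter is the geometric mean of $\lambda_O^{(k)}(t^*)$ and $\lambda_D^{(l)}(t^*)$:

$$Y^{(k)} \sim \mathrm{Po}\Bigl(\sqrt{\lambda_O^{(k)}(t^*)\,\lambda_D^{(l)}(t^*)}\Bigr).$$

We use a geometric mean because of the structure $\lambda_u(\cdot) = \exp[\beta_u(\cdot)]$: the geometric mean of two intensities corresponds to the arithmetic mean of the two coefficients $\beta$. In the same way, the number of goals gained by team l in the next game, $Y^{(l)}$, is assumed to follow

$$Y^{(l)} \sim \mathrm{Po}\Bigl(\sqrt{\lambda_O^{(l)}(t^*)\,\lambda_D^{(k)}(t^*)}\Bigr).$$

If $Y^{(k)}$ and $Y^{(l)}$ can be considered independent, the joint distribution of $(Y^{(k)}, Y^{(l)})$ is

$$\Pr\{Y^{(k)} = y_k,\ Y^{(l)} = y_l\} = \Pr\{Y^{(k)} = y_k\}\,\Pr\{Y^{(l)} = y_l\}.$$

Hence, the probabilities of the outcomes of the next game between teams k and l can be predicted as

$$\Pr\{k \text{ wins}\} = \sum_{y_k > y_l} \Pr\{Y^{(k)} = y_k,\ Y^{(l)} = y_l\}, \quad \Pr\{\text{draw}\} = \sum_{y_k = y_l} \Pr\{Y^{(k)} = y_k,\ Y^{(l)} = y_l\}, \quad \Pr\{l \text{ wins}\} = \sum_{y_k < y_l} \Pr\{Y^{(k)} = y_k,\ Y^{(l)} = y_l\}.$$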
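The calculation of the outcome probabilities can be sketched in Python as follows. This is an illustration under the assumptions stated in the text (geometric-mean intensities and independence of the two scores); the function names, the truncation of the sums at 25 goals, and the intensity values in the usage line are hypothetical choices made for the sketch.

```python
import math


def poisson_pmf(y, lam):
    """Probability that a Poisson(lam) variable equals y."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def outcome_probabilities(lam_O_k, lam_D_k, lam_O_l, lam_D_l, max_goals=25):
    """Return (Pr[k wins], Pr[draw], Pr[l wins]) for the next game.

    lam_O_* and lam_D_* are the estimated offensive and defensive indices
    of teams k and l at the time of the game. Scores above max_goals are
    ignored, which is negligible for typical football intensities.
    """
    lam_k = math.sqrt(lam_O_k * lam_D_l)   # intensity of goals gained by k
    lam_l = math.sqrt(lam_O_l * lam_D_k)   # intensity of goals gained by l
    pk = [poisson_pmf(y, lam_k) for y in range(max_goals + 1)]
    pl = [poisson_pmf(y, lam_l) for y in range(max_goals + 1)]
    k_wins = draw = l_wins = 0.0
    for yk, p_yk in enumerate(pk):
        for yl, p_yl in enumerate(pl):
            joint = p_yk * p_yl            # independence assumption
            if yk > yl:
                k_wins += joint
            elif yk == yl:
                draw += joint
            else:
                l_wins += joint
    return k_wins, draw, l_wins


# Illustrative usage with made-up indices for two teams
p_k, p_draw, p_l = outcome_probabilities(1.8, 1.1, 1.3, 1.4)
```

The three returned values sum to one up to the truncation error, so they can be read directly as the predicted probabilities of a win for team k, a draw, and a win for team l.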

Example
We demonstrate our prediction method using the 2018 season J2 data again. Recall that each team in the league plays $n = 42$ games.
The estimation of performances in the middle of the season does not work well unless each interval contains sufficient data. We therefore change the number and the allocation of the knots of the linear spline according to the number of available observations. Let $i^*$ be the number of games played up to the current time. As in Sect. 3.2, the number of knots is determined from $i^*$, with a maximum of three, and the knots are allocated relative to $t_{i^*}$, the time at which the $i^*$-th game was held, so that each interval contains more than five observations.
An example of the performance estimation for Oita Trinita up to the 19th game is shown in Fig. 2. The performances are estimated using a linear spline with one knot ($i^* = 19$ and $t_{i^*} = 111$).
As an example, we show the prediction process for the game between Oita Trinita and Avispa Fukuoka held at time $t^* = 118$, which was the 20th game for both teams. First, the performance index functions of both teams up to the 19th game, $\hat{\lambda}_u^{(\mathrm{Oita})}(t)$ and $\hat{\lambda}_u^{(\mathrm{Fukuoka})}(t)$ $(u = O, D)$, are estimated. The numbers of goals gained by both teams, $Y^{(\mathrm{Oita})}$ and $Y^{(\mathrm{Fukuoka})}$, are then assumed to follow Poisson distributions whose intensities are the geometric means of the corresponding estimated offensive and defensive indices. The joint distribution of $(Y^{(\mathrm{Oita})}, Y^{(\mathrm{Fukuoka})})$ is shown in Table 2, from which the probabilities Pr{Oita wins at the 20th game}, Pr{draw at the 20th game}, and Pr{Fukuoka wins at the 20th game} are obtained. Our method, using the data up to the 19th game, thus predicts that Oita Trinita has a slightly better chance of winning against Avispa Fukuoka in the 20th game. The actual result of this game was a 1-0 victory for Oita Trinita over Avispa Fukuoka. This single case suggests that our method may be valid, so we went on to examine other cases.
We applied this process to all games after the sixth section in the 2018 season of J2. Table 3 shows a part of the predicted outcome probabilities, each obtained using the data up to the immediately preceding game. The columns have the following meanings: Sec.: the section number, Home team: the home (H) team, Away team: the away (A) team, Prob. H wins: the predicted probability that the home team wins, Prob. draw: the predicted probability that the game ends in a draw, Prob. A wins: the predicted probability that the away team wins, Actual outcome: the actual outcome of the game.
In some cases, such as Chiba vs. Kyoto in section 6, Okayama vs. Omiya in section 42, and Kumamoto vs. Ehime in section 42, our prediction was consistent with the actual outcome; in others it was not.

Overall appropriateness of our prediction
To see the overall appropriateness of our prediction method, consider betting on the outcome of the next game using our predicted probabilities. There are three alternatives, "home win", "draw", and "away win", and one can bet on one of these outcomes.
We introduce the following ten betting strategies:
D1: Bet on the outcome whose predicted probability by our method is the largest.
D2: Always bet on "home win".
D3: Always bet on "draw".
D4: Always bet on "away win".
D5: Bet on the outcome whose probability is the largest based on the recent three games.
D6: Bet on the outcome whose probability is the largest based on the recent five games.
S1: Randomly sample one of the three outcomes with the probabilities predicted by our method and bet on the result.
S2: Randomly sample one of the three outcomes with equal probabilities and bet on the result.
S3: Randomly sample one of the three outcomes with probabilities based on the recent three games and bet on the result.
S4: Randomly sample one of the three outcomes with probabilities based on the recent five games and bet on the result.
Six of these strategies (D1-D6) are deterministic, and four (S1-S4) are stochastic. The strategies D1 and S1 use our predicted probabilities. The "probabilities based on the recent games" used in D5, D6, S3, and S4 are determined from the results of the most recent three or five games.
Note on Table 3: underlining indicates the largest of the three predicted probabilities, and boldface indicates that our prediction is consistent with the actual outcome.
We applied these strategies to all games after the sixth section in the 2018 J2 season mentioned previously. For the deterministic strategies, Table 4 shows the proportions of correct bets. Strategy D1, which uses our prediction, seems better than the others, though not significantly.
For the stochastic strategies, we ran 10,000 trials each. Figure 3 shows the distributions of the proportions of correct bets for the four strategies. The medians are 0.3686, 0.3342, 0.3612 and 0.3612 for S1, S2, S3 and S4, respectively. Strategy S1, which uses our prediction, seems slightly better than the others.
Fig. 3 Distributions of the proportion of hitting the right answers by S1-S4
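The evaluation of a deterministic strategy of type D1 and a stochastic strategy of type S1 can be sketched in Python as follows. The predicted probabilities and actual outcomes below are made-up toy values, not entries of Table 3, and the function names and the random seed are assumptions introduced for the illustration.

```python
import random
from statistics import median

OUTCOMES = ("H", "D", "A")  # home win, draw, away win


def argmax_bet(p):
    """D1-type bet: the outcome with the largest predicted probability."""
    return OUTCOMES[max(range(3), key=lambda i: p[i])]


def hit_rate_argmax(probs, actual):
    """Proportion of games in which the D1-type bet matches the outcome."""
    hits = sum(argmax_bet(p) == a for p, a in zip(probs, actual))
    return hits / len(actual)


def hit_rates_sampling(probs, actual, trials=10_000, seed=0):
    """S1-type bets: in each trial, sample one outcome per game from the
    predicted probabilities and record the proportion of correct bets."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        hits = sum(rng.choices(OUTCOMES, weights=p)[0] == a
                   for p, a in zip(probs, actual))
        rates.append(hits / len(actual))
    return rates


# Toy example (hypothetical values): (Pr[H], Pr[D], Pr[A]) for four games
probs = [(0.50, 0.30, 0.20), (0.20, 0.30, 0.50),
         (0.30, 0.40, 0.30), (0.45, 0.25, 0.30)]
actual = ["H", "A", "H", "D"]

d1_rate = hit_rate_argmax(probs, actual)             # 0.5 for this toy data
s1_median = median(hit_rates_sampling(probs, actual, trials=1000))
```

Replacing `probs` with equal probabilities, or with probabilities derived from recent results, gives the corresponding baseline strategies under the same evaluation.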

Conclusions
If the number of goals in a football game follows a Poisson distribution, we can regard the intensity parameter of the distribution as an indicator of the team's performance. Besides, the intensity can be supposed to vary over time.
In this paper, we first proposed a method for estimating the time-varying performance of football teams. We introduced the varying coefficient model into sports data analysis and demonstrated that it can be used to estimate the time transition of team performances. For the estimation of semi-parametric varying coefficients, the mixed effect model can be used, as in Izumi et al. (2015, 2017). However, we used the simple generalized linear model here because the estimated coefficients of the mixed effect model require careful interpretation.
Our performance estimation is also available for the data up to the middle of the season. We proposed a method to predict the probabilities of the next game's outcome based on estimated performance up to the previous game. It was applied to the one-year data of the Japanese professional football league, and we examined the appropriateness of our method.
We used the simplest varying coefficient model, which contains no covariates other than time. The effect of home advantage in football games has been discussed in several studies (e.g., Goumas, 2014; Koopman & Lit, 2015; Saraiva et al., 2016). Other factors that may influence a football team's performance, e.g., the weather or the size of the attendance, can also be considered. Adding such environmental covariates is worth considering.
The combination of players and team formation are also essential factors in the performances of football teams (Hirotsu & Ueda, 2015). Considering these factors may improve the results in the phases of performance estimation or outcome prediction.
In the prediction phase, we assume independence between the number of goals of both teams in a game. However, the appropriateness of this assumption should be discussed more carefully.
Finally, careful investigation of errors and confidence intervals of estimation and prediction remains for future research.