Keywords

1 Introduction

In a football game, players pass the ball to their teammates in order to create good shooting opportunities or prevent the opposing team from getting the control of the ball. Accurately passing the ball to the right player is essential for winning a football game [1, 6].

Prior work [6, 9] studies how passing sequences lead to goals. Their findings have shaped the tactics of many football coaches. In this work, we build a learning to rank model [8] to anticipate the receiver of a football pass. We believe that football coaches and players can take our results into consideration when they make their tactics or make their passes/runs. For example, a player could learn from the important factors for explaining the receiver of a pass to improve his/her chance of receiving the ball. Anticipating receivers of passes can also help automated cameras always focus on the ball in a game.

This work analyzes a dataset which contains 12,124 passes performed by a Belgian football club in 14 gamesFootnote 1. We want to answer the following research questions (RQs):

  • RQ1: How well can we model the receiver of a pass? We build a learning to rank model to predict the receiver of a football pass. An accurate model can help coaches and players make informed tactical decisions in a game.

  • RQ2: What are the important factors that explain the receiver of a pass? We analyze the model to find the most influential factors that explain the receiver of a pass. Understanding such influential factors can help coaches and players improve their tactics and passes/runs according to these factors.

Paper organization. The remainder of the paper is organized as follows. Section 2 explores the dataset that we use. Section 3 discusses our approaches for building and evaluating our prediction model. Section 4 presents the results for answering our research questions. Finally, Sect. 5 draws conclusions.

2 Data Exploration

Dataset overview. The dataset contains information about 12,124 football passes. For each pass, the dataset provides the information about the time of the pass since the start of the half, the coordinates of all the players on the pitch, the identifier of the player who passes the ball, and the identifier of the player who receives the ball.

Dealing with missing values. Our goal is to predict the receiver of a pass based on the information about the sender and other players on the pitch (i.e., the candidate receivers). Among the 12,124 passes, there is one pass that misses the coordinates of the sender, and one pass that misses the coordinates of the receiver. For another six passes, the senders and the receivers are the same players. We remove the above-mentioned eight data instances from our dataset. We end up with 12,116 valid passes in our dataset.

Table 1. A summary of players’ passing statistics.

Overall, players’ passing accuracy is 83%, and the passing accuracy decreases from the back field to the front field. Table 1 shows a summary of players’ passing statistics. We define the passing accuracy as the ratio of the passes that reach a teammate. We divide the field into three equally sized areas along the long side of the field, namely back field, middle field and front field. We define a pass as a back-field pass, middle-field pass, or front-field pass when the sender is within the back field, the middle field and the front field, respectively.

The median passing distance is 14 m, and the passing distance decreases from the back field to the front field. Table 2 shows the five-number summary of players’ passing distance. While the maximum passing distance is 70 m, 75% of the passes are within 20 m. As shown in Table 1, the median passing distance for the back field, middle field, and front field is 17, 14, and 11 m, respectively.

Table 2. Five-number summary of players’ passing distance.

Players pass the ball forwards in 62% of the passes, and the ratio of forward passes decreases from the back field to the front field. In the back field, players pass the ball forwards in 74% of the passes, and the ratio decreases to 61% and 50% for middle-field passes and front-field passes, respectively.

figure a

3 Methodology for Predicting the Receivers of Football Passes

This section discusses our overall methodology, including our feature extraction process, modeling and evaluation approaches.

3.1 Feature Extraction

From the dataset that we explain in Sect. 2, we extract five dimensions of features to explain the likelihood of passing the ball to a certain receiver. In total, we extract 54 features. A full list of our features is available at our public github repositoryFootnote 2. We also share our extracted feature values onlineFootnote 3.

  • Sender position features. This dimension of features capture the position of the sender on the field, such as the sender’s distance to the other team’s goal. We choose this dimension of features because players have different passing strategies at different positions, for example, players may pass the ball more conservatively in their own half.

  • Candidate receiver position features. This dimension of features capture the position of a candidate receiver, such as the candidate receiver’s distance to the sender. Senders always consider candidate receivers’ positions when they decide to whom to pass the ball.

  • Passing path features. This dimension of features measure the quality of a passing path (i.e., the path from the sender to a candidate receiver), such as the passing angle. The quality of a passing path can predict the outcome (success/failure) of a pass.

  • Team position features. This dimension of features capture the overall position of the team in control of the ball, such as the front line of the team. Team position might also impact the passing strategy, for example, a defensive team position might be more likely to pass the ball forwards.

  • Game state features. This dimension of features capture the state of the whole game, such as the time when the sender passes the ball. We do not use the time when the receiver receives the ball as a feature in our model, as it exposes information about the actual pass (e.g., pass duration).

3.2 Modeling Approach

We formulate the task of predicting the receiver of a football pass as a learning to rank problem [8]. For each pass, our learning to rank model outputs a ranked list of the candidate receivers. A good model should rank the correct receiver at the front of the ranked list. LambdaRank [2] is a general and widely-used learning to rank framework. LambdaRank relies on underlying regression models to provide ranking predictions. LambdaMART [2] is the boosting tree version of LambdaRank. It relies on a gradient boosting decision tree (GBDT) [5] to provide ranking predictions. There are quite a few effective implementations of LambdaMART, such as XGBoost and pGBRT, which usually achieve state-of-the-art performance in learning to rank tasks.

In this work, we use an efficient implementation of LambdaMART, LightGBM [7], which speeds up the training time of conventional LambdaMART implementations (e.g., XGBoost and pGBRT) by up to 20 times while achieving almost the same accuracy. We use an open source implementation of LightGBM that is contributed by MicrosoftFootnote 4.

We use a 10-fold cross-validation to build and evaluate our model. The passes data is randomly partitioned into 10 subsets of roughly equal size. We build our model using nine subsets (i.e., the model building data) and evaluate the performance of our model on the held-out subset (i.e., the testing data). The process repeats 10 times until all subsets are used as testing data once.

In each fold, we further split the model building data into the training data and validation data. We train the model on the training data and use the validation data to tune the hyper-parameters of the model. We do a grid search to get the top three sets of hyper-parameter values according to the performance of the model on the validation data. Then, we build three models with these three set of hyper-parameters using the training data. We apply these three models on the testing data and get three sets of results. We then average the results for each receiver candidate and use the averaged results to rank the receiver candidates. We find that with such an ensemble modeling approach, the accuracy of our model improves up to 2%.

3.3 Baseline Models

In order to evaluate the performance of our LightGBM ranking model, we compare it with several baseline models. As discussed in Sect. 2, 75% of the passes are within 20 m (i.e., short passes), and 62% of the passes are forward passes. Therefore, we derive baseline models that tend to select the nearest teammates and the teammates in the forward direction as the receiver.

  • The RandomGuess model selects the receiver of a pass by a random guess. It randomly ranks the candidate receivers.

  • The NearestPass model selects the nearest teammate of the sender as the top candidate receiver. It ranks the candidate receivers by their distance to the sender, from the teammates of the sender to the opponents, and then from the closest to the furthest.

  • The NearestForwardPass model selects the nearest teammate of the sender that is in the forward direction (relative to the sender) as the top candidate receiver. It ranks the candidate receivers by their relative position to the sender, from the teammates of the sender to the opponents, then from the players in the forward direction to the players in the backward direction, and finally from the closest to the furthest.

3.4 Evaluation Approaches

We use top-N accuracy and mean reciprocal rank (MRR) to measure the performance of our model. Top-N accuracy measures the accuracy of the model’s top-N recommendations, i.e., the probability that the correct receiver of a pass appears in the top-N receiver candidates that are predicted by the model. For example, top-1 accuracy measures the probability that the correct receiver of a pass is the first player in the predicted list of receiver candidates.

Reciprocal rank is the inverse of the rank of the correct receiver of a pass in an ranked list of candidate receivers predicted by the model. MRR [3] is the average of the reciprocal ranks over a sample of passes P:

$$\begin{aligned} \text {MRR} = \frac{1}{|P|}\displaystyle \sum _{p=1}^{|P|}\frac{1}{\text {rank}_p} \end{aligned}$$
(1)

where \(\text {rank}_p\) is the rank of the correct receiver for the pth pass. The reciprocal value of MRR corresponds to the harmonic mean of the ranks. MRR ranges from 0 to 1, the larger the better. While top-N accuracy captures how likely the correct receiver appears in the top-N predicted receivers, MRR captures the average rank of the correct receiver in the predicted list of receiver candidates.

As discussed in Sect. 3.2, we use a 10-fold cross-validation to build and evaluate our model. Therefore, we use a mean top-N accuracy and MRR across the 10 folds in Sect. 4.

3.5 Feature Importance

In order to understand the importance of the features in our model, we use the feature importance scores that are automatically provided by a trained LightGBM model. Gradient boosting decision trees (e.g., LightGBM) provide a straightforward way to retrieve the importance scores of each feature [4].

Table 3. The accuracy of our model for predicting the receiver of a pass (excluding false passes).

After the boosting decision trees are constructed, for each decision tree, the importance of a feature is calculated by the amount that the feature improves the performance measure at its split point (i.e., split gains). The importance of each feature is then accumulated across all of the decisions trees in the model.

4 Results

This section discusses the answers to our research questions.

4.1 RQ1: How Well Can We Model the Receiver of a Pass?

Our model can predict the receiver of a pass with a top-1, top-3 and top-5 accuracy of 50%, 84%, and 94%, respectively, when we exclude false passes (i.e., passes to the other team). Table 3 shows the performance of our model when we exclude false passes. The “Back-field”, “Middle-field”, “Front-field” and “Overall” columns show the performance of our model for back-field passes, middle-field passes, front-field passes and all passes, respectively. A top-3 accuracy of 84% for all passes means that the actual receiver of a pass has a 84% chance to appear in our top-3 predicted candidates. The MRR value for all passes is 0.68, which means on average, the correct receiver is ranked 1.5th (i.e., 1/0.68) out of 10 or less receiver candidates (i.e., all teammates of the sender).

Table 4. Comparing the accuracy of our model with baseline models (excluding false passes).
Table 5. The accuracy of our model for predicting the receiver of a pass (considering all passes including passes to the other team).
Table 6. Comparing the accuracy of our model with baseline models (considering all passes including passes to the other team).

Our model can predict the receiver of a pass with a top-1, top-3 and top-5 accuracy of 41%, 70%, and 81%, respectively, when we consider all passes. Table 5 shows the performance of our model when we consider all passes (including false passes). The performance of our model decreases when we consider false passes (i.e., passes to the other team). False passes are very difficult to predict because it is not the sender player’s intention to pass the ball to the other team. The MRR value for all passes is 0.58, which means the correct receiver is averagely ranked 1.7th (i.e., 1/0.58) out of all 21 or less candidate receivers (i.e., all players excluding the sender).

Our model performs better for back-field and front-field passes, while performing worse for middle-field passes. Tables 3 and 5 also shows the performance of our model for back-field, middle-field and front-field passes, separately. Surprisingly, the performance of our model is the worst for middle-field passes. A player in the middle area may have more passing options, thereby increasing the difficulty to predict the right receivers.

Our model perform better than the baseline models. Tables 4 and 6 compare the performance of our LightGBM model with the RandomGuess, NearestPass, and NearestForwardPass models which are described in Sect. 3. Our LightGBM model consistently show much better performance than the three baseline models in terms of the top-N accuracy and MRR. The NearestPass model, which tends to pass the ball to the nearest teammates, achieve a better performance than the NearestForwardPass, which tends to pass the ball to the nearest teammates in the forward direction relative to the sender. Both of the NearestPass and NearestForwardPass baseline models achieve a much better performance than randomly guessing the receiver of a pass.

Table 7. The combined importance of each feature dimension.
Table 8. The ten most important features and their importance scores.
figure b

4.2 RQ2: What Are the Important Factors that Explain the Receiver of a Pass?

The features that capture the candidate receivers’ positions play the most important roles in explaining the receiver of a pass. Table 7 shows the combined importance of each feature dimension in our model. The combined importance is a sum of the importance scores of all the individual features in a dimension. The features from the dimension of candidate receiver position, which capture a candidate receiver’s position on the pitch and his/her position relative to the teammates and opponents, have the biggest combined importance score in our model. The team position features, which captures the overall position of the team in control of the ball, are the second important dimension in explaining the receiver of a pass. The third important feature dimension (i.e., sender position) captures the sender’s position on the pitch and his/her position relative to the teammates and opponents. The passing path features, which captures the characteristics of a passing path (i.e., the path from the sender to a candidate receiver), also play a significant role in explaining the receiver of a pass.

The most important features capture the candidate receivers’ positions relative to the sender and the opponents. Table 8 lists the top ten features that are most important in our model and their respective importance scores. A full list of our features’ important scores is available online (See footnote 4). Among the top 10 important features, there are eight features from the dimension of candidate receiver position. All of the top six features are from the dimension of candidate receiver position, among which three features capture a candidate receiver’s relative position to the sender, and the other three features capture a candidate receiver’s relative position to the components. The other two features from the dimension of candidate receiver position capture a candidate receiver’s position on the pitch and his/her relative position to the teammates, respectively. Among the top 10 important features, there are also one from the sender position dimension (i.e., the sender_closest_opponent_dist feature), and one from the passing path dimension (i.e., the min_pass_angle feature).

figure c

5 Conclusions

This work proposes a novel approach to predict the receivers of football passes. We analyze a dataset containing 12,124 passes from 14 real-world football games and discuss players’ passing characteristics. We find that players present different passing characteristics in different areas of the field. We then extract 54 features along five dimensions and build a LightGBM model to predict the receiver of a pass. Our model achieves a top-1, top-3, and top-5 accuracy of 50%, 84%, and 94%, respectively, when we exclude false passes. Our model outperforms three baseline models that we use to rank the candidate receivers of a pass. We find that the features that capture the positions of the candidate receivers play the most important roles in explaining the receiver of a pass. We believe that our approaches and findings can help football practitioners better understand the factors that impact the receiver of a pass and make informed tactical decisions.