1 Introduction and literature review

Data is now extensively gathered and scrutinized in the field of sports by combining physical and digital sources. This integration significantly improves the understanding of professional sports for all stakeholders. The statistical analysis of sports data has the potential to refine decision-making processes concerning player and team performance, player health and safety, fan engagement, marketing strategies, revenue generation, sports economics, practice, and overall well-being. Data sources encompass both private and public institutions, the Internet of Things, and social networks. The choice of statistical learning method depends on the sport, the nature of the data, and the specific objectives of the analysis. Data is systematically collected during both training sessions and matches to extract valuable insights into factors influencing the success of players and teams. These factors encompass the fairness of competition, player assessment, scheduling, tactics, identification of key performance indicators, drafting, rule-making, and ranking. The availability and analysis of data contribute significantly to improving the accuracy of forecasts of match outcomes (the winner) and to understanding the underlying factors influencing these outcomes. The playing characteristics of players are important from both a technical and an economic point of view. From the technical point of view, they make it possible to evaluate the playing characteristics that lead the player and the team to achieve winning results; from the economic point of view, they make it possible to establish the value of a player.

In the literature, empirical studies and methodological proposals based on data science and data-driven approaches have been carried out on many sports disciplines to analyze the large mass of sport data both in the field of performance and in the medical, social or economic fields. Recent contributions can be found in the special issues “Statistical Modelling for Sports Analytics” by Groll et al. (2018) and “Big data and data science in sport” by D’Urso et al. (2023).

Clustering of sport data has been proposed based on traditional clustering approaches and on the theory of networks, either based on modularity (Fortunato 2010; de Arruda et al. 2012) or on a mixture model and the expectation-maximization technique (Snijders and Nowicki 1997). See Ribeiro et al. (2017), Ramos et al. (2018) for the use of network-based approaches in sport.

Papers specifically on clustering of sports data in football are Lu and Tan (2003), Gates et al. (2017), Narizuka and Yamazaki (2019), Zachary et al. (2020), D’Urso et al. (2023), Carpita et al. (2023) and in basketball, Behravan and Razavi (2021), Zuccolotto et al. (2018), Ulas (2021), Chessa et al. (2023). In Lu and Tan (2003) an unsupervised clustering of dominant scenes in sports video is presented, in which data are preprocessed by Principal Components and Linear Discriminant Analysis. Gates et al. (2017) propose an unsupervised classification method that defines subgroups of individuals that have similar dynamic models. They apply this method to functional MRI data from a sample of former American football players. Narizuka and Yamazaki (2019) develop a clustering algorithm to extract transition patterns of the formation of a given team during the game. Zachary et al. (2020) use K-means clustering to create training groups for elite American football student-athletes based on game demands. In D’Urso et al. (2023) the authors develop a robust fuzzy clustering model for mixed data. For each variable, or attribute, a dissimilarity measure is proposed, and the clustering procedure combines the dissimilarity matrices with weights objectively computed during the optimization process. The weights reflect the relevance of each attribute type in the clustering results. The model is used to cluster football players with respect to mixed data on performances. In Carpita et al. (2023) the authors investigate the ability of various composite indicators to define a measurement structure for global football performance. The theoretical football performance dimensions are based on a set of 29 players’ attributes periodically produced by Electronic Arts (EA) Sports experts. The players’ performance attributes or variables are considered and processed with three different techniques: the Cluster of variables around Latent Variables (CLV), the Principal Covariates Regression (PCovR) and Bayesian Model-Based Clustering (B-MBC). The resulting clusters are then embedded into structural equation models with Partial Least Squares (PLS-SEMs) with a Higher-Order Component (that is, the overall football performance). Results show the validity of composite indicators.

In Zuccolotto et al. (2018) the authors use random forests and extremely randomized trees to represent maps of the court visualizing areas with different levels of scoring probability of the analysed player or team. The approaches are demonstrated by the analysis of data from the NBA regular season 2020/2021. In Ulas (2021) NBA teams’ characteristics and similarities were assessed firstly with Machine Learning techniques (K-means and Hierarchical clustering) and secondly with Ordinary Linear Regression (OLS) to investigate the factors that affect the NBA team values. In Chessa et al. (2023) the authors propose the use of a weighted complex network to detect communities of basketball players on the basis of their performances. A sparsification procedure to remove weak edges is also applied, confirmed by the normalized mutual information, so that not only the best distribution of nodes into communities is found, but also the ideal number of communities. An application to community detection of basketball players for the NBA regular season 2020–2021 is presented.

Tennis is an individual sport, with the exception of team competitions such as the Davis Cup, the premier international team event in men’s tennis. A review of methods of data collection in tennis can be found in Takahashi et al. (2023). A review of models of data analysis in tennis is given in Kovalchik (2021). Kovalchik observes that despite the extensive historical application of statistical methods to tennis, the current state of analytical work in the sport appears to be trailing behind most professional sports, and delves into the reasons why data-driven methods in tennis have struggled to gain popularity. Unlike baseball, where statistical tabulation has been a staple since the introduction of the first box score in 1845, organizers of tennis competitions have historically neglected to quantify their sport. This oversight stems directly from the decentralized structure and fragmentation among multiple promoters of tennis events, including the International Tennis Federation, the Grand Slam Board, the ATP Tour, and the WTA.

Probabilistic models were first used in tennis to assess strategy, with an early example being the examination of optimal service strategy by George (1973). This work demonstrated that the expected point value of a standard two-service strategy could be formalized as the weighted sum of winning a point on a strong serve and a weak serve, each weighted by the probability that the serve was played and was good.
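As an illustration of the verbal description above, the point-winning probability of a strong-then-weak serving strategy can be written as follows. The notation is introduced here for illustration only and is not taken from George (1973): \(p_1\) and \(p_2\) denote the probabilities that the strong and the weak serve are good, and \(w_1\) and \(w_2\) the probabilities of winning the point given that the corresponding serve is good.

$$\begin{aligned} P(\text{server wins the point}) = p_1 w_1 + (1-p_1)\,p_2\,w_2. \end{aligned}$$

The weak serve is only played after a fault on the strong serve, which is why its term carries the additional factor \(1-p_1\).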

Two decades after the initial mathematical studies in the 1970s, tennis experienced a significant surge in statistical research. Leading this wave were Franc Klaassen, a former national junior player for the Netherlands who entered the economics doctoral program at the University of Tilburg in 1995, and Jan Magnus, a Tilburg professor. Their enduring research partnership began with the application of quantitative analysis to various "tennis myths" that had never been scientifically tested (Magnus and Klaassen 1996). In the process, they delved into the independent identically distributed model more extensively than any study before. By 2001, Klaassen and Magnus had validated the model against outcomes from 90,000 points played at Wimbledon, marking the first instance of large-scale statistical analysis in tennis (Klaassen and Magnus 2001).

Throughout two decades of research into the statistical aspects of tennis, prediction emerged as a predominant theme. Using a paired comparison framework, Klaassen and Magnus pioneered model-based approaches to predict the most likely winner of points in tennis matches and explore contextual factors influencing a player’s win probability (Klaassen and Magnus 2003). Their work propelled tennis prediction beyond the mathematical models of the 1970s, establishing the fitting of statistical models to large competitive datasets as the new standard. Subsequent researchers built on this foundation, measuring and testing predictors of tennis outcomes and creating more sophisticated models of tennis performance (Kovalchik 2016).

By the 2000s, there was a growing body of statistical research on tennis, highlighting a prevalent data problem in the sport. The introduction of the Hawk-Eye player challenge system at the 2006 U.S. Open marked a pivotal moment. This multi-camera tracking system for line-call review not only addressed the data problem but also positioned tennis at the forefront of officiating innovations. Being among the first sports to adopt a positional tracking system presented tennis with an opportunity to compete in the big data race in sports, aligning it with major leagues in terms of technological advancements.

In the recent literature aimed at predicting the outcomes of sporting events, tennis still plays a prominent role with a variety of methods. Arcagni et al. (2023) extend the class of paired comparison models by using indicators derived from the theory of complex networks for the predictions. They propose a measure based on eigenvector centrality. Unlike what happens for the standard paired comparisons class (where the ratings or latent abilities only change at time t for those players involved in the matches at time t), the use of a centrality measure allows the ratings of the whole set of players to vary every time there is a new match. The resulting ratings are then used as a covariate in a simple predictive logit model. In Tea and Swartz (2023) the authors investigate intended serve direction with Bayesian hierarchical models applied to an extensive data source of professional tennis players at Roland Garros. They find discernible differences between men’s and women’s tennis, and between individual players. General serve tendencies such as the preference for serving towards the body on second serve and on high pressure points are revealed.

The presented literature has shown the importance of partitioning and clustering of players on the basis of performance, position, competitions attended and other variables. This paper proposes a clustering model in tennis. The model aims at targeting some relevant issues for clustering tennis players and tournaments: (i) it considers players, tournaments and the relation between them; (ii) the relation is taken into account in the fuzzy clustering model based on the Partitioning Around Medoids (PAM) algorithm through spatial constraints; (iii) the attributes of the players and of the tournaments are of different nature, qualitative and quantitative.

The proposal is novel for the methodology used, a spatial fuzzy clustering model (cf. Coppi et al. 2010) for players and for tournaments (based on related attributes), where the spatial penalty term in each clustering model depends on the relation between players and tournaments described in the adjacency matrix. The proposed model is compared with a clustering model that considers only the relation between players and tournaments: a bipartite player-tournament complex network model (the Degree-Corrected Stochastic Blockmodel) is fitted to the adjacency matrix to obtain communities on each side of the bipartite network.

Even though communities form around nodes that have common edges and common attributes, typically, algorithms have only focused on one of these two data modalities: community detection algorithms traditionally focus only on the network structure, while clustering algorithms mostly consider only node attributes.

The paper is structured as follows. In Sect. 2 the data and models used are presented. Section 3 reports the results of the application of the models to clustering of tennis players and tournaments. Section 4 concludes the paper and provides directions for future work.

2 The models

In this section, we give an overview of the data used in the paper in Sect. 2.1 and then define and explain the two clustering algorithms we are going to apply to these data: the Spatially-corrected fuzzy Partition Around Medoids (Sect. 2.2) and the Degree-Corrected Stochastic Blockmodel (Sect. 2.3).

2.1 The data

For the analysis in this paper, we use data taken from the official ATP website (ATP 2023) for the draws of the tournaments, and from the sports statistics website Wheelo Ratings (Wheelo 2023) for the performance data of players and tournaments.

The data is organized as follows:

  • Matrix \(\textbf{X}=\{x_{ni}, n \le N,i \le I\}\) of player data, recording \(I=21\) attributes for each of the \(N=136\) players that played at least 10 matches on the ATP Tour.

  • Matrix \(\textbf{Y}=\{y_{sj},s\le S,j\le J\}\) of tournament data, recording \(J=18\) attributes for each of the \(S=64\) individual tournaments of the ATP tour, after excluding the ATP finals and the team competitions.

  • Adjacency matrix \(\textbf{A}\), in which the rows correspond to players and columns to tournaments. Here, \(a_{n,s}=1\) if player n participated in the main draw of tournament s and \(a_{n,s}=0\) otherwise. The matrix \(\textbf{A}\) is visualized in Fig. 1.

2.2 Spatially-corrected fuzzy partition around medoids

The first analysis we perform is based on the application of two distinct versions of Fuzzy Partition around Medoids (PAM) with Spatial Penalty, using different distances for player and tournament attributes due to the different nature of the data. Note that we use the term spatial to refer to the correction to the model due to the network structure, as this is the standard term used in the literature. As can be deduced from the nature of the data, it is not to be understood as adjacency in a physical space but in an abstract sense in the bipartite network. The goal is to find an optimal fuzzy partition of each set that clusters together units that are similar both with regard to the attributes in the matrices \(\textbf{X}\) and \(\textbf{Y}\) and the adjacency structure in the matrix \(\textbf{A}\). We follow an approach similar to the one outlined in Pham (2001), but with some modifications necessary to take into account the bipartite structure of the adjacency matrix \(\textbf{A}\). Here, in the data there is no direct measure of adjacency among players or among tournaments, but only between players and tournaments, based on participation, as encoded in \(\textbf{A}\).

To compute the similarity in the adjacency relations between units on the same side, from the adjacency matrix \(\textbf{A}\), we create two distinct similarity matrices \(\mathbf {B^{(p)}}\) for players and \(\mathbf {B^{(t)}}\) for tournaments, applying the cosine similarity (see Wael and Aly 2013) respectively to the rows and columns of the matrix \(\textbf{A}\).

Fig. 1 Visual representation of the adjacency matrix \(\textbf{A}\). Black squares correspond to 1s

That is, we have for every two players h and l

$$\begin{aligned} b^{(p)}_{hl}=\frac{\sum _{s=1}^{S} a_{hs}a_{ls}}{\Big (\sum _{s=1}^{S} a_{hs}\sum _{s=1}^{S} a_{ls}\Big )^{1/2}}. \end{aligned}$$
(1)

By definition \(b^{(p)}_{hl}\in [0,1]\), with \(b^{(p)}_{hl}=1\) if both players have played exactly the same tournaments and \(b^{(p)}_{hl}=0\) if the tournaments they played have no overlap.

Similarly, for every two tournaments h and l we have

$$\begin{aligned} b^{(t)}_{hl}=\frac{\sum _{n=1}^{N} a_{nh}a_{nl}}{\Big (\sum _{n=1}^{N} a_{nh}\sum _{n=1}^{N} a_{nl}\Big )^{1/2}}. \end{aligned}$$
(2)

Also here, \(b^{(t)}_{hl}=1\) means that the draws of tournaments h and l contained exactly the same players and \(b^{(t)}_{hl}=0\) means that no player competed in both tournaments h and l. The reason why the cosine similarity is viable as a measure of the proximity of units is that the resulting matrices \(\mathbf {B^{(p)}}\) and \(\mathbf {B^{(t)}}\) are symmetric, and, since the original matrix \(\textbf{A}\) is non-negative, all the entries in \(\mathbf {B^{(p)}}\) and \(\mathbf {B^{(t)}}\) will be non-negative too.
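The construction of the two similarity matrices can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy adjacency matrix, not the code used for the analysis; the function names are hypothetical. Since \(\textbf{A}\) is binary, the squared row norms reduce to the row sums that appear in the denominators of (1) and (2).

```python
import math

def cosine_similarity_rows(A):
    """Cosine similarity between every pair of rows of a binary matrix A."""
    sums = [sum(row) for row in A]  # for 0/1 entries, sum of squares == sum
    n = len(A)
    B = [[0.0] * n for _ in range(n)]
    for h in range(n):
        for l in range(n):
            shared = sum(a * b for a, b in zip(A[h], A[l]))
            denom = math.sqrt(sums[h] * sums[l])
            # Convention assumed here: units with no edges get similarity 0
            B[h][l] = shared / denom if denom > 0 else 0.0
    return B

def transpose(A):
    return [list(col) for col in zip(*A)]

# Toy example (made-up): 3 players (rows) x 4 tournaments (columns)
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
B_p = cosine_similarity_rows(A)             # player-player similarity
B_t = cosine_similarity_rows(transpose(A))  # tournament-tournament similarity
```

In this toy example the first two players entered exactly the same tournaments, so their similarity is 1, while the third player shares no tournament with them and has similarity 0.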

We then apply fuzzy spatial partition-around-medoids algorithms on the matrices \(\textbf{X}\) and \(\mathbf {B^{(p)}}\) on one side, and \(\textbf{Y}\) and \(\mathbf {B^{(t)}}\) on the other.

In practice, we want to find two different matrices of membership degree \(\textbf{U}\) and \(\textbf{W}\)

$$\begin{aligned} \textbf{U}:= \{u_{nc}, n \le N, c\le C\}, \quad \textbf{W}:= \{w_{se}, s \le S, e\le E\}, \end{aligned}$$
(3)

where \(u_{nc}\) represents the degree of membership of player n to cluster c and \(w_{se}\) the degree of membership of tournament s to cluster e. Furthermore, each of the two partitions comes with a set of prototypes, called medoids: C medoids \((\textbf{x}_1,\ldots ,\textbf{x}_c,\ldots ,\textbf{x}_C)\) for the players and E medoids \((\textbf{y}_1,\ldots ,\textbf{y}_e,\ldots ,\textbf{y}_E)\) for the tournaments. Each medoid \(\textbf{x}_c\), for \(c \le C\), is chosen among the N observed units \(\textbf{x}_n=(x_{n1},\dots ,x_{nI}) \), with \(n\le N\), and each medoid \(\textbf{y}_e\), for \(e \le E\), is chosen among the S observed units \(\textbf{y}_s=(y_{s1},\dots ,y_{sJ}) \), with \(s\le S\). Memberships and medoids are obtained by solving the following minimization problems.

To cluster the players we optimize

$$\begin{aligned} {\begin{matrix} \min _{\textbf{U}, (\textbf{x}_1,\ldots ,\textbf{x}_c,\ldots ,\textbf{x}_C)}:&{} \sum \limits _{n=1}^{N}\sum \limits _{c=1}^{C}u_{nc}^{m_1}d^2(\textbf{x}_{n},\textbf{x}_{c}) +\frac{\beta _1}{2}\sum \limits _{n=1}^{N}\sum \limits _{c=1}^{C}u_{nc}^{m_1} \sum \limits _{n'=1}^{N}\sum \limits _{{c'\ne c}}b^{(p)}_{nn'}u_{n'c'}^{m_1}\\ s.t. &{} \sum \limits _{c=1}^{C}u_{nc}=1,\;u_{nc}\ge 0. \end{matrix}} \end{aligned}$$
(4)

Here the parameter \(m_1 \ge 1\) tunes the fuzziness of the partition and the parameter \(\beta _1 \ge 0\) the importance of the spatial regularization based on the cosine similarity matrix \(\mathbf {B^{(p)}}\). In this case \(d(\textbf{x}_{n},\textbf{x}_{c})\) is the Euclidean distance in \(\mathbb R^{I}\) between the attributes of the unit n and the medoid of the cluster c.
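To make the two terms of the player objective concrete, the following Python sketch evaluates it for a given membership matrix and a given choice of medoids. It only evaluates the objective and does not perform the optimization; the function name, the argument layout and the toy data are assumptions made for illustration.

```python
def player_objective(X, U, medoid_idx, B, m=1.15, beta=0.1):
    """Evaluate the attribute term plus the spatial penalty for the players.

    X          : list of attribute vectors, one per unit
    U          : membership matrix, U[n][c] in [0, 1], rows summing to 1
    medoid_idx : index in X of the medoid of each cluster
    B          : similarity matrix between units (e.g. cosine similarity)
    """
    N, C = len(U), len(medoid_idx)
    Um = [[U[n][c] ** m for c in range(C)] for n in range(N)]

    # Attribute term: fuzzy-weighted squared Euclidean distances to the medoids
    attr = 0.0
    for n in range(N):
        for c, idx in enumerate(medoid_idx):
            d2 = sum((a - b) ** 2 for a, b in zip(X[n], X[idx]))
            attr += Um[n][c] * d2

    # Spatial term: penalizes similar units that belong to different clusters
    spatial = 0.0
    for n in range(N):
        for c in range(C):
            for n2 in range(N):
                other = sum(Um[n2][c2] for c2 in range(C) if c2 != c)
                spatial += Um[n][c] * B[n][n2] * other
    return attr + 0.5 * beta * spatial

# Toy data (made-up): two units with different attributes but similarity 0.5
X = [[0.0], [2.0]]
B = [[1.0, 0.5], [0.5, 1.0]]
U_split = [[1.0, 0.0], [0.0, 1.0]]   # each unit in its own cluster
U_merged = [[1.0, 0.0], [1.0, 0.0]]  # both units in cluster 1

split_low = player_objective(X, U_split, [0, 1], B, beta=0.1)
merged_low = player_objective(X, U_merged, [0, 1], B, beta=0.1)
split_high = player_objective(X, U_split, [0, 1], B, beta=100.0)
merged_high = player_objective(X, U_merged, [0, 1], B, beta=100.0)
```

In this toy setting the split partition has the lower objective for the small value of \(\beta _1\), while for the large value the merged partition wins, because the spatial term only penalizes separating similar units and never penalizes grouping dissimilar ones.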

Similarly, to cluster the tournaments we optimize

$$\begin{aligned} {\begin{matrix} \min _{\textbf{W}, (\textbf{y}_1,\ldots ,\textbf{y}_e,\ldots ,\textbf{y}_E)}:&{} \sum \limits _{s=1}^{S}\sum \limits _{e=1}^{E}w_{se}^{m_2}d_G^2(\textbf{y}_{s},\textbf{y}_{e}) +\frac{\beta _2}{2}\sum \limits _{s=1}^{S}\sum \limits _{e=1}^{E}w_{se}^{m_2} \sum \limits _{s'=1}^{S}\sum \limits _{{e'\ne e}}b^{(t)}_{ss'}w_{s'e'}^{m_2}\\ s.t. &{} \sum \limits _{e=1}^{E}w_{se}=1,\;w_{se}\ge 0. \end{matrix}} \end{aligned}$$
(5)

Here the parameter \(m_2 \ge 1\) tunes the fuzziness of the partition and the parameter \(\beta _2 \ge 0\) the importance of the spatial regularization based on the cosine similarity matrix \(\mathbf {B^{(t)}}\). Here, \(d_G(\textbf{y}_{s},\textbf{y}_{e})\) is Gower’s distance (see Gower 1971) between the attributes of the unit s and those of the medoid of the cluster e. Gower’s distance is chosen as the matrix \(\textbf{Y}\) contains some columns with qualitative attributes.
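A minimal sketch of Gower’s distance for mixed attributes is given below. It assumes equal weights for all attributes, range-normalized absolute differences for numeric attributes and simple matching for qualitative ones; the exact variant and weighting used in the analysis are not specified here, and the tournament records are made up for illustration.

```python
def gower_distance(u, v, numeric_ranges):
    """Gower's distance between two mixed-type records.

    numeric_ranges[j] is the observed range (max - min) of attribute j if it
    is numeric, and None if attribute j is qualitative.
    """
    total = 0.0
    for a, b, r in zip(u, v, numeric_ranges):
        if r is None:
            # Qualitative attribute: 0 if the categories match, 1 otherwise
            total += 0.0 if a == b else 1.0
        else:
            # Numeric attribute: absolute difference scaled by the range
            total += abs(a - b) / r if r > 0 else 0.0
    return total / len(numeric_ranges)

# Made-up tournaments: (draw size, category, surface, indoor/outdoor)
t1 = (28, 250, "Hard", "Outdoor")
t2 = (28, 250, "Clay", "Outdoor")
t3 = (128, 2000, "Clay", "Outdoor")
ranges = [100, 1750, None, None]  # ranges of the two numeric attributes

d12 = gower_distance(t1, t2, ranges)
d23 = gower_distance(t2, t3, ranges)
```

Each attribute contributes a value between 0 and 1, so numeric and qualitative columns enter the distance on a comparable scale.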

Here, we have to note that the cosine similarity matrices \(\mathbf {B^{(p)}}\) and \(\mathbf {B^{(t)}}\) are dense matrices, i.e., they have few 0 entries. This puts limits on the admissible values of \(\beta _1\), \(\beta _2\), as the spatial penalty term punishes a partition that separates units with high similarity values but does not punish a partition that puts together units with low similarity values. Consequently, if the weight given to the spatial term is too high, the penalty for separating units becomes too high, and the entire output partition collapses into a single cluster.

2.3 Degree-corrected stochastic blockmodel

We next want to analyse the adjacency structure between players and tournaments as a bipartite network. To better understand the underlying structure of the bipartite player-tournament network, we fit a Degree-Corrected Stochastic Blockmodel (DCSBM) to the raw network, i.e. before extracting the cosine similarity matrices and adding the attributes, using the R package greed. The DCSBM was defined in Karrer and Newman (2011) for the goal of community detection, that is, of finding denser subgraphs inside a large network.

In a DCSBM every vertex is assigned an expected degree and the membership to a cluster. Note that when, as in this paper, we deal with bipartite networks, clusters are defined separately on the left and right side of the network. Nodes of the same cluster are expected to have similar patterns in which neighbours they connect to, while also having the prescribed expected value of the degree. We assign two (crisp) membership matrices \( {\tilde{{\textbf{U}}}}:= \{\tilde{u}_{nc}, n \le N, c\le C\}\) and \( {\tilde{\textbf{W}}}:= \{\tilde{w}_{se}, s \le S, e\le E\}\), such that \(\tilde{u}_{nc} \in \{0,1\}\) for all n, c and \(\sum _{c} \tilde{u}_{nc}=1\) for all n, and similarly \(\tilde{w}_{se} \in \{0,1\}\) for all s, e and \(\sum _{e} \tilde{w}_{se}=1\) for all s. We further define a \(C \times E\) matrix \(\Omega =\{\omega _{ce}\}_{c \le C,e\le E}\) of expected connection intensities between clusters, such that \(\omega _{ce}\ge 0\) for all c, e, and a vector representing the expected degrees of the vertices on both sides \(\textbf{d}:=\{d_l, l\le N+S\}\). Given two vertices \(n \in [N],s\in [S]\), the number of edges between them is represented by the variable \(X_{ns}\) with

$$\begin{aligned} X_{ns} \sim \textsf{Poi}(\lambda _{ns}), \quad \lambda _{ns}:= \sum _{c \le C,e \le E} \tilde{u}_{nc} \tilde{w}_{se} \omega _{ce}d_n d_s. \end{aligned}$$
(6)
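Because the memberships are crisp, the double sum in the Poisson rate above selects a single entry of \(\Omega \) for each player-tournament pair. The following Python sketch computes the matrix of rates under that reading. The parameter values are made up, and the degree vector \(\textbf{d}\) is split into a player part and a tournament part purely for convenience; both choices are assumptions of this illustration.

```python
def dcsbm_rates(player_cluster, tournament_cluster, omega, d_players, d_tournaments):
    """Poisson rates lambda[n][s] = omega[c(n)][e(s)] * d_n * d_s.

    player_cluster[n] and tournament_cluster[s] are crisp cluster indices,
    i.e. the positions of the single 1 in each row of the membership matrices.
    """
    return [
        [omega[c][e] * dn * ds
         for e, ds in zip(tournament_cluster, d_tournaments)]
        for c, dn in zip(player_cluster, d_players)
    ]

# Made-up example: 2 players in clusters 0 and 1, 3 tournaments in clusters 0, 0, 1
player_cluster = [0, 1]
tournament_cluster = [0, 0, 1]
omega = [[0.02, 0.001],
         [0.005, 0.03]]
d_players = [10, 5]
d_tournaments = [4, 6, 2]

rates = dcsbm_rates(player_cluster, tournament_cluster, omega, d_players, d_tournaments)
```

Two players in the same cluster with different expected degrees get proportional rates towards every tournament, which is the degree correction of the model.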

We fit the parameters (both weights and cluster memberships) of this model to the empirical data using the R package greed (Côme and Jouvin 2022). This is done by a variational extension of the expectation-maximization (EM) algorithm. The variational EM algorithm alternates between the optimization of a lower bound on the Integrated Complete-data Likelihood (ICL) of the observed network over the membership matrices \({\tilde{\textbf{U}}},{\tilde{\textbf{W}}}\) for fixed values of the model parameters \(\Omega , \textbf{d}\) (E-step), and over the parameters for fixed values of the membership matrices (M-step). Here we note that, by using a Poisson distribution for the number of edges between two vertices instead of a Bernoulli distribution, we allow for multi-edges even if the original matrix \(\textbf{A}\) is a binary matrix. This choice is made in order to make it feasible to estimate the parameters of the model during the M-step. Indeed, to estimate the distribution of the number of edges between two clusters c and e in the model, we can exploit the fact that independent Poisson random variables have a simple additive structure such that

$$\begin{aligned} \sum _{n=1}^N\sum _{s=1}^S \textsf{Poi}\big (\tilde{u}_{nc}\tilde{w}_{se} \omega _{ce}d_n d_s\big )\sim \textsf{Poi}\Big (\sum _{n=1}^N\sum _{s=1}^S \tilde{u}_{nc}\tilde{w}_{se} \omega _{ce}d_n d_s\Big ). \end{aligned}$$
(7)

The distribution of the sum of a large number of Bernoulli variables with different parameters has instead no tractable representation. Given that in the model studied \(\lambda _{ns} <1.5\) uniformly, and its average \(\overline{\lambda }<0.3\), we expect the Poisson approximation not to distort heavily the results.
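One way to gauge the effect of replacing Bernoulli by Poisson variables is to look at the probability that a Poisson variable produces a multi-edge, i.e. a value of 2 or more, which a binary adjacency matrix cannot contain. The sketch below computes this probability for the two bounds quoted above; it is an illustrative check, not part of the fitting procedure, and the function name is hypothetical.

```python
import math

def prob_multi_edge(lam):
    """P(X >= 2) for X ~ Poisson(lam), i.e. 1 - P(X = 0) - P(X = 1)."""
    return 1.0 - math.exp(-lam) * (1.0 + lam)

p_average = prob_multi_edge(0.3)  # rate equal to the stated bound on the average
p_maximum = prob_multi_edge(1.5)  # rate equal to the stated uniform upper bound
```

For a rate of 0.3 the multi-edge probability is roughly 0.037, whereas for a rate of 1.5 it rises to roughly 0.44, so the approximation is tight for typical pairs and looser only for the few pairs with rates near the upper bound.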

3 Results

In this section, we provide a more detailed overview of the data in Sect. 3.1, and then present the outputs of the classification algorithms in Sect. 3.2. Finally, in Sect. 3.3 we analyse the properties of the clusters obtained, both with respect to the attributes and the network, and compare the results of the two models.

3.1 Descriptive analysis

In this subsection we present in Tables 1 and 2 the descriptive statistics for all the numeric attributes of players and tournaments, respectively. For the players we extracted a total of 21 numeric attributes from the Wheelo rating website. For the tournaments, we got 13 numeric attributes from the Wheelo rating website, which we supplemented with 5 more attributes, 2 numeric and 3 qualitative (Surface, In.Outdoor and Nation), from the ATP website. We report the names of the attributes from the Wheelo rating website as they were shown there, even if some names might be misleading. Several attributes are referred to as “percentage”, which would suggest that they are normalized between 0 and 100; instead, for all of them the normalization is between 0 and 1. As we standardize all numeric variables to have mean 0 and variance 1 anyway, this has no impact on the outcome of the analysis.
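The invariance of the standardized values to the scale of the "percentage" attributes can be checked directly. The sketch below standardizes a made-up attribute expressed once on the 0 to 1 scale and once on the 0 to 100 scale; it uses the population standard deviation, which is an assumption, since the text does not say which estimator was used.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and variance 1 (population variance assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Made-up "percentage" attribute on the 0-1 scale and on the 0-100 scale
share = [0.52, 0.61, 0.47, 0.58]
percent = [100 * v for v in share]

z_share = standardize(share)
z_percent = standardize(percent)
```

Both versions give the same standardized values up to floating-point error, since multiplying an attribute by a positive constant rescales its mean and standard deviation by the same factor.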

Table 1 Mean, Standard Deviation, Maximum and Minimum for each player attribute over all the 136 players considered
Table 2 Mean, Standard Deviation, Maximum and Minimum for each numerical tournament attribute over all the 64 tournaments considered

3.2 Output of the partition algorithms

In this section we show the outputs of the partition algorithms. We use numbers to identify the PAM clusters, and letters to identify the DCSBM clusters. We start by analysing the Fuzzy Spatial Partition Around Medoids. We set the spatial parameters \(\beta _1= 1/10\) and \(\beta _2=1/150\) so that they would be low enough not to cause the partition to collapse into only one cluster. Indeed, the spatial term is defined in the objective function in (4) and (5) so that there is a penalty for assigning adjacent units to different clusters, but not for assigning non-adjacent units to the same cluster. Choosing higher values of \(\beta _1\) and \(\beta _2\) would result in the spatial penalty becoming more important than the contribution from the attributes, and make the optimal solution one in which all units are assigned to the same cluster. We limited ourselves to values of \(m_1,m_2\le 1.2\) to prevent the output of too many fuzzy units and to make the comparison with the partitions coming from the DCSBM, which are crisp by nature, more natural. We used the Fuzzy Silhouette validity index to identify the optimal values of \(m_1,m_2\) and C, E for the fuzzy clustering, obtaining the optimal values of \(m_1=1.15\), \(C=3\) (see Table 3) for the players and \(m_2=1.05\), \(E=2\) (see Table 4) for the tournaments. The partitions obtained using the optimal choices for the number of clusters and the fuzziness parameter are shown in Table 5. In the analysis we consider a unit a member of a cluster if its fuzzy membership to said cluster is above 0.6 for players, or 0.7 for tournaments (cf. Maharaj and D’Urso 2011). Players and tournaments that do not reach these thresholds for any cluster are considered as fuzzy units. The full tables of the fuzzy memberships are presented in the supplementary material.
For the players the optimal clustering produces 3 clusters: cluster 1 (medoid Ruusuvuori) is the largest, containing more than half of the players (74 out of 136), with cluster 2 (medoid Albot) and cluster 3 (medoid Popyrin) containing 27 and 29 players respectively. Six players are classified as fuzzy units. It is interesting to note that we have fuzzy units with all possible combinations of shared memberships: Thiem and Purcell between clusters 1 and 3, Shang and Gasquet between clusters 1 and 2, Kovacevic between clusters 2 and 3 and Bergs among all 3 clusters (see the supplementary material). Tournament clustering instead outputs cluster 1 (medoid Adelaide 2) with 37 units and cluster 2 (medoid Gstaad) with 23 units. Four units (Delray Beach, Houston, Beijing and Stockholm) are considered fuzzy.
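The thresholding rule used to separate cluster members from fuzzy units can be sketched as follows. The membership values are made up for illustration and the function name is hypothetical; clusters are numbered from 1 as in the text, and fuzzy units are marked with None.

```python
def assign_units(U, threshold):
    """Return the cluster (1-based) of each unit, or None for fuzzy units.

    A unit is assigned to its highest-membership cluster only if that
    membership is strictly above the threshold.
    """
    labels = []
    for row in U:
        best = max(range(len(row)), key=lambda c: row[c])
        labels.append(best + 1 if row[best] > threshold else None)
    return labels

# Made-up membership degrees of three players over C = 3 clusters
U = [
    [0.90, 0.05, 0.05],  # clear member of cluster 1
    [0.50, 0.45, 0.05],  # shared between clusters 1 and 2: fuzzy unit
    [0.10, 0.20, 0.70],  # clear member of cluster 3
]
player_labels = assign_units(U, threshold=0.6)
```

With a threshold above 0.5 at most one cluster can exceed it for each unit, so the assignment is unambiguous whenever it exists.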

Fig. 2 Edge densities (number of present edges divided by maximum possible number of edges) between clusters in the DCSBM

Table 3 Fuzzy silhouette of the player clustering for different choices of C and \(m_1\)
Table 4 Fuzzy silhouette of the tournament clustering for different choices of E and \(m_2\)

The Degree-Corrected Stochastic Blockmodel instead outputs 6 clusters, 2 on the player side of the network and 4 on the tournament side, as shown in Fig. 2. Here there is no external validation index for the clustering: the optimization of the ICL is carried out over the number of clusters together with the memberships and parameters. On the player side we have cluster A with 100 players and cluster B with 36. On the tournament side we have the majority of tournaments (42 out of 64) in cluster B, with cluster A counting 15 units, and clusters C and D only 3 and 4, respectively.

Table 5 Cluster membership for players and tournaments using Degree-Corrected Stochastic Blockmodel (BM) and Fuzzy Partition Around Medoids (PAM)

3.3 Labeling of the clusters and comparison of the models

Next, we go deeper into the analysis of the properties of the clusters identified by the algorithms. To understand the intrinsic properties of the clusters found by the Partition Around Medoids (PAM) algorithm we look at the values of the attributes of the medoid players and tournaments. In Fig. 3, we see the normalized values of all attributes for the 3 player medoids, Ruusuvuori (cluster 1), Albot (cluster 2) and Popyrin (cluster 3). We see that cluster 1, which is the largest of the 3, represents some sort of “default” cluster, with its medoid staying close to the global average in almost all the statistics. Given that the cluster contains almost all the top players, the overall results statistics (WinPercentage, PointsWonPercentage, GamesWonPercentage, SetsWonPercentage) are above average. Cluster 2 mostly represents clay court specialists and/or lower level players, with serve statistics and overall results below average. Finally, cluster 3 mostly represents big servers, with statistics related to serve games (AcesPerServiceGame, FirstServiceWonPecentage, BreakPointsSavedPercentage, etc.) having much higher values than the global average.

Similarly, in the tournament clustering we observe that cluster 1, with medoid Adelaide 2, contains mostly hard court and grass tournaments, while cluster 2, with medoid Gstaad, contains mostly clay court tournaments. As expected, the statistics in cluster 1 are much more favourable towards the serving players (see Fig. 4). It has to be noted that for the variables Draw.Size and Category, the value of both medoids is well below the global average. This might look surprising, but it is due to the fact that ATP 250 tournaments with 28 player draws, the lowest values possible, are by far the most common (36 out of 64) and thus the typical values in both clusters.

We also look at how these clusters behave when we investigate their adjacency structure in the matrix \(\textbf{A}\). In Table 6 we observe the edge densities (number of present edges over maximum number of edges possible) between each of the 3 player clusters and each of the 2 tournament clusters found by PAM. Here we do not count the contribution from fuzzy units. As expected, the players from cluster 2, which contains most of the clay court players, compete preferentially in the tournaments from cluster 2, which contains most of the clay court tournaments. On the other hand, players from cluster 1 and particularly from cluster 3 compete mostly in tournaments from cluster 1.
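The edge density between a player cluster and a tournament cluster can be computed as in the following sketch. The adjacency matrix and labels are made up, the function name is hypothetical, and fuzzy units are encoded as None so that they match no cluster and are left out of the count, in line with the description above.

```python
def edge_density(A, player_labels, tournament_labels, c, e):
    """Present edges between player cluster c and tournament cluster e,
    divided by the maximum possible number of edges between them."""
    rows = [n for n, lab in enumerate(player_labels) if lab == c]
    cols = [s for s, lab in enumerate(tournament_labels) if lab == e]
    if not rows or not cols:
        return float("nan")  # density undefined for an empty cluster
    present = sum(A[n][s] for n in rows for s in cols)
    return present / (len(rows) * len(cols))

# Made-up example: 4 players x 3 tournaments; the last player is a fuzzy unit
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
]
player_labels = [1, 1, 2, None]
tournament_labels = [1, 1, 2]

density_11 = edge_density(A, player_labels, tournament_labels, 1, 1)
density_12 = edge_density(A, player_labels, tournament_labels, 1, 2)
density_22 = edge_density(A, player_labels, tournament_labels, 2, 2)
```

A density close to 1 for a pair of clusters means that almost every player of the first cluster entered almost every tournament of the second.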

Fig. 3 Normalized values of the attributes of the player medoids
Fig. 4 Normalized values of the attributes of the tournament medoids
Fig. 5 Means of the normalized values of the attributes over the different clusters of tournaments found by the DCSBM
Fig. 6 Means of the normalized values of the attributes over the different clusters of players found by the DCSBM

Table 6 Edge densities (number of present edges divided by maximum possible number of edges) in \(\textbf{A}\) between the PAM clusters for players (columns) and tournaments (rows)
Table 7 Contingency table between Tournament Clustering by spatial fuzzy PAM (columns) and DCSBM (rows)
Table 8 Contingency table between Player Clustering by spatial fuzzy PAM (columns) and DCSBM (rows)

If we look at the partitions created by the DCSBM, we observe, in particular on the tournament side, that the partition identifies the periods in which the tennis season splits into different groups of tournaments that differ in both geographic location and playing surface. This happens in particular in February (South American clay court tournaments, US hard court tournaments and European indoor tournaments) and July (European clay court tournaments and US hard court tournaments). Clusters C and D are made up of clay court tournaments in these periods, while cluster A contains mostly the alternative hard court tournaments (both indoor and outdoor) that take place in the same weeks. Cluster B contains the other tournaments of the season and in particular all the mandatory tournaments (1000 and Slam). The two clusters of players are identified based on whether they participated in the tournaments in cluster A or in those in clusters C and D, with both groups competing at the same rate in the tournaments from cluster B.

If we look at the average values of the attributes over the clusters found by the DCSBM, we observe that, even though these attributes were not used by the clustering algorithm, their values differ significantly across the clusters. As far as the tournament clusters are concerned, we see from Fig. 5 that most statistics are more favourable to the serving player in cluster A, more favourable to the returning player in clusters C and D, and close to the average in cluster B. This is not surprising, given the surfaces on which the tournaments in the different clusters are played and the fact that cluster B is by far the most numerous. For the measures of importance (Category, Draw.Size), instead, cluster B is above average, while the others are all below average and close to each other. This is due to the fact that cluster B includes all the tournaments of category 1000 and Slam, where participation is mandatory for all those whose ranking is high enough to get into the main draw, and thus players of both clusters A and B participate in them equally. If we look at the player clusters, we see that players from cluster 1 have on average better serve statistics and worse return statistics, as seen in Fig. 6. This is not surprising, given that they played preferentially in hard court tournaments, where the serve on average has a bigger impact.
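To make the comparison of cluster averages concrete, the sketch below computes the mean of the normalized attributes within each cluster, which is the kind of quantity summarized in Figs. 5 and 6. It is only an illustration under stated assumptions: the normalization is taken to be a z-score (so that 0 is the global average), which may differ from the normalization actually used in the figures, and the attribute values and labels are invented toy data.

```python
import numpy as np

def cluster_means_normalized(X, labels):
    """Mean of the z-scored attributes within each cluster.

    Assumption: attributes are normalized as z-scores, so a value of 0
    corresponds to the global average of that attribute.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return {k: Z[labels == k].mean(axis=0) for k in np.unique(labels).tolist()}

# Toy data: columns are a hypothetical serve statistic and a hypothetical
# return statistic for 4 tournaments, two in cluster A and two in cluster C.
X_toy = [[0.60, 0.20],
         [0.70, 0.25],
         [0.40, 0.35],
         [0.30, 0.40]]
toy_means = cluster_means_normalized(X_toy, ["A", "A", "C", "C"])
```

In this toy example the serve statistic is above the global average in cluster A and below it in cluster C, while the return statistic shows the opposite pattern.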

If we compare instead the memberships of players and tournaments given by the two algorithms, we observe that they differ significantly but exhibit some correlations, as shown in Tables 7 and 8. As far as players are concerned, we see that cluster B (players who participate in the clay court seasons in South America and Europe) is made up mostly of players from clusters 1 and 2. This is unsurprising, as big servers are unlikely to choose clay court tournaments. This is reflected in the averages of the player attributes over the blockmodel clusters, with most serve statistics being lower for players in cluster B. Of the 3 players who are in cluster 3 in the fuzzy PAM and in cluster B in the DCSBM, 2 of them, Monteiro and Jarry, are South American and thus likely picked the clay court tournaments in February to play in front of their home crowds in Rio de Janeiro and Santiago (a tournament which Jarry won), respectively. On the tournament side, instead, we observe that clusters C and D are completely contained in cluster 2, being made up only of clay court tournaments, while, as expected, the majority of the tournaments in cluster A are also in cluster 1.
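A contingency table of the kind reported in Tables 7 and 8 simply counts how many units fall into each pair of clusters. The following is a minimal sketch with invented labels; it assumes that each unit has been given a single crisp label by each algorithm (for the fuzzy PAM, for instance, the cluster with the largest membership degree, which is an assumption made here for illustration).

```python
from collections import Counter

def contingency_table(row_labels, col_labels):
    """Count the units falling in each (row cluster, column cluster) pair."""
    counts = Counter(zip(row_labels, col_labels))
    rows = sorted(set(row_labels))
    cols = sorted(set(col_labels))
    # Counter returns 0 for pairs that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Toy labels for 5 players: DCSBM clusters on the rows, PAM clusters on
# the columns, matching the layout of Tables 7 and 8.
dcsbm_toy = ["A", "A", "B", "B", "B"]
pam_toy = [1, 3, 1, 2, 2]
toy_rows, toy_cols, toy_table = contingency_table(dcsbm_toy, pam_toy)
```

Each row of the resulting table sums to the size of the corresponding DCSBM cluster, and each column to the size of the corresponding PAM cluster.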

4 Conclusions

The clustering model proposed in the paper addresses some relevant issues in clustering tennis players and tournaments: (i) it considers players, tournaments and the relation between them; (ii) the relation is taken into account in the Partitioning Around Medoids (PAM) algorithm; (iii) the attributes of the players and of the tournaments are of different natures, qualitative and quantitative.

The paper fills a gap in the use of clustering in tennis. The proposal is novel in the methodology used: a spatial fuzzy PAM clustering model for players and for tournaments (based on their attributes), where the model is optimized independently to find the player and tournament partitions, and the spatial penalty term in each clustering model depends on the relation between players and tournaments described in the adjacency matrix. The proposed model is compared with a clustering model based on a bipartite player-tournament complex network that considers only the relation between players and tournaments, described in the adjacency matrix, and obtains communities on each side of the bipartite network by fitting a Degree-Corrected Stochastic Blockmodel (DCSBM) to the data.

An application to data taken from the ATP official website for the draws of the tournaments, and from the sport statistics website Wheelo ratings for the performance data of players and tournaments, illustrates the performance of the proposed clustering model.

The two models differ substantially both in the form in which the data are fed to them and in the way in which the optimization is carried out. The PAM uses data about both the attributes and the adjacency structure, processes the data separately for players and tournaments, and finds the optimal number of fuzzy clusters via an a posteriori validity index. The DCSBM only uses adjacency data, optimizes a joint partition of players and tournaments, and finds the optimal number of crisp clusters at the same time as the memberships and parameters of the model. For these reasons, the two algorithms shed light on different aspects of the data, sometimes confirming each other’s outputs and sometimes highlighting something the other algorithm could not capture.

Future developments involve integrating the adjacency matrix more directly into the optimization procedure of the fuzzy PAM, rather than as a spatial penalty term, in comparison with recent proposals for community detection in complex networks that exploit both available sources of information: the network structure, and the features and attributes of the nodes.