Introduction

Sports analytics has experienced rapid growth both for teams and individual players in recent years. Statistics tools can be particularly useful to assess performances and define efficient gaming strategies. In this paper, we focus on basketball, a sports pioneer in using analytic tools, and data from the National Basketball Association (NBA).

As far as the literature on statistics in basketball is concerned, many of the proposed approaches can be traced back to the Bill James’s Sabermetrics approach in baseball (James, 1984, 1987). The seminal work is, however, considered that of Oliver (2004): he identifies the so-called “Four Factors” that affect the game outcome with different weights and they are: the Effective Field Goal percentage, the Turnovers per Possession, the Offensive Rebounding percentage and the Free Throw Attempt rate, analysed later with more detail.

Kubatko et al. (2007) give a common starting point for research in basketball by providing a detailed overview and description of the main sources of data for basketball statistics; they also define the main basic variables used in what described as the “mainstream of basketball statistics”, firstly emerged outside the academy and due to Oliver (2004), Hollinger (2004), Hollinger and Hollinger (2005). In Nikolaidis (2015) one can find some indicative ideas for the statistical analysis of basketball data; he also shows that any basketball team can improve significantly its decision-making process if it relies on statistical support.

Yang et al. (2014) evaluate the efficiency of National Basketball Association (NBA) teams under a two-stage DEA (Data Envelopment Analysis) framework. Applying the additive efficiency approach, they decompose the overall team efficiency into first-stage wage efficiency and second-stage on-court efficiency, finding out the individual endogenous weights for each stage.

Other recent scientific proposals to model player and team performances make use of network models (Piette et al., 2011; Fewell et al., 2012; Skinner & Guy, 2015; Xin et al., 2017). In the class of models accounting for spatial features, Metulini et al. (2018) try to identify the pattern of surface area in basketball that improves team performance while Zuccolotto et al. (2019) propose a spatial statistical method based on classification trees to assess the scoring probability of teams and players in different areas of a court map.

Sandholtz et al. (2020) propose a new metric for spatial allocative efficiency: they estimate the player’s field goal percentage at every location in the offensive half court using a Bayesian hierarchical model. Then, comparing the estimated field goal percentage with the empirical field goal attempt rate, they identify the areas where the lineup exhibits inefficient shot allocation.

As far as the individual performances are concerned, Fearnhead and Taylor (2011) define a proper statistical methodology to estimating the ability of NBA players by comparing their team performance when they are on the court with respect to when they are off it. Lopez and Matthews (2015) propose an NCAA men’s basketball predictive model while is due to Cervone et al. (2016) a multiresolution stochastic process model for predicting basketball possession outcomes. Engelmann (2017) analyses the possession-based player performance too while Wu and Bornn (2018) model the offensive player movement in professional basketball.

Among the regression-based approach, Deshpande and Jensen (2016) propose a Bayesian linear regression model to estimate and evaluate the impact NBA players have on their teams’ chances of winning while Sill (2010) shows that the Adjusted Plus-Minus (often abbreviated APM, a basketball analytic used to predict the impact of an individual player on the scoring margin of a game after controlling for the strength of every teammate and every opponent during each minute he’s on the court) can be improved by using ridge regression.

Other alternatives based on Bayesian regression models are in Page et al. (2007) and Deshpande and Jensen (2016) while in the work of Casals and Martinez (2013), a mixed model is proposed and applied to model player performances.

In the field of neural networks, one can refer to the works of Loeffelholz et al. (2009), Blaikie et al. (2011), Wang and Zemel (2016), Shen et al. (2020), Babaee Khobdeh et al. (2021). For an extensive literature on the use of Statistics in basketball, an interesting review is provided by Terner and Franks (2021).

But, as stated by Baghal (2012), there is a lack of attention to applying multivariate analyses on basketball metrics even though it is well known that winning a game depends on a plurality of interrelated variables affecting the game outcome each with a different weight. To this aim, in the same work, Baghal (2012) uses a structural equation model to determine whether the Four Factors can be modelled as indicators of individual latent factors for the offensive and defensive quality. Team salaries are added to the model too in order to estimate the relationship between the cost and offensive and defensive quality.

With the same aim of enlarging the analysis by considering a multivariate approach, we propose the use of Bayesian networks (BNs) (Pearl, 1988), i.e. probabilistic expert systems able to identify complex dependence structures, such as that can exist among the set of performance indicators, and to carry out what-if analysis, at the same time.

Based on our knowledge, the use of BN in sports is rather limited to very few works (Razali et al., 2017; Constantinou et al., 2013), the latter using BNs with other scopes.

As deeply discussed later, the attractiveness of a BN mainly stems both from its pictorial representation of the dependency structure through the graph topology and from its inference engine; the latter makes a BN a decision support system since the reasoning (inference) is performed by setting some variables in known states and then updating the probabilities of the remaining variables conditioned by this evidence. The so-called belief updating can thus help to identify the most impactful variables and suggest the best strategies in different contexts.

From a methodological point of view, this proposal copes with the problem of handling non-Gaussian distributions, avoiding the bad practice of variables’ discretization. Therefore, a suitable extension to the Gaussian copula is considered to estimate the network structure under the above assumption, relaxing Guassianity. Specifically, the Non-Parametric BNs (NPBNs) proposed by Hanea et al. (2006) and Kurowicka and Cooke (2006) are applied in this paper.

Based on all these considerations, this work aims at introducing to the use of BNs in sports analytics; their versatility and their natural vocation to be used for prediction purposes could represent a valid instrument to predict team and player performance, to suggest strategies as well as a valid tool for decision-making in sports.

Specifically, we use BNs to identify the main determinants affecting the player winning percentage taking into account the most important performance indicators suggested in the literature. Moreover, we propose a model that combines information from data and experts which represents an important added value in this context.

Data used in this work have been downloaded from the NBA website and refer to 377 players who played at least 10 games and more than 6 min on the court in the regular season 2019–2020.

The paper is organized as follows. In Sect. 2 we provide a theoretical description of BNs and their structural learning methods followed by a brief discussion about the limits of the discretization; in Sect. 3 we introduce the class of Non-Parametric BNs followed by a little digression on the Nonparanormal estimator proposed by Liu et al. (2009) being the Gaussian copula a specific case of the class of Nonparanormal models. After the description of variables used in this study in Sect. 4, we discuss the estimated model in Sect. 5. The last section includes some comments and conclusions.

Basics on Bayesian networks

Bayesian Networks (Cowell et al., 1999; Pearl, 1988) are probabilistic graphical models used to deal with uncertainty in complex probabilistic domains. They bring together graph theory and probability theory to model multivariate dependence relationships among a large collection of random variables. It consists of a qualitative component, in the form of a directed acyclic graph (DAG), and of a quantitative component, in the form of conditional probabilities associated to each node of the DAG. Hence, they can be defined as multivariate statistical models satisfying sets of (conditional) independence statements encoded in a DAG.

More formally, given a random vector \(\mathbf {X}=\lbrace X_{1},\ldots , X_{k}\rbrace \) with distribution P, each node in the DAG corresponds to a random variable \(X_{i}\), resulting in a graph with k vertices and a set of directed edges between pairs of nodes; hence, a DAG G is a pair \(G = (V, E)\), where V is the set of vertices (or nodes) and E the set of directed edges. While each node corresponds to a random variable, a missing arrow between nodes imply (conditional) independence.

Specifically, the conditional independence properties of a joint distribution P can be read directly from the graph by applying the rules defined in the so-called d-separation criterion introduced by Pearl (1988) or the directed global Markov criterion due to Lauritzen et al. (1990), that is an equivalent criterion to that of the d-separation.

Based on Pearl’s criterion, given three disjoint subsets of nodes \(\mathbf {X}\),\(\mathbf {Y}\) and \(\mathbf {Z}\) in a DAG G, \(\mathbf {Z}\) is said to d-separate \(\mathbf {X}\) from \(\mathbf {Y}\), if for all paths between \(\mathbf {X}\) and \(\mathbf {Y}\) there is a vertex w such that either:

  • the connection is serial or diverging and \(w \in \mathbf {Z}\) (it is equivalent to say that w is instantiated);

  • the connection is converging, and neither w nor any of its descendants (i.e. the nodes that can be reached from w) are in \(\mathbf {Z}\).

A serial connection is of the type \(x \longrightarrow w \longrightarrow y\) (or \(x \longleftarrow w \longleftarrow y\)), a diverging connection is \(x\longleftarrow w \longrightarrow y\) while a converging one is \(x\longrightarrow w \longleftarrow y\). Clearly, if \(\mathbf {X}\) and \(\mathbf {Y}\) are not d-separated, they are d-connected.

In the graph in the Fig. 1, for example, \(X_{3}\) d-separates \(X_{1}\) and \(X_{2}\) from \(X_{4}\), implying the conditional independence among the same corresponding variables in P, i.e. \((X_{1},X_{2})\perp X_{4}| X_{3}\).

Fig. 1
figure 1

An example of DAG with 5 vertices

A directed edge from \(X_{i}\) to \(X_{j}\) states that \(X_{j}\) is influenced by \(X_{i}\): using graph terminology, \(X_{j}\) is said the child of \(X_{i}\) that, in turn, is said the parent of \(X_{j}\). The set of parents of \(X_{j}\) in the graph G is denoted by \(pa(X_{j})\). In the DAG of Fig. 1, for example, \(pa(X_{3})=\lbrace X_{1}, X_{2}\rbrace \).

Moreover, with each variable \(X_{i}\) in the DAG, is associated a conditional probability \(Pr( X_{i}| pa(X_{i}))\); all set of these conditional probabilities constitute the quantitative part of the network.

Based on the local Markov propertyFootnote 1 and applying the chain rule, the joint distribution P over the random vector \(\mathbf {X}\) can be factorized as:

$$\begin{aligned} P(X_{1}\ldots X_{k})= \prod _{i=1}^{k} P(X_{i}|pa(X_{i})) \end{aligned}$$
(1)

Besides its ability to decompose a multivariate distribution in a set of local computations, the power of Bayesian networks stems also from the availability of fast and efficient algorithms that perform probabilistic inference (Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990). They update the marginal probability distributions of some variables when the information on one or more distributions of other variables is observed, thus allowing to carry out what-if analysis.

The structure of a BN (i.e the DAG), as well as its parameters, can be defined by means of the expert knowledge or learned directly from data using structural learning techniques (Buntine, 1996; Neapolitan, 2003); in practice, the combination of both approaches is, however, recommended, introducing the prior information as constraints in the first learning phase. The next paragraph provides a little overview of the main methods proposed in the literature to learn the DAG structure.

Structural learning in BNs

The structural learning techniques for BNs fall into one of these two mainstream approaches: constraint-based (Spirtes et al., 1993; Cooper, 1997) and score-based (Heckerman, 1995; Cooper & Herskovits, 1992) algorithms.

A DAG learnt by means of a constraint-based algorithm entails all the marginal and conditional independence relationships inferred from independence tests: the most famous algorithm in this class is the PC algorithm due to Spirtes et al. (1993).

A DAG learnt by means of a score-based approach is the result of an optimization search strategy based on a scoring function. The graph structure is that which corresponds to the optimal score metric (the Penalised Likelihoods, AIC and BIC, are frequently used) among candidate structures, hence, identifying the better fitting network.

It must be noticed that hybrid methods have also been proposed in the literature, combining both constraint-based and score-based techniques. The two best-known members of this family are the Sparse Candidate algorithm (SC) by Friedman et al. (1999b) and the Max-Min Hill-Climbing algorithm (MMHC) by Tsamardinos et al. (2006).

In this work, we apply a score-based algorithm known as the hill climbing algorithm, which is a heuristic optimization procedure given that the investigation of the space of all possible DAGs is infeasible in practice by growing exponentially with the number of nodes.

The hill climbing uses a greedy search strategy: generally starting from a network structure (more often the empty one), it iteratively performs small changes in the current candidate network by adding, deleting or reversing one arc at a time, stopping when any of changes improves the chosen score function.

In this study, specifically, the hill climbing with bootstrap resampling and model averaging is applied as proposed by Friedman et al. (1999a) and implemented in the R package bnlearn. More robust to noise in the data, this technique consists of sampling data using bootstrap n times and applying the chosen learning algorithm to each bootstrap sample. Then, the resulting averaged network is a graph whose edges are those with a strength greater than a threshold value (see Scutari and Nagarajan (2011) for details about the computation of the strength significance threshold) where the strength of each arc is defined in terms of the number of times the arc appears over the n learned networks.Footnote 2

Despite the broad fields of applicability, in the parametric setting, most inference and structural learning methods work under the joint multinomial assumption for discrete data and the joint normality for continuous data. In the case of mixed variables, the conditional Gaussian distribution (CLGBNs; Lauritzen and Wermuth (1989)) is assumed with the constraint that continuous nodes can not have discrete children.

Focusing on the continuous case, which is of interest in this work, in the Gaussian BNs (GBNs, Geiger and Heckerman (1994)), a multivariate normal distribution is assumed for \(\mathbf {X}\) implying that each component \(X_{i}\) follows a univariate Gaussian distribution too. Therefore,

$$\begin{aligned} f(X_{i}|pa(X_{i})=\mathbf {x})\sim N(\alpha _{i}+{\varvec{\beta }}^T \mathbf {x};\sigma ^{2}_{X_{i}}). \end{aligned}$$
(2)

Hence, the parameters of the local distributions are the \(\beta \) coefficients of the linear regression model in (2). The influence of the parents on a child is translated in terms of these partial regression coefficients, meaning that their conditioning effect is given by the additive linear term in the mean without affecting the variance.

The main drawback associated with the use of the Gaussian BNs. relates to the joint normality which may be a very strong assumption, very often unrealistic; in the next paragraph, we shall examine in detail this issue.

The joint normality assumption and its limits

When the marginal distributions of the continuous variables of the domain are far from normal, the common practice consists in their discretization; very often, it results in loss of information and, in some cases, can destroy the true underlying dependence structure. To this aim, Nojavan et al. (2017) designed an experiment to evaluate the impact of the commonly used discretization methods for BNs showing that different methods result in differing future predictions. They state that “we would caution against decisions based on models for which the outputs vary by the choice (of discretization method) that does not have justifiable scientific basis” arguing that, in general, when possible, discretization of continuous variables should be avoided.

As highlighted by Rohmer (2020), a second drawback is computational since one cannot ignore that the size of conditional probability tables grows exponentially as the number of states of the involved nodes increases.

The class of Non-Parametric BNs (NPBNs) proposed in the literature by Kurowicka and Cooke (2006) and Hanea et al. (2006) and based on a D-vine copula approach can overcome these limits. Joint density is modelled using joint normal copula without requiring any parametric assumption on the marginals, as opposed to the parametric case, which justifies the use of the adjective “Non-Parametric”.

Other interesting works combining the distributional flexibility of pair-copula constructions (PCCs) with the parsimony of the conditional independence models associated with the DAG, can be found in Bauer et al. (2012), Hobæk Haff et al. (2016), Bauer and Czado (2016), Pircalabelu et al. (2017).

Is due to Harris and Drton (2013) the proposal of the Rank PC Algorithm for Nonparanormal Graphical Models; properly, they extend the well-known PC structural learning algorithm for Gaussian BNs (Spirtes et al., 1993) to the case of Gaussian copula.

Elidan (2010) proposes the Copula Bayesian Networks (CBNs): as in BNs, they make use of a graph to encode independencies but rely on local copula functions and an explicit globally shared parameterization of the univariate densities. Therefore, joint density is constructed via a composition of local copulas and marginal densities. A generalization of the CBNs to hybrid BNs can be found in Karra and Mili (2016).

In the class of the undirected graphical models, Liu et al. (2009) introduce the family of Nonparanormal distributions to relax the Gaussian assumption proposing two different methods of estimation for a nonparametric undirected graphical model.

In our work, under the assumption of a Gaussian Copula, we estimate a NPBN to identify the main determinants affecting the player winning percentage considering several important performance indicators in basketball. We use the theoretical framework of NPBNs proposed by Kurowicka and Cooke (2006) except for the structural learning phase, i.e the estimation of the DAG: to learn the graph, we apply the Hill Climbing algorithm to a suitable transformation of the original random variables in normal scores as suggested by Liu et al. (2009) and defined more in detail later.

In the next section, we introduce the theoretical framework of NPBNs.

Non-parametric Bayesian networks

The Non-Parametric Bayesian Networks, proposed in the literature by Hanea et al. (2006) and Kurowicka and Cooke (2006), are BNs whose dependence structure is modelled through a Gaussian Copula. The term “Non-Parametric” has been referred to the marginals since no distributional assumption is strictly required, i.e. the empirical distributions could be used.

For further interesting readings we refer to Hanea et al. (2010) and Kurowicka and Cooke (2010).Footnote 3

In a general setting, each node corresponds to a quantitative variable with an arbitrary invertible distribution function while the directed edges are associated with (conditional) parent–child rank correlations computed under the assumption of a joint normal copula. These (conditional) copulae, assigned to the arcs of the NPBN, depend on a (non-unique) ordering of the parent nodes. We highlight that the ordering of parent nodes can be assessed by the experts considering the decreasing strength of influence on the child, based on data availability or other qualitative considerations.

The main result arises from the following theorem (Hanea et al., 2006):

  • given k variables \(X_{1},\ldots ,X_{k}\) with invertible distribution functions \(F_{1},\ldots ,F_{k}\);

  • given the corresponding DAG with k nodes entailing all conditional independence relationships;

  • given a specification of the conditional rank correlations on the arcs of the NPBN;

  • given a copula realizing all correlations \([-1,1]\) satisfying the so-called zero-independence propertyFootnote 4;

the joint density over the k variables is (1) uniquely determined, (2) satisfies the BN factorization and (3) the conditional rank correlations are algebraically independent.

Moreover, the zero independence property implies that the conditional independence statements can be translated in the graph, assigning zero rank correlation to the corresponding arc, i.e. the arc can be removed.

The NPBNs specification also implies that they require only the definition of the one-dimensional marginal distributions and the arbitrary (conditional) rank correlations associated with the arcs of the BN to be quantified and well specified.

More important, the Gaussian copula has been chosen not only because it satisfies the zero independence property but also because it allows to perform conditioning analytically. This allows for extremely fast exact inference, therefore, speeding the probabilistic inference process in complex domains.

More specifically, let \(X_{1}\) and \(X_{2}\) be two continuous random variables, with invertible distribution functions \(F_{1}\) and \(F_{2}\) while \(Y_{1}\) and \(Y_{2}\) are the corresponding transformation to standard normal variables. The conditional distribution of \(X_{1}|X_{2}\) is obtained directly applying \(F_{1}^{-1}(\varPhi (Y_{1}|Y_{2}))\).

This inference engine performing conditioning analytically is efficiently implemented in the UniNet software.

For sake of completeness, one needs to make a little digression on the validation procedure for NPBNs proposed by Hanea et al. (2006), Hanea et al. (2010) and Joe and Kurowicka (2011) to test the Gaussian Copula assumption.

To validate that the joint normal copula distribution is a valid model for the multivariate data, they defined a new measure of multivariate dependence equal to the determinant of the rank correlation matrix. Lying in [0, 1], it is equal to 1 if all variables are independent, to 0 in the case of multivariate linear dependence. All values in such interval take into account the different degree of dependence.

Two determinants have to be computed: the DER, the determinant of the empirical rank correlation matrix and the DNR, the determinant of the empirical normal rank correlation matrix computed on the transformed variables (that are standard normals) applying the inverse Pearson transformation such that \(\rho _{s}=\frac{6}{\pi } arcsin(\frac{\rho }{2})\) where \(\rho \) is the Pearson correlation coefficient while \(\rho _{s}\) the Spearman one.

This validation step is carried out by simulating the sampling distribution of DNR and checking whether DER is within the \(90\%\) central confidence band of DNR; if yes, the Gaussian Copula assumption cannot be rejected at the \(10\%\) significance level. This validation procedure is also used in this work.

The DAG selection in Uninet can be done by removing from the saturated graph those edges with a small conditional rank correlationFootnote 5; if the DNR falls within the 90% central confidence band of the determinant of the rank correlation matrix (computed under the assumption of joint normal copula) of the selected non-saturated BN, the model adequately approximates the complete graph and could be retained.

Since in the above model selection the arcs’ directions are assigned based on non-statistical considerations, such as subject-matter knowledge and time/logical ordering, in this paper, the structural learning phase is replaced by the following procedure: the hill climbing algorithm with bootstrap resampling and model averaging is applied to a suitable transformation in normal scores of the original variables as proposed by Liu et al. (2009) and defined in the next paragraph.

We point out that the parameters’ learning phase is based on the procedure described for the NPBNs, thus based on the conditional rank correlations.

The nonparanormal estimator proposed by Liu et al. (2009)

Following Liu et al. (2009), the Nonparanormal models extend Gaussian models to semiparametric Gaussian copula models, with nonparametric marginals. Formally:

Definition 1

a random vector \(\mathbf {X}=(X_{1},\ldots ,X_{k})^{T}\) has a Nonparanormal distribution if there exist functions \({\lbrace {f_{j}\rbrace }}_{j=1}^{k}\) such that \(\mathbf {Z}\equiv f(\mathbf {X})\sim N({\varvec{\mu ,\Sigma }})\), where \(f(\mathbf {X})=( f_1(X_1), \ldots , f_k(X_k))\). Then, \(\mathbf {X} \sim NPN \sim N({\varvec{\mu ,\Sigma }},f)\).

In another equivalent way, let \({\lbrace {f_{j}\rbrace }}_{j=1}^{k}\) be a set of monotone univariate functions and let \(\varvec{\Sigma }\) be a positive-definite correlation matrix. Then a k-dimensional random variable \(\mathbf {X}=(X_{1},\ldots ,X_{k})^{T}\) has a Nonparanormal distribution \(\mathbf {X} \sim NPN \sim N({\varvec{\mu ,\Sigma }},f)\) if \(f(\mathbf {X}):=( f_1(X_1), \ldots , f_k(X_k))\sim N_k(0,{\varvec{\Sigma }})\).

Therefore, a random vector \(\mathbf {X}\) belongs to the Nonparanormal family if there exists a set of monotonically increasing transformations functions \({\lbrace {f_{j}\rbrace }}_{j=1}^{k}\) such that \(f(\mathbf {X})=( f_1(X_1), \ldots , f_k(X_k))^T\) is Gaussian.

Moreover, as a lemma, Liu et al. (2009) proved that the Nonparanormal distribution \(NPN({\varvec{\mu ,\Sigma }},f)\) is a Gaussian copula when the \(f_j\)’s are monotone and differentiable.

A little digression is needed to remember that, in the classical Gaussian Graphical Models, one assumes that the observations have a multivariate Gaussian distribution with mean \(\varvec{\mu }\), and covariance matrix \(\varvec{\Sigma }\). More importantly, the conditional independence can be implied by the inverse covariance (concentration) matrix \(\varvec{\Omega }=\varvec{\Sigma }^{-1}\). Indeed, if \(\Omega _{jk} = 0\), then the i-th variable and j-th variable are conditional independent given all other variables. In other words, the graph G is encoded by the precision matrix.

The following Lemma (Liu et al., 2009) actually states that it also holds for the Nonparanormal graphical models.

Lemma 1

If \(\mathbf {X} \sim NPN (\varvec{\mu },\varvec{\Sigma },f)\) is Nonparanormal and each \(f_{j}\) is differentiable, then \(X_{i} \perp \!\!\!\perp X_{j}|X_{\backslash \lbrace i, j\rbrace }\, \Leftrightarrow \Omega _{ij}=0, \quad where \quad \varvec{\Omega }=\varvec{\Sigma }^{-1}\).

The important point to note is that, although the NPN identifies a more flexible and broader class of distributions than the parametric one, the independence relations among the variables are still encoded in the precision matrix \(\varvec{\Sigma }^{-1}\).

For the scopes of Liu et al. (2009), the primary objective of the Nonparanormal model (NPN) is to estimate the underlying sample covariance matrix for a better recovery of the underlying undirected graph. Therefore, the random variable \(\mathbf {X}=(X_{1},\ldots ,X_{k})\) is replaced by the transformed random variable \(f(\mathbf {X})=( f_1(X_1), \ldots , f_k(X_k))\) assuming that \(f(\mathbf {X})\) is multivariate Gaussian.

Table 1 Basketball variables

Based on the above theoretical framework, the Nonparanormal estimator proposed by Liu et al. (2009) is based on a two-step procedure: (1) replace the observations, for each variable, by their respective normal scores, i.e. they estimate the transformation functions; (2) apply the graphical lasso to the transformed data to estimate the undirected graph.

In the step (1), as a Non-parametric estimator for the functions \(\lbrace {f_{j}\rbrace }\), Liu et al. (2009) proposed the normal score-based Nonparanormal estimator, i.e. a truncated empirical distribution (the so called Winsorized estimator) whose computation is done by using the R package huge.

In detail, let \(x_{1},\ldots ,x_{n}\) be n observations of the X variable and let \(u_{1},\ldots ,u_{n}\) be the corresponding rank; denoting with \(\varPhi \) the normal CDF function, the truncated normal score \(f_{i}\) is estimated by:

$$\begin{aligned} \hat{f_{i}}=\left\{ \begin{array}{ll} \varPhi ^{-1}(\delta ) &{}\quad if \,\,\hat{u_{i}}\le \delta \\ \varPhi ^{-1}(\hat{u_{i}}) &{}\quad if \,\, \delta < \hat{u_{i}}\le 1-\delta \\ \varPhi ^{-1}(1-\delta ) &{}\quad if \,\, \hat{u_{i}}> 1-\delta \\ \end{array} \right. \end{aligned}$$

where \(\hat{u_{i}}= \frac{u_{i}}{n+1}\) and \(\delta =\frac{1}{4n^{1/4} \sqrt{\pi log(n)}}\).

For further mathematical details, see Liu et al. (2009, 2012).

Following Liu et al. (2009), we replace the observations with their respective normal scores; in this way, we can apply to them the Hill climbing algorithm as for the Normal case.

In the next sections, we describe the variables used in the analysis and, then, apply the proposed learning strategy to them.

Data description

The variables used in this study were downloaded from the NBA website and listed in Table 1. They refer to 377 players, all those who played at least 10 games and at least 6 min per game in the 2019–2020 regular season.

Thanks to Oliver’s studies (Oliver, 2004), in the Basketball, particular attention has been given to the so-called “Four Factors”: the effective Field Goal percentage, the Turnovers per Possession, the Offensive Rebounding percentage and the Free Throw Attempt rate.

These measures are considered good proxies for the overall offensive or defensive performance and are, in general, not all equivalent. Indeed, in his studies, Oliver assigned a weight to each factor so that the effective Field Goal percentage is the most important followed by the turnovers, the rebounding, and the free throws.

Because of their importance, they have included in our work. In detail, the effective field goal percentage (EFG) is the field goal percentage adjusting for made 3-point field goals, i.e. accounts for made three-pointers and serves to catch a player’s (or team’s) shooting efficiency from the field.

The turnovers per possession (To_Ratio) is the number of turnovers a player (or a team) averages per 100 possessions used.

The Offensive Rebounding percentage (OREB) is the percentage of available offensive rebounds a player obtains while on the floor; players, as well as teams, by obtaining an offensive rebound, can have one more attempt to make a basket that, in turn, can rise the winning probability.

The Free Throw Attempt rate (FTA_RATE) is the number of free throws made by a player in comparison to the number of field goal attempts; it expresses at the same time the player’s ability to get to the foul line and to make foul shots.

In addition, the Defensive Rebound percentage (DREB) is also considered, that is the percentage of available defensive rebounds a player obtains while on the floor.

We take into account the offensive and defensive ratings (OFFRTG and DEFRTG) too; they could be seen as measures of the player’s efficiency per possession. In particular, OFFRTG is the number of points per 100 possessions that the team scores while that individual player is on the court while DEFRTG is the number of points per 100 possessions that the team allows while that individual player is on the court.

We also include the Assist percentage (AST), the percentage of a team’s assists that a player has while on the court and the Assist Ratio (AST_RATIO), i.e. the number of assists a player averages per 100 possessions used.

Moreover, we take into account the age of the player (AGE), the Minutes played by a player (MIN) as well the number of possession per 48 min (PACE) and the usage percentage (USG), i.e. the percentage of team plays used by a player when they are on the floor. At last, the variable that refers to the wins over the played games (WIN) is also included.

The estimated network

Based on the previously described variables and 1.000 simulations, we cannot reject, at the 10\(\%\) level, the hypothesis that data were generated from the joint normal copula since the determinant of the empirical rank correlation matrix (DER), being equal to 0.00179533, falls within the \(90\%\) central confidence interval for the determinant of the normal rank correlation matrix (DNR) [0.0014418, 0.0029218].

Under this assumption, the Bayesian network is estimated based on the model selection procedure described in the theoretical Sections, i.e., first of all, by transforming the variables into normal scores with Winsorized truncation and then running the Hill Climbing algorithm fixing the number of bootstrap samples at 500 and using the AIC criterion.

The threshold value is set to 0.70; in this way, only the arcs appearing at least the \(70\%\) of times are retained. This high value allows to define a parsimonious model avoiding that in the parameters’ learning phase too many arcs are associated with a conditional rank correlation close to 0.Footnote 6

We argue that the model selection procedure is mainly data-driven even if some arcs’ direction are forbidden based on logical constraints.

Given the DAG structure, the conditional correlations associated with the arcs are then estimated based on the Gaussian copula assumption through the Uninet software.

The estimated NPBN is shown in Fig. 2a. The number associated with each arc is the conditional normal rank correlation, remembering that the rank correlation between the node and its first parent is unconditional, all the others are conditioned to the previous parents’ nodes. The ordering between parents is meaningful; hence, for each node, it has been set based on expert knowledge.

Fig. 2
figure 2figure 2

The estimated averaged BN under the assumption of Gaussian Copula, using the Uninet software representation

In the Fig. 2b, the same network is reported whose nodes are replaced by monitors showing the associated empirical marginal distributions (with the indication of the mean and the standard deviation) estimated from data.

The estimated network describes the complexity of the multivariate dependence structure discovering the conditional independence relationships among the performance indicators.

In particular, the concept of the Markov blanket can be useful to identify the main variables that influence, for example, the player’s winning percentage. By definition, the Markov blanket of a node X, denoted by MB(X), includes all the parents, children and parents of children of X. It contains all information needed to infer a target variable X, making all variables not belonging to it redundant. Regarding the target node “WIN”, its Markov Blanket is composed of all nodes coloured blue in Fig. 2.

In terms of conditional independence, this implies that the winning percentage and the green variables are conditionally independent, given the blue ones.

This means that the player winning percentage is completely determined by the player offensive and defensive ratings (OFFRTG and DEFRTG), with nearly the same rank correlation even if with opposite sign,Footnote 7 by the possession (PACE) and by the minutes staying on the floor (MIN). No other information is needed to predict winning percentage if those related to the above variables are available.

This is in accordance with the SEM model proposed by Baghal (2012), whose two latent factors affecting winning percentage were the offensive and defensive quality factors influenced, in turn, by the Four Factors. This is also verified in the BN since the Four Factors affect the winning percentage by means of indirect paths; indeed, unless FTA_RATE, all the other three are parent nodes of the OFFRTG. Moreover, from the estimated network, the important role of the effective field percentage is confirmed as the main factor influencing the offensive quality.

In general, all the discovered conditional relationships well match with those expected or defined in the literature.

In the following, in order to use the BN for inference purposes, we simulated the following scenario focusing on the importance of the individual Offensive and Defensive Rating, i.e. the player’s offensive production and the player’s defensive production. Taking into account an average player, who plays 26 min and is 27 years old, the winning percentage is 49.5% (see Fig. 3a).

Suppose we have an OFFRTG of 10% higher than the mean, thus setting the value of the OFFRTG node at 121 (see Fig. 3b), the mean value of the WIN node goes from 49.5 to \(69.2\%\). Suppose now that we decrease the mean value of the DEFRTG node by 10%, fixing the value of the node at 99 (see Fig. 3c), the mean value of the WIN node rises to \(61.7\%\). It means that a 10% increase in offensive quality implies a 40% relative increase in the winning percentage while the same variation in terms of defensive quality implies a \(25\%\) relative increase in the winning percentage. Therefore, the quality of the player offensive production seems to have a stronger impact on the winning percentage.

By looking at the monitors, the effects on the whole distribution can be seen as they allow the visualization of both the original unconditional distribution (showed in grey) and the new conditional distribution (showed in black).

Fig. 3
figure 3figure 3

First simulated scenario

In the second scenario, we vary the mean value of the Four Factors by 10%, given the same average player. This combination leads to a relative increase in the winning percentage of \(7\%\) (see Fig. 4 compared to Fig. 3a), all other factors being equal, the DEFRTG above all.

Fig. 4
figure 4

Second simulated scenario

Fig. 5
figure 5

Third simulated scenario

More interesting could be the third simulated scenario in which we follow a bottom-up approach. This time, we fix the expected high level of winning percentage at \(75\%\) in order to know what are the minimum levels of the other factors that guarantee to reach the above percentage (see Fig. 5 compared with Fig. 3a). Amongst them, the main evidence relates to the marked increase of Net Rating, i.e. the difference between OFFRTG and DEFRTG Obviously, the other indicators also change accordingly and can be investigated by looking at the change in their mean values (and in their distributions) compared to the baseline ones in Fig. 3a.

In the fourth scenario we investigate the role of the PACE variable. Being equal the values of OFFRTG and DEFRTG (see Fig. 6a), decreasing its value to 98 increases the mean value of the winning percentage from 50.8 to 52.7 (see Fig 6b) while if it grows up to 105, the mean value of the winning percentage decreases to 48.6 (see Fig 6c).

Fig. 6
figure 6figure 6

Fourth simulated scenario

At last, to investigate the influence of the player’s age, we simulate a fifth scenario. Being equal all other factors’ configuration, by increasing the age from 26 (Fig. 7a) to 35 (Fig. 7b), the mean value of the winning percentage goes from 53.7 to 59.4. This could be explained by the higher value of the OFFRTG which also corresponds to an increase in the AST_RATIO for the older player.

Fig. 7
figure 7

Fifth simulated scenario

By these simulations, we have shown the potential uses of BNs in this context, suggesting that any other scenario of interest could be investigated and deeply analyzed.

Conclusions

The aim of this work is to contribute to sports analytics by introducing a multivariate approach and, therefore, more complex statistical models, such as BNs, to deal with many performance indicators at the same time, studying their interrelationships rather than analysing them individually.

As shown through the what-if analysis, the links among different measures are evident. By means of the concept of Markov Blanket, we also identified the key determinants of the winning percentage. Moreover, the role of the Four Factors has been analysed: they affect the quality of the offensive and defensive production of a player that, in turn, directly affect the winning percentage. From a methodological point of view, the use of an approach overcoming the Gaussian assumption by means of a Gaussian Copula, is of interest in this field and, in general, in Statistics since it is well-known to allow great flexibility in the models. In the future, other statistical models based on BNs could be proposed that take into account the position of the player and many other features not yet included, such as the different seasons of the league.