1 Introduction

Baseball has been the de facto national sport of the United States since the late nineteenth century, and its popularity has spread to Central and South America as well as the Caribbean and East Asia. It has inspired numerous movies and other works of art, and has also influenced the English language with expressions such as “hitting it out of the park” and “covering one’s bases”, as well as being the origin of the concept of a rain check. The ubiquitous baseball cap is worn across the world. Excluding the postseason, there are approximately 2430 games of major league baseball in a season, equating to 162 games per team played over 6 months. It speaks to the popularity of the game that the team with the lowest average attendance per game during the 2015 season still drew an average of 21,796 spectators (the highest drew 40,502).

Apart from being an extremely popular pastime, baseball also has a business side. A winning team tends to draw bigger crowds, increasing ticket and merchandise sales. From a managerial perspective it is therefore necessary to analyse players to identify beneficial trades and strategies, in order to create the conditions for a successful team. Furthermore, with the growth of sports betting and fantasy leagues, this analysis is of interest not only to decision makers within the teams, but to fans and spectators as well.

Bill James was the first to coin the term sabermetrics, referring to the empirical analysis of baseball statistics. SABR is an acronym for the Society for American Baseball Research, which was founded in 1971 and was a linchpin for modern-day baseball analysis, hence the name sabermetrics. James introduced the concept of ageing curves, a representation of how the ability of a player increases until it peaks at a certain age, and then declines. This was an important insight, as it explained some of the variance in the summary statistics used at the time. Another of James’s ideas was to compare current players with past players, creating a similarity score used to gain insight into how players performed relative to each other. Nate Silver later combined these ideas to create the famous PECOTA system, which essentially uses a nearest neighbour approach to create clusters of players based on their ageing curves (Silver 2012). While the system is used to predict how players will perform over the coming season, it can also give insight into the worst, best, and most likely scenarios, as the ageing curves in a player’s cluster are also informative.

While owners and managers have always to some degree incorporated statistics as part of their strategy to trade players and win games, it was the release of Michael Lewis’s book Moneyball (Lewis 2003) that unveiled to the general public the extent to which sabermetrics was used by Billy Beane when he managed the Oakland Athletics with great success, despite financial constraints. Today the use of sabermetrics is widespread across the league, but it still creates headlines when teams like the Houston Astros employ former NASA engineers (McTaggart 2012), or when breaches akin to corporate espionage are detected (Langosch 2015).

Naturally, one of the goals of sabermetrics is to produce predictions of players’ future performance. There is, as one might suspect, a rich history of approaches for this task, and an untold number of books, magazines, newsletters and forums are dedicated to furthering this endeavour (Baseball prospectus, www.baseballprospectus.com; Bill James online, www.billjamesonline.com; Baseball think factory, www.baseballthinkfactory.org). From these efforts several well-known systems have grown, such as Steamer (www.steamerprojections.com), Marcel (www.tangotiger.net/marcel), and the aforementioned PECOTA system (Baseball prospectus, www.baseballprospectus.com). However, sabermetricians are equally concerned with the process that leads to the prediction (Bill James’ similarity scores and Nate Silver’s outcome scenarios are examples of this). While we recognise the importance of making predictions, and acknowledge the difficulties in doing so, our intention in this paper is to understand the data. We are specifically interested in learning whether there are regimes in the career data, and if so, how these regimes transition into one another. A regime in this case is defined as a steady state of the relationships between the variables in the data, where these relationships may change between regimes. We will describe this in further detail in Sect. 2.1; for now it suffices to think of regimes as segments of the data for which a baseball player’s performance is different enough that it warrants estimating a new model.

The reason why a baseball player would transition into a new regime may not be recorded or even directly observable, but may be a complex combination of increased skill and experience, ageing, strategic decisions, etc. For some regime changes it may be more evident what could have caused the shift in the player’s performance, e.g. most of us would expect any sportsperson to play differently after an injury. What attracts our attention in this paper is the possibility of detecting regime transitions, and then identifying which of them are not directly explainable from publicly available data. This is valuable because, while coaches and managers make many decisions throughout a season, they may not be aware that a combination of these decisions may transition players to and from regimes in which they have been performing extraordinarily or subpar. Detecting regime changes in the data would allow decision makers to analyse the decisions they made during the time of the transition, to see if they can recreate (or avoid) the conditions that moved the player into the extraordinary (or subpar) regime.

We will approach this problem by using a combination of Markov chain Monte Carlo (MCMC) and Bayesian optimisation to learn a gated Bayesian network (GBN). A GBN combines several Bayesian networks (BNs) in order to describe the relationships amongst the variables during different regimes in the data. The BNs can be either active or inactive, and are combined using so-called gates, which control when the states of the contained BNs change. The GBN can be depicted as a directed graph, in which the nodes are either BNs or gates, and from this graph we can tell how the different regimes can transition into one another. We will introduce the GBN model in Sect. 2.3.

The rest of the paper is organised as follows. In Sect. 2 we will introduce regimes more formally, as well as giving an introduction to BNs and GBNs. In Sect. 3 we will contrast our approach with related work. In Sect. 4 we will explain in detail how we aim to learn a GBN to model a set of variables that exhibit regimes. We will in Sect. 5 return to the principal interest of this paper, using the proposed procedure on baseball players’ career data, where we will investigate the regime changes that may occur. In Sect. 5 we will also offer a brief introduction to the game of baseball. Finally, we will offer some concluding remarks and thoughts about future work in Sect. 6.

2 Regime changes and gated Bayesian networks

In this section, and the following two, we will depart from the world of baseball, in order to give a description of the general regime identification problem, as well as the model that we aim to apply. In Sect. 2.1 we offer the problem definition, which consists of two subgoals: representing and detecting regimes. With the aim of using a single model that incorporates both of these subgoals, we suggest using a generalisation of GBNs (Bendtsen and Peña 2016, 2013), described in Sect. 2.3. In Sect. 4 we offer an algorithm for learning such a model.

2.1 Problem definition

When we observe a probabilistic system over time, it is natural that the observations we make differ from each other. We expect this to be the case due to the randomness of the variables within the system. However, as long as the relationships between the variables, and the distributions of the individual variables are unchanged, the observations we make of the system can be judged as coming from the same joint probability distribution.

Let X, Y and Z be three random variables, and let there exist a joint probability distribution p(X, Y, Z). We can factorise this joint distribution using the chain rule to get \(p(X,Y,Z) = p(X|Y,Z)p(Y|Z)p(Z)\). If we also know that the relationship \(X \perp Y | Z\) holds, then the factorisation can be reduced to \(p(X,Y,Z) = p(X|Z)p(Y|Z)p(Z)\), where Y has been dropped from the conditioning set in the distribution of X, since they are independent given Z.

However, over time the relationships between the variables may change. Assume that after some time \(X \perp Y | Z\) no longer holds. This implies that the factorisation cannot be reduced, and therefore p(X, Y, Z) prior to the change and after the change are not the same. Observations before the change and after the change should no longer be judged to be samples from the same joint probability distribution.
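To make the effect of such a change concrete, the following numerical sketch (our own illustration, not part of the paper; the probability tables are made up) constructs a joint distribution in which \(X \perp Y | Z\) holds, and verifies that Y then drops out of the conditioning set of X:

```python
import itertools

# Binary X, Y, Z with X ⟂ Y | Z enforced by construction: the joint is
# built as p(X|Z) p(Y|Z) p(Z) (the probability tables are made up).
p_z = [0.3, 0.7]                       # p(Z)
p_x_z = [[0.9, 0.1], [0.2, 0.8]]       # p_x_z[z][x] = p(X=x | Z=z)
p_y_z = [[0.6, 0.4], [0.5, 0.5]]       # p_y_z[z][y] = p(Y=y | Z=z)

joint = {(x, y, z): p_x_z[z][x] * p_y_z[z][y] * p_z[z]
         for x, y, z in itertools.product([0, 1], repeat=3)}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Recover p(X | Y, Z) from the joint via the chain rule, and check that Y
# drops out of the conditioning set: p(X | Y=y, Z=z) = p(X | Z=z) for all y.
for y, z in itertools.product([0, 1], repeat=2):
    p_yz = joint[(0, y, z)] + joint[(1, y, z)]          # p(Y=y, Z=z)
    for x in [0, 1]:
        p_x_given_yz = joint[(x, y, z)] / p_yz          # p(X=x | Y=y, Z=z)
        assert abs(p_x_given_yz - p_x_z[z][x]) < 1e-12
print("X is independent of Y given Z: p(X|Y,Z) reduces to p(X|Z)")
```

If the construction of the joint were changed so that X and Y interact directly, the final assertions would fail: the reduced factorisation would no longer reproduce the joint, which is exactly the situation after a regime change.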

Formally, let \(\mathcal {S}\) be a system of random variables \(\mathbf {X}\), and let the system \(\mathcal {S}\) have regimes labeled \(R_1,R_2,\ldots ,R_m\). For each regime \(R_k\) there exists an independence model \(M_k\) over the variables \(\mathbf {X}\) such that the joint probability distribution \(p_k(\mathbf {X})\) is positive. Furthermore, assume that \(M_i~\ne ~M_j~\forall ~i \ne j\). Let \(\mathcal {D}\) be a dataset with complete observations of \(\mathbf {X}\), where \(d_i\) is the i:th observation in \(\mathcal {D}\).

By \(d_i \sim R_k\) we mean that \(d_i\) is an observation of the variables \(\mathbf {X}\) when \(\mathcal {S}\) is in regime \(R_k\), thus \(d_i\) is a sample from \(p_k(\mathbf {X})\). We will for brevity say that \(d_i\) comes from regime \(R_k\). Then \(\mathcal {D} = \{d_1 \sim R_1, d_2 \sim R_1, d_3 \sim R_1, d_4 \sim R_2, d_5 \sim R_2\}\) means that the first three observations came from regime \(R_1\) while the last two observations came from regime \(R_2\). When there is no ambiguity we will shorten \(\mathcal {D} = \{d_1 \sim R_1, d_2 \sim R_1, d_3 \sim R_1, d_4 \sim R_2, d_5 \sim R_2\}\) to \(\mathcal {D} = \{R_1, R_1, R_1, R_2, R_2\}\).

Given \(\mathcal {D} = \{R_1, R_1, R_1, R_2, R_2\}\) it is possible to directly identify which regimes can transition into each other, i.e. in this case \(\mathcal {S}\) can transition from \(R_1\) to \(R_2\). We call this the regime transition structure of \(\mathcal {S}\), which can be drawn as a graph or denoted with \(R_1 \rightarrow R_2\). Had the observations been different, such that \(\mathcal {D} = \{R_1, R_1, R_1, R_2, R_2, R_1, R_1\}\), then this would have identified a different structure where \(R_1\) can transition to \(R_2\), and \(R_2\) can transition to \(R_1\) (\(R_1 \leftrightarrows R_2\)). It is necessary to assume a Markovian property of the regime transitions, such that knowing that \(\mathcal {S}\) is in regime \(R_i\) is the only necessary information in order to identify which regimes it can transition into. We will defer the reasoning and consequences of this Markovian assumption to Sect. 4.2.2.
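Reading the regime transition structure off a labelled dataset is mechanical; a minimal Python sketch of our own (not the paper's notation) collects the observed transitions:

```python
def transition_structure(labels):
    """Collect the directed transitions R_i -> R_j observed in a
    regime-labelled sequence; under the Markovian assumption this set
    is exactly the regime transition structure identified by the data."""
    return {(a, b) for a, b in zip(labels, labels[1:]) if a != b}

# D = {R1, R1, R1, R2, R2} identifies R1 -> R2 only:
print(transition_structure(["R1", "R1", "R1", "R2", "R2"]))
# D = {R1, R1, R1, R2, R2, R1, R1} also identifies R2 -> R1:
print(transition_structure(["R1", "R1", "R1", "R2", "R2", "R1", "R1"]))
```

The difficulty addressed later in the paper is, of course, that the labels are not given: the segmentation and the number of regimes must themselves be learnt from the data.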

We say that \(\mathcal {D}\) is a dataset for \(\mathcal {S}\) if it identifies the true regime structure of \(\mathcal {S}\), i.e. \(\mathcal {D}\) is a valid sample of \(\mathcal {S}\). For instance, if the true regime structure of \(\mathcal {S}\) is \(R_1 \rightarrow R_2\) then \(\mathcal {D} = \{R_1,R_2\}\) is a dataset for \(\mathcal {S}\), while \(\mathcal {D} = \{R_1\}, \mathcal {D} = \{R_2\}, \mathcal {D} = \{R_2,R_1\}, \mathcal {D} = \{R_1,R_2,R_1\}\) are not. It is implied that datasets are made available to us completely and immediately. However, we will also make use of observations that are given to us one by one, as \(\mathcal {S}\) is observed over time. We call this a stream of data. The size of a stream \(\mathcal {O}\) depends on how many observations we have made thus far: at time \(t = 1\) the stream only contains a single observation, e.g. \(\mathcal {O} = \{d_1 \sim R_1\}\), and at time \(t = j\) it contains j observations, e.g. \(\mathcal {O} = \{d_1 \sim R_1, \ldots , d_j \sim R_k\}\).

Given a dataset \(\mathcal {D} = \{d_1,\ldots ,d_n\}\) for \(\mathcal {S}\), where it is not known from which regime the individual observations came, and it is not known how many regimes \(\mathcal {S}\) exhibits, the primary aim is to learn a model of \(\mathcal {S}\) from \(\mathcal {D}\). In order to do so we must:

  • Identify where in \(\mathcal {D}\) there are regime changes.

  • Identify the regimes \(R_1,\ldots ,R_m\) of \(\mathcal {S}\), and their corresponding independence models \(M_1,\ldots ,M_m\) and joint distributions \(p_1(\mathbf{X}),\ldots ,p_m(\mathbf{X})\).

  • Identify the regime transition structure of \(\mathcal {S}\).

Once such a model has been defined, the secondary aim is to take a stream of data \(\mathcal {O}\) and correctly identify which regime \(\mathcal {S}\) currently is in, given every new observation in \(\mathcal {O}\). In order to do so we must extend our model to allow it to detect regime changes in \(\mathcal {O}\). This will result in a model that can be used to reason about the current regime of \(\mathcal {S}\), and also to identify which \(M_k\) and \(p_k(\mathbf {X})\) should be used for inference purposes.

We will meet the primary and secondary aim by learning a GBN, which is a model that builds on BNs. We will therefore in the next section introduce BNs, followed by an explanation of the GBN model.

2.2 Bayesian networks

Introduced by Pearl (1988), a BN consists of two major components: a qualitative representation of independencies amongst random variables through a directed acyclic graph (DAG), and a quantification of certain marginal and conditional probability distributions, so as to define a full joint probability distribution. A feature of BNs, known as the local Markov property, implies that a variable is independent of all other non-descendant variables given its parent variables, where the relationships are defined with respect to the DAG of the BN. Let \(\mathbf {X}\) be a set of random variables in a BN, and let \(\Pi (X_i)\) be the set of variables that consists of the parents of variable \(X_i \in \mathbf{X}\), then the local Markov property allows us to factorise the joint probability distribution according to Eq. 1.

$$\begin{aligned} p(\mathbf{X}) = \prod _{X_i \in \mathbf{X}} p(X_i | \Pi (X_i)) \end{aligned}$$
(1)

From Eq. 1, it is evident that the independencies represented by the DAG allow for a representation of the full joint distribution via smaller marginal and conditional probability distributions, thus making it easier to elicit the necessary parameters, and allowing for efficient computation of posterior probabilities. For a full treatment of BNs please see Pearl (1988), Korb and Nicholson (2011) and Jensen and Nielsen (2007).
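As a concrete instance of Eq. 1 (our own toy example; the DAG \(Z \rightarrow X\), \(Z \rightarrow Y\) and the CPT numbers are invented for illustration), the following sketch computes the joint as a product of local distributions and checks that it normalises:

```python
import itertools

# DAG: Z -> X, Z -> Y over binary variables; each variable stores
# p(X_i | Pi(X_i)) as a table keyed by the parents' values.
parents = {"Z": [], "X": ["Z"], "Y": ["Z"]}
cpt = {
    "Z": {(): [0.4, 0.6]},                      # p(Z)
    "X": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},  # p(X | Z)
    "Y": {(0,): [0.7, 0.3], (1,): [0.5, 0.5]},  # p(Y | Z)
}

def joint(assign):
    """Eq. 1: p(X) = product over X_i of p(X_i | Pi(X_i))."""
    p = 1.0
    for var, pa in parents.items():
        p *= cpt[var][tuple(assign[q] for q in pa)][assign[var]]
    return p

# The factorised joint sums to one over all 2^3 full assignments.
total = sum(joint(dict(zip("ZXY", vals)))
            for vals in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # → 1.0
```

Note that only 1 + 2 + 2 free parameters are needed here, instead of the \(2^3 - 1\) required for an unconstrained joint over three binary variables, which is the parameter-elicitation advantage mentioned above.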

While a BN has advantages when representing a single independence model, it lacks the ability to represent several independence models simultaneously, i.e. it lacks the ability to model several regimes. Therefore, we suggest using a GBN to represent systems that have several regimes, where each regime is represented by a BN. GBNs were originally defined to operate on the posterior distribution of certain variables in the BNs (Bendtsen and Peña 2016, 2013), and to be used to model processes with distinct phases, such as algorithmic trading (Bendtsen and Peña 2016, 2014; Bendtsen 2015). However, in the coming section we will generalise the definition of GBNs, allowing them to operate on entire BNs (rather than a set of variables), while at the same time adding certain restrictions on how the GBNs can be constructed.

2.3 Gated Bayesian networks

GBNs use gates that connect with BNs using directed edges, where the parent/child relationship is given by the direction of the edge. Each gate has exactly one parent and one child BN. For instance, in Fig. 1 the gate \(G_1\) has BN \(R_1\) as its parent and BN \(R_2\) as its child. As is shown in the figure, a BN can be both a child and a parent for different gates (\(R_2\) and \(R_3\)), and while it is not evident from the figure, a BN can be a parent or a child of several gates.

Fig. 1 Example GBN

A feature of GBNs is that they keep the contained BNs in an active or inactive state. Assuming that all BNs are inactive in Fig. 1 except for \(R_1\), any probabilistic or causal queries should be answered using \(R_1\). However, the GBN also defines when the active/inactive state of the BNs change, thus one of the primary tasks of a GBN is to, for each new observation from a stream of data, re-evaluate which BN should be active.

To control this, the gates of a GBN are programmed with predefined logical expressions, known as the gates’ trigger logic. The trigger logic is an expression regarding the gate’s parent and child BNs. If the trigger logic is satisfied, then the gate is said to trigger; the gate’s parent BN is then deactivated while the child BN is activated.

Let there be a function \(f (\mathcal {B},\mathcal {T}) \in \mathbb {R}\), where \(\mathcal {B}\) is a BN, and \(\mathcal {T}\) is a dataset of observations over the variables in \(\mathcal {B}\). The trigger logic of a gate is defined as a comparison between the value of f given the parent and the value of f given the child, given some dataset \(\mathcal {T}\). For instance, the trigger logic for each gate in Fig. 1 is presented in Eq. 2, where we require that the child’s value divided by the parent’s value be greater than some threshold \(\theta \) in order for the gate to trigger. For each new observation in a stream of data, we re-evaluate the trigger logic of all gates that have an active parent BN.

$$\begin{aligned} \begin{aligned} TL(G_1)&:= \frac{f(R_{2}, \mathcal {T})}{f(R_{1}, \mathcal {T})}> \theta _1 \\ TL(G_2)&:= \frac{f(R_{3}, \mathcal {T})}{f(R_{2}, \mathcal {T})}> \theta _2 \\ TL(G_3)&:= \frac{f(R_{2}, \mathcal {T})}{f(R_{3}, \mathcal {T})} > \theta _3 \\ \end{aligned} \end{aligned}$$
(2)

By using a GBN to represent a system that has several regimes, we can model each regime using a BN, and then connect them with gates according to the regime transition structure. We can also define the trigger logic in such a way that the model is capable of, given a stream of observations, accurately keeping the BN active that represents the current regime. Therefore, a GBN is a model that can incorporate both aims presented in Sect. 2.1, and to this end we will suggest an algorithm in Sect. 4 to learn a GBN from data. Before we introduce this algorithm, we shall contrast our approach to related work.
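How a GBN such as the one in Fig. 1 could re-evaluate its gates on a stream can be sketched as follows. This is our own schematic, not the paper's implementation: the score function f, the window and the toy numbers are placeholders, and the gate list mirrors the trigger logic of Eq. 2.

```python
def step(active, gates, f, window):
    """Re-evaluate the trigger logic of every gate whose parent BN is
    active; if a gate triggers, its parent is deactivated and its child
    activated. Each gate is a (parent, child, threshold) triple."""
    for parent, child, theta in gates:
        if parent == active and f(child, window) / f(parent, window) > theta:
            return child          # gate triggered: child BN becomes active
    return active                 # no gate triggered: active BN unchanged

# The gates of Fig. 1: G1: R1 -> R2, G2: R2 -> R3, G3: R3 -> R2.
gates = [("R1", "R2", 1.0), ("R2", "R3", 1.0), ("R3", "R2", 1.0)]

# Toy stand-in for f(B, T): a fixed score per BN for the current window.
scores = {"R1": 0.2, "R2": 0.5, "R3": 0.1}
f = lambda bn, window: scores[bn]

active = "R1"
active = step(active, gates, f, window=None)
print(active)  # → R2, since f(R2)/f(R1) = 2.5 > theta_1 = 1.0
```

In practice f would measure how well each BN explains the window of recent observations, and the thresholds \(\theta \) are among the parameters optimised in Sect. 4.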

3 Related work

Refining or updating the structure and conditional distributions of a BN in response to new data has been studied for some time (Buntine 1991; Lam and Bacchus 1994; Friedman and Goldszmidt 1997; Lam 1998). However, these approaches assume that data is received from a stationary distribution, i.e. a system that does not undergo regime changes.

Nielsen and Nielsen (2008) approach the problem of having a stream of observations which they say is piecewise stationary, i.e. observations within a section of the stream come from the same distribution, but changes may occur into a new stationary section. Their goal is to incrementally learn the structure of a BN, adapting the structure as new observations are made available. They achieve this by monitoring local changes among the variables in the current network, and when a conflict occurs between what is currently learnt and what is observed, they refine the structure of the BN. Focusing only on local changes allows them to reuse all previous observations for parts of the BN that have not changed, resulting in a procedure that is less wasteful and more computationally feasible. Since their goal is to adapt the structure of a single BN, their aim is not to identify reoccurring steady state regimes, but rather to adapt to what is known as concept drift. Our goal is to learn a BN for each regime and to find the regime transition structure of the underlying system. We also intend to use our model to decide which regime is current given a new stream of data, preserving the learnt BNs for each regime. We do not make the assumption of local change between regimes, but relearn the entire BN, which allows for fewer assumptions about changes, but, as Nielsen and Nielsen (2008) point out, can be wasteful of data and computation time.

Robinson and Hartemink (2010) assume that a complete dataset exists, and use MCMC to identify change points and to revise a non-stationary dynamic BN (nsDBN). They have several potential moves that they can take within the MCMC iterations, for instance they can shift, split or merge change points, or add/remove an edge in the nsDBN at a change point, etc. By doing so they make the structural learning of the nsDBN an integral part of the MCMC proposals, while our approach plugs in any existing structure learning algorithm. Their aim is to segment the dataset and identify which structure was most likely at the different segments, and thereby discovering non-stationary relationships. Our approach is closely related to the approach of Robinson and Hartemink, in the sense that it is the BNs retained after MCMC that represent the model. However, we aim at identifying changes between static regimes, and not to capture the dynamics between timesteps, thus the GBN is not a DBN. Furthermore, Robinson and Hartemink do not approach the problem of trying to identify if regimes reoccur (each segment is considered unique). Therefore the regime transition structure is not captured, and this entails that they do not attempt to use the model on a new data stream in order to predict which of the already learnt structures should be used. Our approach addresses these two latter points.

The nsDBN approach sets itself apart from frameworks such as hidden Markov models (HMMs) and switching state-space models (Ghahramani and Hinton 2000) because, as Robinson and Hartemink state, a nsDBN “has no hidden variables, only observed variables”. In other words, the nsDBN approach does not assume that there is a hidden random variable representing the regime at each time point. For this reason, modelling the transition probabilities between such random variables, as HMMs and switching state-space models do, does not make sense. Our GBN approach sets itself apart from HMMs and switching state-space models for the same reasons as the nsDBN approach.

Guo et al. (2007) and Xuan and Murphy (2007) consider graphs with only undirected edges, i.e. Markov networks rather than BNs. The structure of the Markov network is considered latent by Guo et al. (2007), thus their proposed model is a HMM where each state of the latent variable is an entire structure. In order to identify segmentations in time series, Xuan and Murphy (2007) score segmentations by computing the marginal posterior over segmentations given all possible decomposable Markov networks. Thus the goal of Xuan and Murphy (2007) is to identify the segmentation, rather than the graph. As was pointed out earlier, the GBN that we learn does not contain any hidden variables, and the structure within each regime is explicitly learnt. Furthermore, we intend to identify reoccurring regimes and, given new data, predict which regime the system currently is in.

For more on adaptation, we refer the interested reader to a recent survey by Gama et al. (2014), and a comparison between GBNs and a broader range of models can be found in our previous publication (Bendtsen and Peña 2016).

The concept of regimes is not only confined to the realm of probabilistic graphical models. For instance, Markov regime-switching regression models have been employed in economic studies (Goldfeld and Quandt 1973; Hamilton 1989), where one switches the regression coefficients depending on the current regime. This model has also been applied to detect regimes when baseball players’ wages were put in relation to their recent performance (Haupert and Murray 2012), showing abrupt changes in the relationship between performance and salary (during the period 1911 through 1973).

In our previous publications (Bendtsen and Peña 2016, 2014), we proposed an algorithm for learning GBNs that can be used in decision making processes, specifically in trading financial assets. The GBNs learnt use posterior probabilities in the trigger logic of their gates to decide when it is an opportune time to buy and sell stock shares (the two decisions may use different BNs, thus a GBN is created). The algorithm uses a set of predefined BNs and gates and finds the candidate that achieves the highest reward. However, the algorithm does not learn the structure of these BNs from data; they are rather supplied by experts. Such a library of BNs was possible to create since there is a wealth of information about how the variables under consideration interact [in the case of Bendtsen and Peña (2016), Bendtsen and Peña (2014) we considered financial indicators as variables]. However, such information is not always available, for instance in the case of the current baseball setting. Furthermore, no regime identification step is taken as part of the learning algorithm, thus the resulting GBN is not the one that best describes the data (in terms of data likelihood), but rather the GBN that performs best according to some score function.

4 Learning algorithm

In this section we will describe the algorithm that we propose in order to learn a GBN, conforming to the definition in Sect. 2.3, that fulfils the primary and secondary aim of the problem definition in Sect. 2.1. The pseudocode for the learning algorithm can be found in “Appendix A”. There are three major steps involved, which we will briefly outline here, and then discuss in the rest of this section:

  1. The GBN that we are learning consists of a BN for each regime. We shall partition a dataset \(\mathcal {D}\) into subsets, treating each subset as observations from a regime, and learn the structure and parameters of a BN for each subset. The first step of the algorithm is to find the partitioning that represents the regime changes that occurred during the collection of \(\mathcal {D}\). To this end we will assume that observations within each subset are independent and identically distributed, and that we can calculate the likelihood of the entire model by the product of the likelihoods of the contained BNs.

  2. Since we only detect regime changes in the first step, the resulting regime transition structure is a chain of regimes, i.e. no regimes reoccur. However, we are also interested in identifying reoccurring regimes, and we will therefore hypothesise mergers of the identified subsets, to find the mergers that lead to the most likely regime transition structure.

  3. The final step of the learning algorithm is to introduce gates between the identified regimes, and to optimise the required parameters of these gates. The goal is to be able to use the GBN on a stream of observations, and for each new observation decide which regime the system currently is in.

The rest of this section describes these three steps in detail; in Sect. 5 we will return to the world of baseball, applying the proposed learning algorithm on both synthetic and real-world data.

4.1 Identifying regime changes in the dataset

In order to identify where regime changes have occurred in a given dataset, we use Metropolis–Hastings MCMC (MH). A typical Bayesian feature selection method is employed, where we have k splits \(\delta _1,\ldots ,\delta _k\), each defined by its position in the dataset \(\beta _1,\ldots ,\beta _k\), and an indicator variable \(I_1,\ldots ,I_k\) that is either 0 or 1. By defining \(\delta _i = I_i\beta _i\), the splits can move along the dataset by changing the corresponding \(\beta \), and be turned on and off by the corresponding I. For a certain configuration of \(\beta \)s and Is we specify a model by learning a BN for each subset of the data defined by the \(\delta \)s that are nonzero. We want to estimate the values of the \(\delta \)s given the available data, and are therefore interested in the posterior distribution over the \(\beta \)s and Is given a dataset \(\mathcal {D}\) with n observations. That is, we need samples from the posterior distribution in Eq. 3.

$$\begin{aligned} \begin{aligned}&p(\beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k | \mathcal {D}) \propto p(\mathcal {D} | \beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k)\\&\quad U(\beta _1 ; 0, \beta _2)U(\beta _2 ; \beta _1,\beta _3)\cdots U(\beta _k ; \beta _{k-1},n+1)p(I_1)\cdots p(I_k) \end{aligned} \end{aligned}$$
(3)

As is evident from Eq. 3, we a priori assume that the Is are independent, and that the \(\beta \)s are distributed according to discrete uniform distributions, where \(U(a; b, c)\) represents a discrete uniform distribution over a with range (b, c). A Bernoulli distribution is used as prior for each I, parameterised with \(p = 0.5\) to represent how likely each split is a priori.
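The parameterisation \(\delta _i = I_i\beta _i\) and the prior factors of Eq. 3 can be sketched as follows (a sketch of our own; the chained discrete-uniform bounds follow Eq. 3, and the Bernoulli \(p = 0.5\) priors on the Is follow the text):

```python
import math

def active_splits(beta, I):
    """The nonzero deltas (gamma_1, ..., gamma_k') that segment D."""
    return [b for b, ind in zip(beta, I) if ind == 1]

def log_prior(beta, I, n):
    """log of U(beta_1; 0, beta_2) ... U(beta_k; beta_{k-1}, n+1) times
    the independent Bernoulli(0.5) priors on the Is."""
    lp = 0.0
    bounds = [0] + list(beta) + [n + 1]
    for j in range(1, len(bounds) - 1):
        lo, b, hi = bounds[j - 1], bounds[j], bounds[j + 1]
        if not lo < b < hi:
            return -math.inf          # ordering of the betas violated
        lp -= math.log(hi - lo - 1)   # discrete uniform strictly inside (lo, hi)
    return lp + len(I) * math.log(0.5)

print(active_splits([10, 25, 40], [1, 0, 1]))  # → [10, 40]
# An out-of-order configuration has zero prior probability:
print(log_prior([25, 10, 40], [1, 1, 1], 100) == -math.inf)  # → True
```

Within a MH iteration this log-prior would be added to the log marginal likelihood of Sect. 4.1.1 to evaluate the (unnormalised) log posterior of a proposed configuration.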

4.1.1 Likelihood derivation

In order to sample from the posterior in Eq. 3, we must be able to calculate the marginal likelihood of the data, \(p(\mathcal {D} | \beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k)\). To do this we use the following restructuring:

  • Let \(\{\gamma _1,\ldots ,\gamma _{k^\prime }\}\) represent the subset of \(\{ \delta _1,\ldots ,\delta _k \}\) for which the corresponding \(I_1,\ldots ,I_k\) are equal to one, i.e. \(\{\gamma _1,\ldots ,\gamma _{k^\prime }\}\) represents the nonzero \(\delta \)s. We will first assume that \(k^\prime > 1\); the two cases of \(k^\prime = 0\) and \(k^\prime = 1\) will be addressed subsequently.

  • Let \(\mathcal {D}_{\gamma _i}^{\gamma _j-1}\) represent observations \(d_{\gamma _i},\ldots ,d_{\gamma _j-1}\) where \(i < j\).

  • We can then write the marginal likelihood expression as:

    $$\begin{aligned} p(\mathcal {D} | \beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k) = p\left( \left\{ \mathcal {D}_1^{\gamma _1 - 1},\mathcal {D}_{\gamma _1}^{\gamma _2 - 1},\ldots ,\mathcal {D}_{\gamma _{k^\prime }}^{n}\right\} | \gamma _1,\ldots ,\gamma _{k^\prime }\right) \end{aligned}$$
    (4)
  • Since we have assumed that a different model holds for each subset, and that we can calculate the marginal likelihood of the entire model by the product of the contained models, we can further rewrite the expression from Eq. 4:

    $$\begin{aligned} \begin{aligned}&p\left( \left\{ \mathcal {D}_1^{\gamma _1 - 1},\mathcal {D}_{\gamma _1}^{\gamma _2 - 1},\ldots ,\mathcal {D}_{\gamma _{k^\prime }}^{n}\right\} | \gamma _1,\ldots ,\gamma _{k^\prime }\right) \\&\quad = p\left( \mathcal {D}_1^{\gamma _1 - 1} | \gamma _1\right) p\left( \mathcal {D}_{\gamma _1}^{\gamma _2 - 1}| \gamma _1,\gamma _2\right) ~\cdots ~p\left( \mathcal {D}_{\gamma _{k^\prime }}^{n} | \gamma _{k^\prime }\right) \end{aligned} \end{aligned}$$
    (5)
  • We shall use a BN for each subset, thus the full Bayesian approach of evaluating each factor on the right hand side in Eq. 5 would be to sum over all BNs, e.g. \(p(\mathcal {D}_1^{\gamma _1 - 1} | \gamma _1) = \sum _{BN_i \in \mathbf{BN}} p(\mathcal {D}_1^{\gamma _1 - 1} | BN_i)p(BN_i | \gamma _1)\). However, this is impractical and therefore we shall approximate each factor by using the maximum a posteriori (MAP) BN, obtained by using a learning algorithm L. Let \(L(\gamma _i,\gamma _j-1)\) represent the BN learnt using algorithm L and dataset \(\mathcal {D}_{\gamma _i}^{\gamma _j - 1}\). Then the MAP approximation to the full Bayesian approach is given by:

    $$\begin{aligned} \begin{aligned}&p(\mathcal {D} | \beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k) = \\&p\left( \mathcal {D}_1^{\gamma _1 - 1} | L(1,\gamma _1-1)\right) p\left( \mathcal {D}_{\gamma _1}^{\gamma _2 - 1} | L(\gamma _1,\gamma _2 - 1)\right) ~\cdots ~p\left( \mathcal {D}_{\gamma _{k^\prime }}^{n} | L(\gamma _{k^\prime },n)\right) \end{aligned} \end{aligned}$$
    (6)

In the special case when \(k^\prime = 1\), only the first and last factors of Eq. 6 apply, and when \(k^\prime = 0\) we have \(p(\mathcal {D} | \beta _1,\ldots ,\beta _k,I_1,\ldots ,I_k) = p(\mathcal {D} | L(1,n))\).

From Eq. 6 we can deduce that to calculate the marginal likelihood in Eq. 3, we must calculate the marginal likelihood of the subset of data used to learn the corresponding BN. For discrete BNs there exists a closed form expression to exactly calculate the marginal likelihood of a BN structure via the Bayesian–Dirichlet equivalent (BDe) score (Heckerman et al. 1995) (where the parameters of the marginal and conditional distributions have been marginalised out, assuming a conjugate Dirichlet prior, and it is assumed that observations are independent and identically distributed).
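To make the factorisation concrete, the score in Eq. 6 can be sketched as a sum of per-segment log marginal likelihoods. The per-segment scorer is assumed to be given (e.g. a BDe-based scorer together with a structure learning algorithm, as described above); `log_segment_score` is a hypothetical name for it:

```python
def log_marginal_likelihood(deltas, n, log_segment_score):
    """Log of Eq. 6: the data is split at the nonzero deltas and each
    segment is scored independently. log_segment_score(a, b) is assumed
    to return the log marginal likelihood of observations a..b under the
    MAP BN learnt from that segment (e.g. via the BDe score)."""
    gammas = sorted(d for d in deltas if d > 0)   # the nonzero deltas
    bounds = [1] + gammas + [n + 1]               # segment start points
    return sum(log_segment_score(bounds[i], bounds[i + 1] - 1)
               for i in range(len(bounds) - 1))
```

With no nonzero \(\delta \)s this reduces to a single call `log_segment_score(1, n)`, matching the \(k^\prime = 0\) case.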

4.1.2 Proposal distributions

The MH simulation is initialised by setting all \(\beta \)s so that the \(\delta \)s are evenly spaced out across the dataset, and all Is are set to 1. Proposing new values for the indicator variables I is done via a Bernoulli distribution with \(p = 0.5\), so that it is equally likely that the I will change or that it will stay the same. The \(\beta \) values require a bit more work, as there are constraints on how we want them to be proposed.

First of all, the \(\beta \)s have to be proposed within the bounds of the dataset, i.e. between 1 and n. Secondly, we want \(\beta _i < \beta _j\) for all \(i < j\), so that two \(\beta \)s never collide or jump over each other. Finally, we want the \(\beta \)s to have a positive probability of taking on any value between their respective upper and lower bounds. Since the \(\beta \)s are positions in a dataset they are discrete values, thus we can create a proposal distribution for each \(\beta \) according to Eqs. 7 through 11. In the equations, \(\beta _j\) is the current value while \(\beta _j^*\) is the proposed value. In Eqs. 7 and 8 we define the upper and lower bound for \(\beta _j^*\), in such a way that \(\beta _j^*\) always is proposed within the dataset, and so that \(\beta _j^*\) never can be proposed as the same value as any of the other \(\beta \)s. In Eqs. 9 and 10, each allowed value i of \(\beta _j^*\) is given a probability that is proportional to its distance from the current value \(\beta _j\). Z is a normalisation constant defined in Eq. 11.

$$\begin{aligned} lb(\beta _j^*)&= {\left\{ \begin{array}{ll} 1 &{} \text {if } j = 1 \\ \beta _j - \left\lfloor \frac{1}{2}(\beta _j - \beta _{j-1}) \right\rfloor + 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)
$$\begin{aligned} ub(\beta _j^*)&= {\left\{ \begin{array}{ll} n &{} \text {if } j = k \\ \beta _j + \left\lfloor \frac{1}{2}(\beta _{j + 1} - \beta _j) \right\rfloor - 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)
$$\begin{aligned} \kappa&= \max (\beta _j - lb(\beta _j^*), ub(\beta _j^*) - \beta _j) \end{aligned}$$
(9)
$$\begin{aligned} p(\beta _j^* = i | \beta _j)&= {\left\{ \begin{array}{ll} \frac{1}{Z}(1 + i - \beta _j + \kappa ) &{} \text {for}~lb(\beta _j^*) \le i \le \beta _j \\ \frac{1}{Z}(1 - i + \beta _j + \kappa ) &{} \text {for}~\beta _j < i \le ub(\beta _j^*) \end{array}\right. } \end{aligned}$$
(10)
$$\begin{aligned} Z&= \sum _{i = lb(\beta _j^*)}^{\beta _j} (1 + i - \beta _j + \kappa ) + \sum _{i = \beta _j + 1}^{ub(\beta _j^*)} (1 - i + \beta _j + \kappa ) \end{aligned}$$
(11)
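As a sketch, the bounded, linearly decaying proposal of Eqs. 7 through 11 can be written as follows. Here `betas` holds the current positions in ascending order and `j` is a 0-based index (the equations index from 1); the bounds are chosen so that a proposal can never land on a neighbouring \(\beta \):

```python
import random

def proposal_distribution(betas, j, n):
    """Proposal distribution for betas[j] following Eqs. 7-11.
    betas: current positions in ascending order, n: dataset size.
    Returns a dict mapping each allowed value to its probability."""
    b = betas[j]
    # Eq. 7: lower bound (1 if this is the first beta)
    low = 1 if j == 0 else b - (b - betas[j - 1]) // 2 + 1
    # Eq. 8: upper bound (n if this is the last beta)
    high = n if j == len(betas) - 1 else b + (betas[j + 1] - b) // 2 - 1
    # Eq. 9: offset so that every allowed value has positive probability
    kappa = max(b - low, high - b)
    # Eq. 10: probability decays linearly with distance from the current value
    w = {i: (1 + i - b + kappa) if i <= b else (1 - i + b + kappa)
         for i in range(low, high + 1)}
    Z = sum(w.values())                # Eq. 11: normalisation constant
    return {i: wi / Z for i, wi in w.items()}

def propose(betas, j, n, rng=random):
    """Draw one proposed value for betas[j]."""
    probs = proposal_distribution(betas, j, n)
    values = sorted(probs)
    return rng.choices(values, weights=[probs[v] for v in values])[0]
```

For example, with \(n = 39\) and current positions 19 and 21, the support of the first \(\beta \)'s proposal ends at 19 and the second's starts at 21, so the two can never move past each other.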

To visualise this, assume that there are two \(\beta \)s and \(n = 39\). In the top left plot in Fig. 2 the current values are \(\beta _1 = 10\) and \(\beta _2 = 30\). As can be seen, \(\beta _1^*\) and \(\beta _2^*\) can be proposed in both directions within their respective bounds. In the top right plot the two \(\beta \)s are positioned at 1 and 39 respectively, and are now constrained to move in only one direction due to the dataset bounds. In the bottom left plot the two \(\beta \)s are positioned at 15 and 25, and points that they have in common in their range are removed from both proposals. The bottom right plot shows the case where the \(\beta \)s are positioned at 19 and 21, and there is no probability of the \(\beta \)s moving closer to each other.

Fig. 2

Example proposal distributions for \(\beta \)s

4.1.3 Iterations and samples

While it is possible to monitor the MH iterations to decide when to stop sampling, we run the simulation for a fixed number of iterations and throw away the first half of the samples (treating them as burn-in samples). We use the mean of the marginal of each scalar parameter \(\beta \) and I and round to integer values, i.e. if the marginal mean of \(I_j\) is 0.4 we will round this to 0 and if it is 0.6 we will round it to 1. The resulting nonzero \(\delta \)s then identify where in the dataset regime changes have occurred.
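The summarisation of the retained samples can be sketched as follows, assuming (as in Sect. 4.1.1) that \(\delta _j\) equals \(\beta _j\) when \(I_j = 1\) and is zero otherwise:

```python
def point_estimates(samples):
    """Posterior point estimates from the retained MH samples (Sect. 4.1.3).
    samples: list of (betas, Is) pairs drawn after burn-in. Returns the
    rounded marginal means and the resulting deltas (delta_j = beta_j
    when I_j = 1, else 0)."""
    m = len(samples)
    k = len(samples[0][0])
    betas = [round(sum(s[0][j] for s in samples) / m) for j in range(k)]
    Is = [round(sum(s[1][j] for s in samples) / m) for j in range(k)]
    deltas = [b if i == 1 else 0 for b, i in zip(betas, Is)]
    return betas, Is, deltas
```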

Fig. 3

Example of merging subsets

4.2 Identifying regimes and structure

Having identified nonzero \(\delta \)s where regime changes have occurred in a dataset \(\mathcal {D}\), the next step is to identify the individual regimes, as well as the regime transition structure of the underlying system. Naïvely assuming that each change identifies a new regime would lead to a chain of regimes, i.e. if there were two nonzero \(\delta \)s identified we would have \(R_1 \rightarrow R_2 \rightarrow R_3\). While it may be entirely possible for a system to exhibit this type of regime structure, it is also possible that \(R_2\) transitioned back to \(R_1\), and not into a new regime \(R_3\). Therefore we must hypothesise each possible recursive merging of nonadjacent subsets, as defined by the nonzero \(\delta \)s, and score a new model based on these new subsets. An explanation and an example of this merging follow.

4.2.1 Merging subsets

In the example in Fig. 3 we have identified three nonzero \(\delta \)s, resulting in four subsets of the data (labelled 1, 2, 3 and 4). The first hypothesis requires no merging at all; it suggests that each subset identifies a new regime (depicted to the left) and the regime transition structure is therefore a chain of four regimes (depicted to the right). From the first hypothesis we cannot merge subsets 1 and 2, since they are adjacent and we would not have a nonzero \(\delta \) here if the two subsets belonged to the same regime. A new hypothesis can however be constructed by merging subsets 1 and 3, resulting in the second hypothesis in the figure. Note that we have now labelled subset 3 with \(R_1\) and subset 4 with \(R_3\), as we now only have three regimes rather than four in the previous hypothesis. Should this hypothesis be true, then the regime transition structure must also reflect that from \(R_1\) it is possible to transition to \(R_3\), and from \(R_2\) it is possible to transition to \(R_1\). Thus when two subsets are merged, we also merge the parent and child sets in the regime transition structure. From the second hypothesis the only merging that can be done is to merge subsets 2 and 4, resulting in the third hypothesis. Note that the example is not complete, as we should go back to the first hypothesis and start the recursive procedure again, but this time by merging subsets 1 and 4 (and similarly so for 2 and 4).

In order to identify the regime structure of a system \(\mathcal {S}\), we start by hypothesising the chain of regimes that is given directly from the nonzero \(\delta \)s, and then continue to hypothesise each possible recursive merging. For each hypothesis, we merge the subsets of \(\mathcal {D}\) accordingly, and learn a new set of BNs using these merged subsets. The learnt BNs are scored according to their marginal likelihood (a simple rephrasing of Eq. 6). The hypothesis with the highest marginal likelihood, among all the hypotheses, represents the identified regime structure of \(\mathcal {S}\). While merging, previously scored hypotheses may be created, so the number of hypotheses to score can be substantially pruned. To see this we can continue the merging example from before. This time we start by merging subsets 2 and 4, resulting in subset labelling \(R_1,R_2,R_3,R_2\), and then we merge subsets 1 and 3, resulting in \(R_1,R_2,R_1,R_2\), which is the same as the last hypothesis when we started by merging 1 and 3.

The number of hypotheses to score is directly connected to the number of nonzero \(\delta \)s identified. If there are no nonzero \(\delta \)s, or only one, then no merging is possible, resulting in a single hypothesis. If there are two nonzero \(\delta \)s there are two hypotheses to score, and with three nonzero \(\delta \)s there are five hypotheses to score. If there are \(k^\prime \) nonzero \(\delta \)s, then the number of hypotheses to score equals the number of partitions of \(\{0,\ldots ,k^\prime \}\) into subsets of nonconsecutive integers (including the partition where each element belongs to its own subset, which we would interpret as a chain of regimes). Following from Theorem 2.1 in Munagi (2005), the number of hypotheses to score therefore follows the well-known Bell numbers (A000110 in OEIS), where the first ten numbers are: 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147. By starting with the partition where each element belongs to its own subset, and then recursively merging two subsets (without violating the constraint that the subsets must not contain consecutive numbers), we will visit each allowed partition, thus each possible hypothesis will be scored. While at some point the computational effort required to score hypotheses may exceed current limitations, regime changes are infrequent in the current setting, keeping the effort within the feasible range.
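The enumeration of hypotheses described above can be sketched as a recursive generator over partitions of the subset indices, where each block of a partition is one regime and no block may contain two adjacent subsets:

```python
def hypotheses(num_subsets):
    """Generate every regime-structure hypothesis for the given number
    of data subsets: all partitions of {0, ..., num_subsets - 1} in
    which no block contains two adjacent subsets (adjacent subsets
    cannot belong to the same regime). Each block is one regime."""
    def rec(i, blocks):
        if i == num_subsets:
            yield [sorted(b) for b in blocks]
            return
        for b in blocks:                 # merge subset i into an old regime
            if i - 1 not in b:           # ...unless they are adjacent
                b.add(i)
                yield from rec(i + 1, blocks)
                b.remove(i)
        blocks.append({i})               # or let subset i start a new regime
        yield from rec(i + 1, blocks)
        blocks.pop()
    yield from rec(0, [])
```

The counts reproduce the Bell numbers: 1, 2, 5, 15, ... hypotheses for 2, 3, 4, 5, ... subsets (i.e. 1, 2, 3, 4 nonzero \(\delta \)s), and the all-singletons partition is the chain-of-regimes hypothesis.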

4.2.2 Markovian assumption

Merging the parent and child sets of the regime transition structure is only valid if we assume that the transitions are first order Markovian. For instance, in the second hypothesis in Fig. 3 we assume that the system can transition from \(R_1\) to \(R_3\); however, without the Markovian assumption the data suggests that the system can transition from \(R_1\) to \(R_3\) only if it has transitioned to \(R_2\) previously. If the system only allowed transitions from \(R_1\) to \(R_3\) after it had passed through \(R_2\), then the system would be somehow different depending on whether it had gone through \(R_2\) or not. It would then be arguable that \(R_1\) prior to \(R_2\) would not be the same as \(R_1\) after \(R_2\), and therefore they would not truly be identical regimes. Hence, it is reasonable to assume this Markovian property of the regime transition structure, allowing us to identify different regime transition structures depending on the most likely hypothesis.

4.3 Constructing a GBN

Having identified the regime transition structure, and the BNs that represent each regime, the final step to complete the model of the system is to construct a GBN and configure the gates. The goal is to accept a stream of observations, and for each new observation decide which regime the system currently is in. In each gate we define a comparison between the parent and the child BN in such a way that if the child is preferred over the parent, then the gate triggers, thereby deactivating the parent and activating the child. This comparison is a likelihood ratio between the parent and child using a moving window of the most recent \(\tau \) datapoints. If this ratio is above some threshold \(\theta \), then the gate triggers. Let \(\mathcal {O}\) be a stream of observations and \(\mathcal {O}_\tau \) represent the \(\tau \) most recent observations, then \(p(\mathcal {O}_{\tau } | R_1)\) is the likelihood of the most recent \(\tau \) observations given regime \(R_1\). Each gate is then configured to trigger when \(p(\mathcal {O}_{\tau } | R_{child})/p(\mathcal {O}_{\tau } | R_{parent}) > \theta \).

For instance, assuming that we have identified the regime structure \(R_1 \rightarrow R_2 \leftrightarrows R_3\), we construct the GBN in Fig. 1, where the square nodes represent BNs for each regime. For this GBN we define the trigger logic in Eq. 12, where we use the same \(\tau \) and \(\theta \) for all gates.

The remaining task, which we will describe in Sect. 4.3.1, is to choose appropriate values for \(\tau \) and \(\theta \). The task requires the optimisation of a score function, and in the current setting there are two major factors that we need to consider. First, the function that we wish to optimise is a black box, and second, evaluating the function is expensive. As we cannot compute gradients analytically we would have to sample them, which would require several evaluations of the function during each iteration (which we would like to avoid). The Bayesian approach is to go from a prior to a posterior utilising observations that we collect, hopefully identifying the global maximum of the function. However, this implies being able to compute posteriors from priors, but since we cannot describe our function in closed form, we cannot produce an expression for the posterior. A non-parametric approach (i.e. one that does not depend on the shape of the function) is to use a Gaussian process (GP) to act as a proxy for the function. This derivative free technique, commonly known as Bayesian optimisation (Brochu et al. 2009), has become a popular way of optimising expensive black box functions, and we will now turn our attention to how we shall adopt this framework for the task of choosing \(\tau \) and \(\theta \).

$$\begin{aligned} \begin{aligned} TL(G_1)&:= \frac{p(\mathcal {O}_{\tau } | R_{2})}{p(\mathcal {O}_{\tau } | R_{1})}> \theta \\ TL(G_2)&:= \frac{p(\mathcal {O}_{\tau } | R_{3})}{p(\mathcal {O}_{\tau } | R_{2})}> \theta \\ TL(G_3)&:= \frac{p(\mathcal {O}_{\tau } | R_{2})}{p(\mathcal {O}_{\tau } | R_{3})} > \theta \end{aligned} \end{aligned}$$
(12)
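A sketch of a single gate's trigger logic, assuming each regime BN exposes a per-observation log-likelihood function (the names `child_loglik` and `parent_loglik` are hypothetical):

```python
import math
from collections import deque

def make_gate(child_loglik, parent_loglik, tau, theta):
    """Gate trigger per Eq. 12: fires once the likelihood ratio of the
    child BN to the parent BN over the last tau observations exceeds
    theta (computed in log space for numerical stability)."""
    window = deque(maxlen=tau)          # the tau most recent observations
    def process(obs):
        window.append(child_loglik(obs) - parent_loglik(obs))
        return len(window) == tau and sum(window) > math.log(theta)
    return process
```

For example, if the child BN assigns each observation twice the likelihood of the parent, the gate with \(\tau = 3\) and \(\theta = 2\) triggers once three observations have been seen.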

4.3.1 Gaussian processes and Bayesian optimisation

Ultimately, we would like to find the parameter pair \(\Lambda = \{\tau ,\theta \} \in \mathbf \Lambda \) that maximises an objective function f. In the current setting, f is the accuracy of the predictions of the current regime (we will give more attention to the details of f in Sect. 4.3.2). While we can evaluate f at a given point \(\Lambda \), we do not have a closed form expression for f, thus we cannot solve the problem analytically. Instead, we will introduce a random variable g to act as a proxy for f, and place on g a prior over all possible functions. This is done by using a GP, which is defined as a distribution over an infinite number of variables, such that any finite subset of these variables are jointly multivariate Gaussian. Thus, for each \(\Lambda \) we treat \(g(\Lambda )\) as a Gaussian random variable, where \(f(\Lambda )\) is a realisation of the random variable \(g(\Lambda )\), and treat a set of such random variables as jointly multivariate Gaussian.

Formally, the GP is parameterised by a mean function \(\mu (\Lambda )\), and a covariance kernel \(k(\Lambda ,\Lambda ^\prime )\), which are defined over all input pairs \(\Lambda ,\Lambda ^\prime \in \mathbf \Lambda \). The mean function \(\mu (\Lambda )\) represents the expected value of \(g(\Lambda )\), and it is commonly assumed to be zero for all \(\Lambda \in \mathbf \Lambda \), although this is not necessary if prior information exists to suggest otherwise. The covariance kernel \(k(\Lambda ,\Lambda ^\prime )\) represents how much the random variables \(g(\Lambda )\) and \(g(\Lambda ^\prime )\) covary, thus it defines how smooth the function f is thought to be, and can therefore be tuned to one's prior belief about f. For instance, by using the radial basis kernel in Eq. 13, we can tune c to fit our prior belief about smoothness.

$$\begin{aligned} k(\Lambda ,\Lambda ') = \exp (-c ||\Lambda - \Lambda '||^2) \end{aligned}$$
(13)

For points close to each other, Eq. 13 will result in values close to 1, while points further apart will be given values closer to 0. The GP prior will inherit the same smoothness properties, as the covariance matrix is completely defined by k. To visualise the smoothness achieved by tuning c, Fig. 4 shows how the covariance decreases as the distance grows, for three different settings of c (0.5, 1.0 and 2.0). As can be seen, larger values of c lead to a faster decrease, thus less smoothness is assumed.

Fig. 4

Covariance decreases by distance

Let \(\Lambda _{1:i}\) represent i different \(\Lambda \)s, without any specific ordering, and let \(f_{1:i}\) represent the value of f at these \(\Lambda \)s, i.e. \(f_j = f(\Lambda _j)\). Similarly, let \(g_j\) represent the random variable \(g(\Lambda _j)\). Having collected observations \(\{ \Lambda _{1:i}, f_{1:i} \}\), we can calculate the posterior distribution of g for some new input \(\Lambda _{i+1}\), where both the prior smoothness and the observed data have been considered. A closed form expression exists for this calculation as described in Eq. 14. Notice that Eq. 14 is the usual equation to compute the conditional density function of X given an observation for Y, when X and Y are jointly Gaussian. For more on GPs, please see Rasmussen and Williams (2006).

$$\begin{aligned} \begin{aligned} g_{i+1} \,|\, \Lambda _{1:i+1},f_{1:i}&\sim \mathcal {N}\left( \mathbf {k}^\top \mathbf {K}^{-1}f_{1:i},\; k(\Lambda _{i+1},\Lambda _{i+1}) - \mathbf {k}^\top \mathbf {K}^{-1}\mathbf {k}\right) ,\\ \text {where } [\mathbf {K}]_{jl}&= k(\Lambda _j,\Lambda _l) \text { and } [\mathbf {k}]_j = k(\Lambda _{i+1},\Lambda _j) \end{aligned} \end{aligned}$$
(14)

Since g is acting as a proxy for the objective function f, using a GP allows us to encode prior beliefs about the objective function, and sampling the objective function allows us to update the posterior over objective functions. In the framework of Bayesian optimisation (Brochu et al. 2009), we utilise the posterior over objective functions to inform us of how to iterate between sampling and updating, in order to find the \(\Lambda \) that maximises the objective function f. The next \(\Lambda \) for which to evaluate f is the \(\Lambda \) that maximises an acquisition function. Several acquisition functions have been suggested, however the goal is to trade off exploring areas where the posterior uncertainty is high, while exploiting points that have a high posterior mean. We will use the upper confidence bound criterion, which is expressed as \(UCB(\Lambda ) = \mu (\Lambda ) + \eta \sigma (\Lambda )\), where \(\mu (\Lambda )\) and \(\sigma (\Lambda )\) represent the posterior mean and standard deviation of \(g(\Lambda )\), and \(\eta \) is a tuning parameter to allow for more exploration (as \(\eta \) is increased) or more exploitation (as \(\eta \) is decreased). Once a predetermined number of iterations m have passed, the \(\Lambda _i\) for which \(f_i\) is greatest in \(\{\Lambda _{1:m},f_{1:m}\}\) is the set of parameters that maximises the objective function.
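A pure-Python sketch of the posterior update (Eq. 14) and the UCB criterion, assuming a zero-mean prior and the radial basis kernel of Eq. 13 (a small jitter term is added to keep the linear solve stable; in practice a linear-algebra library would be used):

```python
import math

def rbf(x, y, c=2.0):
    """Radial basis kernel of Eq. 13 over parameter pairs (tau, theta)."""
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(x, y)))

def solve(A, b):
    """Solve Ax = b by Gaussian elimination (small systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c2 in range(col, n + 1):
                M[r][c2] -= f * M[col][c2]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c2] * x[c2]
                              for c2 in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(X, f, x_new, kernel=rbf, jitter=1e-9):
    """Posterior mean and standard deviation of g(x_new) given the
    observations {X, f}, as in Eq. 14 (zero-mean prior assumed)."""
    K = [[kernel(xi, xj) + (jitter if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    k_star = [kernel(x_new, xi) for xi in X]
    mu = sum(ks * a for ks, a in zip(k_star, solve(K, f)))
    var = kernel(x_new, x_new) - sum(ks * v
                                     for ks, v in zip(k_star, solve(K, k_star)))
    return mu, math.sqrt(max(var, 0.0))

def ucb(X, f, x_new, eta=5.0):
    """Upper confidence bound acquisition: mu + eta * sigma."""
    mu, sigma = gp_posterior(X, f, x_new)
    return mu + eta * sigma
```

Far from any observed point the posterior reverts to the prior (mean 0, variance 1 under Eq. 13), so the UCB criterion naturally favours unexplored regions when \(\eta \) is large.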

4.3.2 Evaluating accuracy

In our case, the objective of Bayesian optimisation is to find the pair \(\tau \) and \(\theta \) that maximises the accuracy of the GBN. We say that the GBN is correct if, after processing a new observation in a stream of data (see Sect. 2.3), the active BN corresponds to the true BN that generated the observation. Thus, if an observation in a stream truly came from \(R_1\) and the GBN has \(R_1\) active after processing the observation, then this is considered correct, while any other situation is considered incorrect.

Since we only have access to the training dataset during the elicitation process, there is a risk that we may overfit this data. In order to reduce this risk we resample the observations in each identified subset of the original data, and concatenate the resamples to create a new dataset of the same size as the original one. Doing so several times, we synthetically create new datasets from the original one. As a result of the resampling we can calculate the average accuracy over all resampled datasets for a pair of \(\tau \) and \(\theta \). The objective function f that we aim to maximise is therefore the average accuracy of a pair \(\tau \) and \(\theta \) over all the resamples. Notice that we do not use the true regime labels from the training data, since we assume that these are not known to us, but rather add regime labels based on the regime transition structure identified during subset merging.
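The resampling step can be sketched as a bootstrap within each identified subset, concatenated in the original order (the regime labels are those assigned during subset merging):

```python
import random

def resample_stream(labelled_subsets, rng=random):
    """Create one synthetic stream by resampling each subset with
    replacement (Sect. 4.3.2). labelled_subsets is a list of
    (regime_label, observations) pairs in their original order."""
    stream, labels = [], []
    for label, obs in labelled_subsets:
        stream.extend(rng.choices(obs, k=len(obs)))   # bootstrap the subset
        labels.extend([label] * len(obs))
    return stream, labels
```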

Given a GBN with all gates parameterised with some values \(\tau \) and \(\theta \), and a stream of data \(\mathcal {O}\), where each observation is associated with a regime label, the accuracy of the GBN is calculated as follows:

  • Activate the BN that represents \(R_1\), deactivate all others.

  • For the first \(\tau - 1\) observations from \(\mathcal {O}\), do not process anything; the current regime according to the GBN will be \(R_1\) for all of these observations, since having fewer than \(\tau \) observations does not allow us to compute the values in Eq. 12.

  • Process each observation beyond the first \(\tau - 1\) observations; the active BN for that observation represents the current regime according to the GBN.

  • When no more observations are available, count the number of observations for which the GBN had the correct BN active.

  • The accuracy of the GBN is the number of times the GBN was correct divided by the number of observations in the stream \(\mathcal {O}\).
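The steps above can be sketched as follows, where `gbn_step` is a hypothetical function that processes one observation through the GBN and returns the label of the BN that is active afterwards:

```python
def gbn_accuracy(gbn_step, stream, labels, tau):
    """Accuracy per Sect. 4.3.2: the first tau - 1 observations are not
    processed (the GBN stays in R1); every observation for which the
    active regime matches the true label counts as correct."""
    correct = 0
    active = "R1"                        # the BN for R1 starts active
    for t, (obs, truth) in enumerate(zip(stream, labels)):
        if t >= tau - 1:                 # enough observations to fill the window
            active = gbn_step(obs)
        if active == truth:
            correct += 1
    return correct / len(stream)
```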

5 Regimes in baseball players’ career data

Having defined the problem and introduced the intended model in Sect. 2, as well as introduced a learning procedure for this model in Sect. 4, we now turn our attention back to baseball. Our goal is to gain insight into what a GBN may tell us regarding a baseball player's career, specifically via the regime transition structure. As was mentioned in Sect. 1, the cause of a regime transition may be evident, for which public information is available, such as injuries or trades. However, if the data suggests a regime transition at a point where there is no obvious reason why the player is playing differently, then this may be of value for the managers and coaches. Consider a player that was performing extraordinarily well during a past regime, but has now transitioned back to a more normal level of performance. It may be of great value for the coaches to identify when the transitions in and out of the extraordinary regime happened, and then go back and analyse what may have caused them. Managers and coaches have access to data that is not made public, such as practice, exercise and strategic decisions. From this data they may be able to identify what it was that caused the player to transition to the extraordinary regime, and may be able to arrange for the conditions for this to happen again.

We are therefore interested in discovering whether or not there are regimes in the data, and if there are, how many of them can be attributed to evident events, and how many are not explained by publicly available data. We will provide a summarised view of the analysis of 30 players, followed by a more in-depth view of two of these. We will, however, begin this section with a brief explanation of the game of baseball, followed by a description of the available data and procedure configuration, and then end this section by reporting and discussing our results.

Fig. 5

Stylised representation of a baseball field

5.1 The game of baseball

The official major league baseball rules are documented in a publication consisting of 282 pages, which is revised yearly for clarifications, removals and amendments. Here we will attempt to give only the most basic understanding of how the game of baseball is played; there are many finer details and exceptions that we will not cover. We will on occasion refer to the stylised representation of a baseball field offered in Fig. 5; the figure is an abstract representation and not to scale.

The game of baseball is played between two teams over nine innings, where in each inning the teams are given the opportunity to play both offense and defense. The team playing offense is able to score runs, and the team with the most runs at the end of the nine innings is considered the winner. Extra innings are added if both teams have the same number of runs at the end of regulation play; there are no draws.

The main component of the game is a pitcher (positioned at 1 in the figure) that pitches (i.e. throws) the ball to the catcher (positioned at 2), both are on the defensive team. The batter, part of the offensive team, stands between the pitcher and catcher (at position 3) and tries to hit the ball. The remaining defensive players are positioned strategically on the field. The defensive team needs to get three batters out in order to end their defensive part of the inning.

There are several outcomes that may take place when the ball is pitched; we describe the relevant outcomes here:

  • Assuming that the batter does not swing the bat at all, a ball that is pitched to the catcher can count as either a strike or a ball. If the baseball passes the batter within the strike zone then the pitch is counted as a strike; if it is outside the strike zone then it is counted as a ball. The strike zone is an imaginary volume of space that extends from the batter's knees to the midpoint of the batter's torso. Simply put, it is a zone in which the batter has a fair chance of hitting the baseball. Whether or not the baseball is within the strike zone is decided by an umpire positioned behind the catcher.

  • If the batter swings at the baseball and misses then it counts as a strike, even if the pitch was outside the strike zone.

  • If the strike count goes to three, then the batter is out. If the ball count goes to four, then the batter can freely move to first base (positioned at B1 in the figure); this is known as a walk.

  • If the batter hits the baseball then there are two main outcomes. The ball can either fly into the air and be caught by a defensive player before hitting the ground, at which point the batter is out. Alternatively, the ball can hit the ground, in which case the batter becomes a runner, and needs to reach first base before a player in the defensive team touches first base while holding the baseball. The runner (previously known as the batter) is also allowed to attempt to advance to second or third base (B2 and B3 in the figure), but will be out if a player on the defensive team manages to touch the runner while holding the ball. If tagged then the runner is out; if safe at a base then the play is over and the next player in the offensive team becomes the batter. Each time an offensive player is allowed an attempt to bat it is counted as a plate appearance. If there was an offensive player already on a base when the batter hit the ball, then they can advance further (sometimes they are forced to advance).

  • Each time the offensive team gets a runner all the way around the bases, i.e. running through B1, B2, B3 and back to where the batter is positioned at 3 in the figure, then the offensive team scores a run.

  • If the batter hits the baseball outside the field, e.g. into the stands or even outside the arena itself, then it is called a home run and the batter can freely run through all bases and come back home, thereby scoring a run. Any other offensive players that were already safe at one of the bases can also run home, thus scoring a run for each player that comes home.

  • There are several other reasons why a batter may freely move to first base, for instance if the pitcher hits the batter with the baseball or the defensive team makes a rule error.

There is a myriad of statistics developed to summarise a player's performance. We will use a statistic developed by Bill James, known as the somewhat cumbersomely named on-base plus slugging (OPS) statistic, which is calculated by Eq. 15. When a batter hits the ball and makes it to first base, it counts as a single; if the batter makes it to second base it is called a double; a triple represents making it to third base; and home runs are home runs. As is evident from the calculation of slugging (SLG) there is a weighting on how much the outcomes are worth, and the linear combination is divided by the number of at-bats. While the number of plate appearances does not equate exactly to the number of at-bats, it can be thought of as the same for the purposes of this paper. If a player hits either a single, double, triple or home run it counts as a hit, which is used in the calculation of on-base percentage (OBP). A walk is as described earlier, and hit by pitcher happens when the pitcher hits the batter with the baseball. A sacrifice fly is a strategic option where the batter on purpose hits the ball far and high, making it easy for the defensive team to catch it before it hits the ground, thus giving them an out, but an offensive player safe on a base may make it back to score a run (thus the batter is sacrificed). The OPS statistic aims to represent an offensive player's skill, but does not take into account the player's running abilities (fast players are not credited with an increased OPS). Bill James also suggested a discretisation of OPS, presented in Table 1.

Apart from OPS we will also make use of the less complex runs batted in (RBI) statistic, which is a count of how many runs a batter is responsible for, e.g. if the offensive team has one runner safe on third base and the batter hits the ball allowing the runner to make it back home, then it counts as a run for the offensive team and an RBI for the batter.

$$\begin{aligned} \begin{aligned} SLG&= \frac{single + 2 * double + 3 * triple + 4 * home~run}{at\text {-}bat} \\ OBP&= \frac{hit + walk + hit~by~pitcher}{at\text {-}bat + walk + sacrifice~fly +hit~by~pitcher} \\ OPS&= OBP + SLG \end{aligned} \end{aligned}$$
(15)
Table 1 Discretisation of OPS statistics
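As an illustration, Eq. 15 translates directly into code (treating at-bats and plate appearances as interchangeable, as noted above; the counts in the example are hypothetical season totals):

```python
def ops(single, double, triple, home_run, walk, hit_by_pitch, sac_fly, at_bats):
    """On-base plus slugging per Eq. 15."""
    hit = single + double + triple + home_run
    slg = (single + 2 * double + 3 * triple + 4 * home_run) / at_bats
    obp = (hit + walk + hit_by_pitch) / (at_bats + walk + sac_fly + hit_by_pitch)
    return obp + slg
```

For example, a batter with 50 singles, 20 doubles, 5 triples, 10 home runs, 30 walks, 5 hit-by-pitches and 5 sacrifice flies over 300 at-bats has SLG \(\approx 0.483\), OBP \(\approx 0.353\) and OPS \(\approx 0.836\).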

Following the above definitions, we here present the variables that we extracted for our purposes:

  • Outcome—The outcome of the event, with states: single, double, triple, home run, walk, out or reaching first base for another reason.

  • OPS—A 30 day rolling window calculation of OPS, discretised according to Table 1.

  • RBI—A 30 day rolling window calculation of RBI, discretised into three equal width bins.

  • B1—Whether or not there is a runner on first base (true or false).

  • B2—Whether or not there is a runner on second base (true or false).

  • B3—Whether or not there is a runner on third base (true or false).

  • Home—Whether or not the offensive team is the home team (true or false).

  • Arm—Whether or not the pitcher throws with the left or right arm (left or right).

  • Outs—The number of outs in the inning (0, 1 or 2).

  • Balls—The ball count (0, 1, 2 or 3).

  • Strikes—The strike count (0, 1 or 2).

5.2 Datasets

Retrosheet is an organisation founded in 1989 (www.retrosheet.org) that digitally stores events of major league baseball games. Each event describes the outcome of a plate appearance, and the conditions under which the event happened. There are over 8900 players in the Retrosheet database, and a complete analysis of all of them was not within the aims of this paper. Instead we defined a subpopulation of all these players, and then sampled from this subpopulation. We filtered the event data in such a way that we were left with players that had their major league debut during the 2005 season or later, and with the criterion that they had to have at least 2000 event data entries. The number of events per player in this subpopulation ranged from 2001 to 6614. This subpopulation consisted of 150 players, from which we uniformly sampled 30 players.

Since we do not know exactly when regime transitions have occurred in the career data of the 30 sampled players, we cannot use this data to show how well the proposed learning algorithm identifies the location of regime transitions. Therefore we also created synthetic datasets representing fictional careers, for which we knew when the transitions occurred. We created a BN, using the variables introduced in Sect. 5.1, and then manipulated this BN slightly to create another BN, representing a new regime. We continued this manipulation until we had five different BNs, representing five different regimes in a fictional baseball player's career. We then arranged these BNs into three different regime transition structures, and sampled data from these structures. For each structure, one sample was used as learning data, and 100 samples were held out as testing data. The testing data was used to estimate the accuracy of the learnt GBN, i.e. measuring whether or not the GBN had the correct BN activated given the testing data, as described in Sect. 4.3.2. For full details of the BNs and the transition structures, please see "Appendix B".

5.3 Method

During MH we used \(k = 8\) splits; thus we do not assume any knowledge of how many splits there truly are in the data, only that there are at most eight. We ran 100,000 iterations, discarding the first 50,000 samples as burn-in. We used a greedy thick-thinning algorithm to learn the BN structures (Heckerman 1995), with the log marginal likelihood as the score to improve.
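A schematic version of such an MH sampler over split positions is given below. The `log_score` function (e.g. the log marginal likelihood of the data under the induced segmentation) and the symmetric single-split-move proposal are simplifying assumptions for illustration, not the paper's exact kernel.

```python
# Schematic MH sampler over k split positions with burn-in; the score
# function and proposal mechanism are simplified assumptions.
import math
import random

def mh_splits(n_events, k, log_score, iters=100_000, burn_in=50_000, seed=0):
    rng = random.Random(seed)
    state = sorted(rng.sample(range(1, n_events), k))  # k candidate splits
    current = log_score(state)
    kept = []
    for it in range(iters):
        # propose moving one randomly chosen split to a new position
        proposal = sorted((set(state) - {rng.choice(state)})
                          | {rng.randrange(1, n_events)})
        if len(proposal) != k:      # collided with an existing split
            continue
        new = log_score(proposal)
        # Metropolis acceptance (symmetric proposal)
        if math.log(rng.random()) < new - current:
            state, current = proposal, new
        if it >= burn_in:
            kept.append(tuple(state))
    return kept
```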

We ran 250 iterations of Bayesian optimisation using the radial basis kernel with \(c = 2.0\), and the upper confidence bound acquisition function with \(\eta = 5\). The objective function was the average accuracy over 1000 resamples.
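A minimal sketch of the acquisition step follows. The Gaussian-process posterior (`mean`, `std`) is assumed given, and the RBF parametrisation with \(c\) and the UCB form with \(\eta\) are illustrative assumptions matching the constants stated above.

```python
# Sketch of RBF kernel and UCB acquisition; the GP posterior is assumed
# to be supplied, and the parametrisations are illustrative assumptions.
import numpy as np

def rbf_kernel(x, y, c=2.0):
    """Radial basis kernel between candidate (tau, theta) settings."""
    d2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.exp(-d2 / c)

def next_candidate(candidates, mean, std, eta=5.0):
    """Pick the candidate maximising the upper confidence bound."""
    ucb = np.asarray(mean) + eta * np.asarray(std)
    return candidates[int(np.argmax(ucb))]
```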

5.3.1 Synthetic data

For the synthetic data we used the generated learning datasets to learn a GBN for each transition structure, and then used the 100 testing datasets to compute the accuracy of the learnt GBNs, following the description in Sect. 4.3.2. We report the locations of the identified regime transitions and compare them with the true locations, as well as the average accuracy of the GBNs over the test datasets. For comparison, we also learnt a standard HMM on each of the training datasets, and calculated the state probabilities given all data. Since we knew how many regimes there were in the synthetic datasets, we fixed the number of states during the learning of the HMM, rather than employing a cross-validation scheme to also learn this number. We emphasise, however, that no such advantage was given to the learning of the GBNs.
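Abstracting away how the active regime is obtained from the GBN, the accuracy measure we use can be sketched as the fraction of test events for which the GBN's active BN matches the regime that truly generated the event:

```python
# Hedged sketch of the accuracy measure from Sect. 4.3.2: the fraction
# of test events where the GBN's active regime matches the true one.
def accuracy(true_regimes, active_regimes):
    hits = sum(t == a for t, a in zip(true_regimes, active_regimes))
    return hits / len(true_regimes)
```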

5.3.2 Real world data

In order to count the number of regime transitions for which there is an evident cause, we created the following three criteria. If a regime transition happened in conjunction with an event satisfying any of the criteria, then the cause was considered evident:

  1. An injury to the player—We expect a player to play differently for a period after an injury; thus if a regime transition coincides with an injury we count it as a transition where the cause is known\(^{5}\).

  2. The player being traded—When a player moves to a new team there is a good chance that the player needs to learn a new system and new coaching, and may need time to adjust. While the trade itself may not be the cause, a transition in conjunction with a trade is to some degree explained by the trade\(^{5}\).

  3. The beginning of a new season—A new season usually brings new players to a team, sometimes new managers and coaches, new strategies are put in place, etc. A regime transition early in the season is therefore considered to be explained by these changes.

From these criteria it should be understood that the transitions not counted towards the evident causes are those that occur mid-season, without a major injury or trade. The reason for such a transition may be clear to managers and coaches, since they have access to data that the public does not; however, they may also be unaware of these transitions, and identifying them allows for analysis of what may have caused them. It is precisely here that our procedure adds value to the decision making of managers and coaches, discovering transitions that may have been unintentional or the result of multiple events and decisions.
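The three criteria above can be expressed as a simple classifier. The 14-day coincidence window and the event lists are assumptions made for this sketch, not thresholds taken from our procedure.

```python
# Illustrative classifier for the three criteria; the window size and
# event lists are assumptions, not values used in the paper.
from datetime import date

def evident_cause(transition, injuries, trades, season_start,
                  window_days=14):
    """True if the transition coincides (within `window_days`) with an
    injury, a trade, or the beginning of the season."""
    near = lambda d: abs((transition - d).days) <= window_days
    return (any(near(d) for d in injuries)
            or any(near(d) for d in trades)
            or near(season_start))
```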

5.4 Results and discussion

We will divide this section into four parts. In the first part we will account for the performance of the GBN learning using the synthetic datasets. In the second part we will give a summary view of the GBNs that were learnt for the 30 sampled players, and then immediately discuss our findings in relation to this sample. Thereafter we will look at two players in more detail. This detailed analysis allows us to show the GBNs that were learnt, as well as discuss the regime transitions which we can explain and which we cannot. Finally, we will end this section with a discussion regarding the representational power of GBNs versus BNs.

Table 2 Location of identified regime transitions during GBN learning using synthetic data
Table 3 Accuracy of learnt GBNs

5.4.1 Synthetic data

In Table 2 we present the true regime transition locations in the synthetic data, the locations identified using the proposed learning algorithm, and the difference between the two. The three structures are the ones presented in “Appendix B”. As is evident from the table, the proposed learning algorithm identifies the true locations well, as the differences are relatively small. The merging procedure of the learning algorithm was also able to correctly identify the regime transition structures in all three cases. The standard HMM that we also learnt using the synthetic data was, however, not as successful; we defer its results to “Appendix C”.

Turning our attention to Table 3, we present the optimised \(\tau \) and \(\theta \) values for each structure, along with the average and minimum accuracy achieved over the 100 test datasets. As is evident, the average accuracy is high. However, since the minimum for Structure 2 was lower, we also counted the number of times the accuracy was below 0.9 and 0.8; from the table we can tell that such occurrences were few.

All in all, these results show that the proposed learning algorithm is capable of correctly identifying regime transition structures, and that the resulting GBN can be used to accurately identify the current regime. This is consistent with a more extensive evaluation on synthetic data that we have completed (Bendtsen 2017).

5.4.2 Summary view

Table 4 Summary view of the GBNs learnt for a sample of baseball players

For each baseball player analysed we present a summary of the learnt GBN in Table 4. In the table we first present the player’s name, followed by the number of nonzero \(\delta \)s identified, i.e. the number of transitions. The table continues with the number of BNs contained in the GBN, and a Boolean value representing whether or not there were reoccurring regimes in the GBN, i.e. whether the GBN contains directed cycles. Next follow the number of regime transitions for which we could find an evident cause, and the number for which we could not. The last two columns are the \(\tau \) and \(\theta \) values determined during optimisation of the gate parameters.

The first observation we make is that there certainly are regimes in the data: using a single BN to represent a player’s career is not preferred for any of the players. What is further interesting is that, for all but two players, the regimes are not structured in a chain, but rather reoccur. While we expect any sportsperson to change the way they play their respective game throughout their career, be it due to increasing skill or age, it is not necessarily obvious why some players revert back to previous regimes. Though highly speculative at this point, transitioning into one regime and then transitioning back may be due to strategic decisions by the manager or coach, or due to injury, since players may play more cautiously while receiving treatment.

There is certainly a preponderance of regime transitions which we could not explain from our superficial investigation. While we do not rule out that publicly available sources of information may help explain the remaining transitions, their number suggests that further investigation is warranted. As we shall see in Sect. 5.4.3, some of these transitions may be highly sought after (e.g. transitioning into a regime where a player is playing extraordinarily well), thus for the managers and coaches it can be essential to look at the decisions that were made at the time of the transition.

Fig. 6 GBN learnt for Nyjer Morgan

The last thing we wish to comment on regarding this summary view is the \(\tau \) and \(\theta \) values. In the synthetic experiments, reported in Sect. 5.4.1, we saw relatively low values for \(\tau \) in all three cases. However, in a previous publication we ran the learning algorithm on a more extensive set of synthetic datasets (Bendtsen 2017). Those experiments showed that the algorithm produced GBNs with high accuracy, and the average \(\tau \) value was 18.2 (with a range of [10, 39]). As is evident from the last two columns in Table 4, the \(\tau \) values for the GBNs learnt in these experiments are mostly within this range. This tells us that, if the GBNs were to be used on new data, for instance the 2016 season, they would not require an unusually large window (\(\tau \)) to detect regime changes, and the required threshold (\(\theta \)) is comparable to values that we have shown to give good accuracy on synthetic data. An unusually large window could imply that the GBN would detect regime changes much later than they occurred, limiting its use as a detection model. So while we do not explore this opportunity further in this paper, the learnt GBN is a model that can be used to detect whether the player has transitioned to any of the identified regimes. We imagine that a coach could learn a GBN using data from previous seasons, identify from this GBN the causes of a favourable regime transition, put these causes in place, and then monitor the player by entering the new data that the player generates into the GBN. If the GBN does transition into the intended regime, then the coach has succeeded.
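The monitoring idea can be sketched as follows: over a window of the last \(\tau\) events, a gate triggers a transition to a candidate regime when its likelihood ratio against the current regime exceeds \(\theta\). The per-event log-likelihood functions below are placeholders standing in for the BNs' actual scores.

```python
# Sketch of using a learnt GBN as a monitoring model; the gating rule
# shown (windowed likelihood-ratio test) is a simplified assumption.
import math

def gate_triggers(window, loglik_current, loglik_candidate, theta):
    """window: the last tau events. Returns True when the candidate
    regime explains the window better by a factor of at least theta."""
    log_ratio = (sum(loglik_candidate(e) for e in window)
                 - sum(loglik_current(e) for e in window))
    return log_ratio >= math.log(theta)
```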

5.4.3 Detailed analysis

In order to gain further understanding of how the regime changes in the players’ data can be used, we picked out two players that serve as good examples of how a manager or coach could carry out further analysis. For these players, Nyjer Morgan and Kendrys Morales, we present the learnt GBNs in Figs. 6 and 8. In Figs. 7 and 9 we present representations of the subsetting of the data, similar to the one given in the example in Fig. 3. In these representations we have marked the regime to which each subset belongs, as well as the OPS statistic and its label according to the discretisation in Table 1 for each particular subset. We have also added the dates on which the transitions occurred.

We will begin our analysis with Nyjer Morgan (Figs. 6, 7). We first remark on the OPS labels given to the regimes. The subsets 2, 4 and 7 all belong to regime \(R_2\), and have the OPS labels below average, poor and below average; thus we conclude that the \(R_2\) regime represents a low performance regime for Nyjer (with respect to OPS). The subsets 1 and 3 belong to \(R_1\) and have been given the labels average and above average, and we will therefore refer to regime \(R_1\) as a high performance regime. Subset 5 is the only subset that belongs to regime \(R_3\), with the label above average, a regime that also represents high performance, but not the same as \(R_1\). Subset 6 is the sole subset for regime \(R_4\), and given the OPS label above average it also seems to correspond to a high level of performance; however, we note that the OPS value of 0.775 in subset 6 is lower than the 0.818 in subset 5.

Fig. 7 Regime subsets with OPS statistic and label for Nyjer Morgan

Fig. 8 GBN learnt for Kendrys Morales

Fig. 9 Regime subsets with OPS statistic and label for Kendrys Morales

Turning our attention to the regime transitions, we offer a narrative that may explain why the transitions happened at these specific dates. Nyjer’s major league career started with the Pittsburgh Pirates, where he was given the starting role in left field at the beginning of the 2009 season (based on a positive second half of 2008). We see a transition from the high performance regime \(R_1\) to the low performance regime \(R_2\) on the 6th of April (the 2009 season started on the 5th of April, but the Pirates’ first game was on the 6th). At the beginning of any season there are many changes to the team and tactics, including, in this case, giving Nyjer the starting role in left field. Perhaps due to Nyjer’s performance, he was traded on the 30th of June 2009 to the Washington Nationals. We see a regime change to the high performance regime \(R_1\) on the 23rd of June, at approximately the same time. The next regime change, on the 2nd of May 2010, is however not as easy to pinpoint. It is another move from \(R_1\) to \(R_2\) (high to low performance), which is not necessarily tied to a new season, trade or injury. However, Nyjer was involved in some headline-worthy mistakes and unsportsmanlike actions during the second half of 2010 that may have affected the way he played. In any case, this is the type of transition where a manager or coach could delve deeper into their data to find the causes. On the 27th of March 2011 Nyjer was traded to the Milwaukee Brewers (this happened during the offseason), and at the beginning of the 2011 season, on the 6th of April 2011, a new regime change occurred (Milwaukee’s first game of the 2011 season was on the 31st of March, but Nyjer did not play full games during the first days of the season).
During this last transition Nyjer moved into \(R_3\), a high performance regime which lasted until the 23rd of July 2011, when Nyjer transitioned to \(R_4\) (also a high performance regime, but, as we noted, with a lower OPS), and then to the low performance regime \(R_2\) on the 19th of August 2011. These last two transitions cannot be directly attributed to a new season, trade or injury, and are examples of transitions needing further investigation by managers and coaches. What can be said is that the regime transitions through \(R_3\), \(R_4\) and finally \(R_2\) seem to indicate a dwindling performance for Nyjer, with respect to the OPS values. He would later take contracts with other major league teams, as well as stints in both Japan and Korea. However, Nyjer has not returned to his former high performance regimes \(R_1\) or \(R_3\).

The second player we will analyse is Kendrys Morales, for whom we display the GBN and subset representation in Figs. 8 and 9. For Kendrys we could just as well assign a high performance label to both \(R_1\) and \(R_2\), as they each have one subset, labelled average and above average respectively. The narrative we will soon offer may clear up the distinction; in the meantime it serves as a reminder that summary statistics such as OPS do not tell the full story. Regime \(R_3\) represents exceptional performance, as its subset has been given the label great. The last regime, \(R_4\), will be referred to as a low performance regime due to the OPS label poor.

As we did for Nyjer, we will offer an explanation for the transitions we found in Kendrys’ data. At the beginning of the 2009 season Kendrys was promoted to starting first baseman\(^{6}\). He had a fantastic season, playing well in the beginning and even better in the second half of the season, and was given the Player of the Month award in August 2009. The first transition, on the 14th of May from \(R_1\) to \(R_2\), came early in the season (the 2009 season started on the 5th of April), and while the transition is not exactly on the first day of the season, it is reasonable to assume that Kendrys was adjusting to his new position. There is no evident reason for the transition from \(R_2\) to \(R_3\) on the 17th of June; however, it is clear that Kendrys was playing better, thus there is certainly reason for further investigation into why there was a transition on this date. The transition from \(R_3\) to \(R_2\) happened on the 7th of September 2009, again a transition for which no evident reason can be given. However, looking at the OPS values rather than the labels, we see that the transitions \(R_1 \rightarrow R_2 \rightarrow R_3 \rightarrow R_2\) coincide with an increase, peak and decline of the OPS value. A question, to which we have no direct answer, is whether this is simply due to Kendrys going on a hot streak and then cooling off, or whether actions taken at these times caused the transitions through these regimes. If the latter, then it would be extremely valuable for managers and coaches to investigate these regime changes in order to find the causes, so that they may be put in place again. In an unfortunate turn of events, Kendrys was injured on the 29th of May 2010 (coinciding with the transition from \(R_2\) to \(R_1\)). He was sidelined for the rest of the 2010 season and the entire 2011 season.
In 2012 Kendrys came back, and it seems that he was able to play at a reasonable level, as before (\(R_1\)), but he did not return to his high and extraordinary performance regimes \(R_2\) and \(R_3\). Kendrys was traded to the Seattle Mariners prior to the 2013 season; however, the transition from \(R_1\) to \(R_4\), the low performance regime, did not come until mid-season. So while Kendrys was traded, the data does not suggest that this change caused any disruption in Kendrys’ performance, but rather that the cause was something that happened later.

5.4.4 GBNs versus BNs

From a modelling perspective, regardless of application domain, it is interesting to note that the learnt BNs do in fact represent their respective data subsets better than the other BNs do. This is shown in Tables 5 and 6, where the log marginal likelihood for each combination of subset and BN is presented, as well as for a single BN that was learnt using all the data (using data for Nyjer Morgan and Kendrys Morales). Note that these are log values, so differences that look small may in fact correspond to several orders of magnitude.
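To make the remark about magnitudes concrete: a difference of 10 in log marginal likelihood corresponds to a ratio of roughly 22,000 between the marginal likelihoods themselves. The values below are made up for illustration.

```python
# Numeric illustration: a seemingly small log-difference is a very
# large ratio between the marginal likelihoods themselves.
import math

def ml_ratio(logml_a, logml_b):
    """Ratio of marginal likelihoods given their logs."""
    return math.exp(logml_a - logml_b)
```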

Table 5 Log marginal likelihood of data subsets given individual BNs (Nyjer Morgan)
Table 6 Log marginal likelihood of data subsets given individual BNs (Kendrys Morales)

It is clear then that if we use a single BN to represent our data, we will not be able to capture and take advantage of the structural changes in the data. However, while treating a weakness of BNs we are also exposing one of GBNs: looking at Figs. 6 and 8, it is much harder to grasp what the underlying structure of the data is. What we gain in expressive power, we lose in clarity. On the positive side, if one takes the time to analyse the BN structures in the GBNs, valuable pieces of information may be extracted. For instance, in Fig. 8 we see that only in \(R_2\) does the pitcher’s throwing arm stand in relation to Kendrys playing at home and his recent OPS value. This may be due to the opposing teams’ coaches choosing a favourable matchup (it is generally accepted that a right-handed pitcher has an advantage over a right-handed batter, and the same goes for left on left; however, the batter has the advantage when they are opposite). Adding to the complexity is the fact that Kendrys himself is a switch hitter, meaning that he will change his batting hand to always have the advantage. Perhaps this is why the relationships are only brief, the opposing coaches giving up on this tactic. Whatever the reason for these brief relationships, they would vanish if one learnt a single BN using all the data, and this type of extraction would not be possible.

6 Conclusions

All sports seem to demand a certain degree of data collection; it would be hard to imagine any type of competitive market for trading players in any sport without some statistical description of the players. For baseball, however, it seems to have been taken to a level of healthy obsession. Retrosheet is certainly a remarkable endeavour, not only attempting, but doing very well in achieving, its goal of computerising every single plate appearance of every single major league game since 1871. We have only picked out a handful of the variables that are represented in the database. It is however possible that this obsession also creates a certain bias, as those looking at the data may become falsely assured that usable information is extractable simply due to the sheer amount of data available. Nonetheless, the penetration of sabermetrics into major league teams is undeniable.

In this paper we have investigated the use of GBNs for modelling baseball players’ careers, specifically in order to identify regimes in the data. It is clear from our experiments that such regimes do exist, and that these regimes can reoccur. We have also shown how a deeper analysis of a GBN allows us to label the regimes depending on the player’s performance, thereby offering managers and coaches a tool that would allow them to further investigate the causes of the regime changes. Since we have seen that regimes reoccur, it may be possible to put these causes in place again, attempting to force a regime transition. Although only mentioned briefly (see Sect. 5.4.2), we also explained how the learnt GBNs can be used on future data in order to monitor whether or not the decisions of the coaches are transitioning their players into favourable regimes.

In a more general context, when collecting data over time, we must keep in mind that regime transitions can occur. In ecology, biology, finance, etc., regime changes are common (e.g. periods of exuberance and panic in financial markets). We have seen in Sect. 5.4.4 that learning a single model when regimes do occur results in a model that is worse at explaining the data than the individual models for each regime are. It happens to be the case that we are analysing career data of baseball players in the current setting, but regardless of the context it is important to acknowledge that regime transitions may occur.

Moving forward there are a few things that we would like to address. From a modelling perspective we are interested in collapsing the entire GBN into something that is easier to comprehend, for instance by representing all BNs at the same time using a single chain graph (Sonntag and Peña 2015). There would naturally be a loss of granularity; however, a chain graph could potentially give a brief overview of a full GBN. We are also interested in merging our work with the ideas of Bill James and Nate Silver. It is possible that we could cluster players based on their GBN structures, and then attempt to predict which regime structure a new player will follow (similar to clustering ageing curves). Finally, when we collect data over time, we should be aware that structural changes may occur; thus for any player of any sport we could potentially find regimes, and draw the same conclusions as we have for the baseball players in this study.

Fig. 10 Pseudocode for identification of nonzero \(\delta \)s

Fig. 11 Pseudocode for regime transition structure learning