1 Introduction

1.1 Background

Information describing the content of data is often called “metadata”. In the context of video data, metadata refers to data that is used to facilitate content-based retrieval [1]. For example, the metadata of video data for a drama may include captions (subtitles), bibliographic information about the actors, and so on. The metadata of video data for a sports event may consist of information about the players, important event scenes, and so on. Over the past decade, the importance of such metadata has been increasing in the broadcasting community because of its potential applications to various broadcasting services, program production systems, and so forth (e.g., [2, 3]). Against this backdrop, there has been an increasing demand for effective metadata creation, particularly for unscripted programs [4, 5].

A successful approach to creating metadata for unscripted sports games is to use a framework with an event detection-based system [6, 7]. Figure 1 shows an example of such a system, based on Ref. [6].

Fig. 1

An example of an automatic metadata creation system based on [6]. In this system, module (i) estimates the players’ positions from the input video data sets filmed by several cameras. In module (ii), a feature vector sequence is extracted from the estimated players’ positions. Module (iii) detects target events from the extracted sequence. The metadata is created in module (iv) using the detected target events

In such event detection-based systems, the event detection method is required to predict when a target event has occurred from given time-series data, which is typically a high-dimensional, probabilistically uncertain data sequence extracted from videos.

1.2 Purpose

This paper attempts to perform soccer game event detection for constructing a metadata creation system, using the hierarchical Bayesian Hidden Markov Model (HMM) [8, 9], for which performance evaluation had not previously been conducted. “Hierarchical” in the present context refers to a learning scheme for the hyperparameters behind the HMM parameters. Note that hyperparameters often play an important role in a Bayesian framework, in that the HMM parameter distribution at a particular set of hyperparameter values can be largely different from that at another set of hyperparameter values. Such a hierarchical structure therefore endows an HMM with more flexibility; however, adjusting the hyperparameters automatically from the data becomes non-trivial. Recall that in a Bayesian HMM with discrete states and discrete outputs, the likelihood functions are generally defined through multinomial distributions, so that the prior distributions are often Dirichlet because of the natural conjugacy. Care needs to be exercised, however, in designing the hyperparameter distributions behind the Dirichlet. This study assumes a particular form of the hyperparameter prior distributions such that irrelevant features are automatically suppressed. In this way, the predictive capabilities can be improved. Training of the HMM and prediction of event sequences are implemented by the MCMC (Markov Chain Monte Carlo) method. The proposed method will be tested against 40-dimensional data sequences extracted from video data sets of professional soccer games. Performance will be evaluated with respect to three theoretical measures.

1.3 Related work

Since sports game data are generally sequential, and since the HMM is a general model for capturing properties of sequential data, many of the papers on sports event detection are based on HMMs [10–17]. Some of them use the maximum likelihood method, where the Baum–Welch algorithm [18] is used for parameter estimation [19]. Many others use a Bayesian framework.

There have been a variety of studies on HMMs with Bayesian learning (Bayesian HMMs), as well as their implementations. For instance, Ref. [20] describes maximum a posteriori estimation for Bayesian HMMs. In Ref. [21], a variational Bayesian method for HMMs is described. A Bayesian HMM with an MCMC method is discussed in Ref. [22]. These standard Bayesian HMMs (often with several modifications of the observation densities) have been successfully applied to many applications, including speech [20] and event detection [23].

The hierarchical Bayesian HMM to be described in the next section is a generalized version of such standard Bayesian HMMs, in which hyperparameter learning is performed behind the target HMM parameters.

2 Hierarchical Bayesian HMM for event detection

This section describes the hierarchical Bayesian HMM designed for the soccer game event detection problem. There are two points that we would like to address in this study.

  1. The proposed model consists of discrete states and discrete outputs, so the natural prior distributions within a Bayesian framework are Dirichlet, since they are conjugate with respect to the multinomial distributions (a minimal illustration of this conjugacy is sketched after this list). Recall that there are hyperparameters behind Dirichlet distributions which control the properties of the distributions. Therefore, the prior distributions constitute a family of distributions parameterized by hyperparameters, instead of a single distribution, which gives rise to more flexibility in the model. The proposed method attempts to learn the hyperparameters in addition to the target HMM parameters. The prediction capabilities of Bayesian methods with hyperparameter learning often outperform those without it.

  2. We propose in this paper a particular form of the hyperparameters for the HMM parameter prior distributions. More precisely, the hyperparameters are a product of two hyperparameters. One is the commonality hyperparameter, which describes the degree of commonality of the output emission probabilities among different hidden states. The other is the common shape hyperparameter, which characterizes the average shape of the output emission probabilities among different hidden states. This prior structure enables us to suppress those output components that are ineffective for predictions. This is one of the desired properties when the output dimension is high; in the experiments reported below, the output dimension is 40. Recall that there is an underlying topology behind HMMs. This paper assumes an “ergodic” topology, in which every hidden state can transition to any other state. Other topologies are also possible.
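As a minimal illustration of the Dirichlet–multinomial conjugacy mentioned in point 1 (the values below are hypothetical and not part of the proposed model):

```python
# Minimal illustration of Dirichlet-multinomial conjugacy with hypothetical
# counts: a Dir(alpha) prior on a multinomial parameter vector p, combined
# with observed counts n, gives the posterior Dir(alpha + n) in closed form.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # symmetric (non-informative) prior
counts = np.array([7, 2, 1])          # hypothetical symbol counts

posterior_alpha = alpha + counts      # conjugate update: Dir(alpha + n)
print(posterior_alpha / posterior_alpha.sum())   # posterior mean of p
```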

2.1 Data and hidden variables

Associated with an HMM, there are two data sequences. One is the observation data sequence \(y := (o_1, \ldots, o_T)\), whereas the other is the hidden variable (hidden state) sequence \(z := (q_1, \ldots, q_T)\). Here, \(o_t\) and \(q_t\) stand for the time-series data and the hidden variable at time t, and T is the length of the sequence. The hidden variable \(q_t\) is a one-dimensional variable that can take a finite number of values among N states (i.e., \(q_t \in \{1, \ldots, N\}\)).

In event detection problems, the data \(o_t\) can consist of two types of variables, \(e_t\) and \(f_t\). The variable \(e_t\) is a one-dimensional event variable at time t, and the variable \(f_t := (f_{1,t}, \ldots, f_{L,t})\) is an L-dimensional feature variable at time t. Here, \(M_e\) is the number of target events (\(e_t \in \{1, \ldots, M_e\}\)), and L is the dimension of the feature variable \(f_t\). The variable \(f_{l,t}\) represents the lth component of \(f_t\), and \(M_{f_l}\) is the number of symbols for \(f_{l,t}\) (i.e., \(f_{l,t} \in \{1, \ldots, M_{f_l}\}\)). In the experiment described later, the event variable \(e_t\) represents the occurrences of kick offs, corner kicks, and so on. The feature variable \(f_t\) consists of values extracted from the players’ positions at time t.

2.2 Observation model

Given the whole parameter set \(\theta :=(a,b,c)\) of an HMM, the probability of the data y can be written as

$$ P(y|\theta) :=\sum_z P(y|z,b)P(z|a,c), $$
(1)
$$ P(y|z,b) :=\prod_{t=1}^T P(o_{t} |q_t,b), $$
(2)
$$ P(z|a,c) :=P(q_1|c)\prod_{t=2}^T P(q_t|q_{t-1},a), $$
(3)

where a is the hidden variable transition probability, c is the initial hidden variable probability, and b is the emission probability of the data \(o_t\). In event detection problems, the emission probability of the data \(o_t = (e_t, f_t)\) in (2) is described by the following equations:

$$ P(o_{t} |q_t,b) := P(e_t|q_t,b_e)P(f_t|q_t,b_f), $$
(4)
$$ P(f_t|q_t,b_f) :=\prod_{l=1}^L P(f_{l,t} |q_t,b_{f_l}), $$
(5)

where \(b := (b_f, b_e)\) and \(b_f :=(b_{f_{1}},\ldots,b_{f_{L}}).\)

Let the multinomial distribution for one data item be defined as \({\mathcal M}ulti (x;p) :=\prod\nolimits_{i=1}^{K} p_i^{I(x=i)},\) where \(\sum\nolimits_{i=1}^{K} p_i = 1\), \(p_i \geq 0\), and I(·) is an indicator function that returns 0 for false and 1 for true. Using this definition, the event emission probability \(P(e_t|q_t, b_e)\) in (4) is written as

$$ P(e_t|q_t,b_{e}) :={\mathcal M}ulti(e_t; b_{e,q_t}), $$
(6)

where \(b_e := (b_{e,1}, \ldots, b_{e,N})\) and \(b_{e,i} :=(b_{e,i1},\ldots, b_{e,iM_{e}}).\) The emission probability of the lth component \(f_{l,t}\) in (5) is

$$ P(f_{l,t}|q_t,b_{f_l}) :={\mathcal M}ulti(f_{l,t}; b_{f_l,q_t}), $$
(7)

where \(b_{f_l} :=(b_{f_l,1},\ldots,b_{f_l,N})\) and \(b_{f_l,i} :=(b_{f_l,i1},\ldots, b_{f_l,iM_{f_l}}).\) The hidden variable transition probability \(P(q_t|q_{t-1}, a)\) and the initial hidden variable probability \(P(q_1|c)\) in (3) can be written as

$$ P(q_t|q_{t-1} , a) :={\mathcal M}ulti(q_t; a_{q_{t-1}}), $$
(8)
$$ P(q_{1}|c) :={\mathcal M}ulti(q_1;c), $$
(9)

where \(a := (a_1, \ldots, a_N)\), \(a_i := (a_{i1}, \ldots, a_{iN})\), and \(c := (c_1, \ldots, c_N)\).
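To make the observation model (1)–(9) concrete, the following sketch samples one sequence from such an HMM with joint event/feature emissions. All sizes, the seed, and the parameter draws are illustrative assumptions of ours, not the trained model; for simplicity, all feature components share a common number of symbols.

```python
# Sketch of the generative process (1)-(9): a hidden Markov chain emits, at
# each time step, an event symbol e_t and L feature symbols f_{l,t} that are
# conditionally independent given the state q_t. Symbols are 0-indexed here
# (the paper uses 1-indexed symbols).
import numpy as np

rng = np.random.default_rng(0)
N, M_e, L, M_f, T = 3, 5, 4, 10, 20   # states, events, components, symbols, length

c = rng.dirichlet(np.ones(N))                    # initial state probs, eq. (9)
a = rng.dirichlet(np.ones(N), size=N)            # transition matrix, eq. (8)
b_e = rng.dirichlet(np.ones(M_e), size=N)        # event emissions, eq. (6)
b_f = rng.dirichlet(np.ones(M_f), size=(L, N))   # feature emissions, eq. (7)

q = rng.choice(N, p=c)
events, features = [], []
for t in range(T):
    events.append(rng.choice(M_e, p=b_e[q]))                           # eq. (4)
    features.append([rng.choice(M_f, p=b_f[l, q]) for l in range(L)])  # eq. (5)
    q = rng.choice(N, p=a[q])                                          # eq. (3)
```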

2.3 Prior distribution for parameter set

Within a Bayesian framework, not only the observation model but also the prior distribution of the parameter set is defined. In Bayesian HMMs, parameter independence in the prior distribution is assumed for simplicity of implementation; that is:

$$ P(\theta|\phi)=P(a|\alpha) P(b|\beta) P(c|\gamma), $$
(10)
$$ P(a|\alpha) :=\prod_{i=1}^N P(a_i|\alpha_i), $$
(11)
$$ P(b|\beta) :=P(b_e|\beta_e)P(b_f|\beta_f), $$
(12)
$$ P(b_f|\beta_f) :=\prod^L_{l=1}\prod^N_{i=1}P(b_{f_l,i}|\beta_{f_l,i}), $$
(13)
$$ P(b_e|\beta_e) :=\prod^N_{i=1}P(b_{e,i}|\beta_{e,i}), $$
(14)

where \(\phi :=(\alpha, \beta, \gamma)\), \(\alpha :=(\alpha_{1},\ldots,\alpha_{N})\), \(\beta :=(\beta_f, \beta_e)\), \(\beta_f :=(\beta_{f_1},\ldots,\beta_{f_L})\), \(\beta_{f_l} :=(\beta_{f_l,1},\ldots,\beta_{f_l,N})\), and \(\beta_e :=(\beta_{e,1},\ldots,\beta_{e,N}).\) Using the naturally conjugate Dirichlet prior distribution, the prior distributions of \(a_i, b_{f_l,i}, b_{e,i},\) and c in (10)–(14) are defined as follows:

$$ P(a_i|\alpha_i) :={\mathcal D}ir(a_i;\alpha_i), $$
(15)
$$ P(b_{e,i}|\beta_{e,i}) :={\mathcal D}ir(b_{e,i};\beta_{e,i}), $$
(16)
$$ P(b_{f_l,i}|\beta_{f_l,i}) :={\mathcal D}ir(b_{f_l,i};\beta_{f_l,i}), $$
(17)
$$ P(c|\gamma) :={\mathcal D}ir(c;\gamma), $$
(18)

where \(\alpha_i :=(\alpha_{i1},\ldots,\alpha_{iN}), \beta_{e,i} :=(\beta_{e,i1},\ldots,\beta_{e,iM_e}), \beta_{f_l,i} :=(\beta_{f_l,i1},\ldots,\beta_{f_l,iM_{f_l}}),\) and \(\gamma :=(\gamma_{1},\ldots,\gamma_{N}).\)

2.4 Settings for hyperparameter set

In this paper, all components of the hyperparameter vectors except for \({\beta_{f_l,i}}\) are fixed at 1.0, as in several conventional Bayesian HMMs (e.g., [7]). On the other hand, to control negative influences from components that have low dependency on the states (redundant components), this study considers a reparameterization of the hyperparameter vectors \(\beta_{f_l,i}\), together with a prior distribution over the reparameterized hyperparameters, described below.

2.4.1 Reparameterization of \(\beta_{f_l,i}\)

In the hierarchical Bayesian HMM [8, 9], the reparameterization of the hyperparameter vector \(\beta_{f_l,i}\) is described by the following equation:

$$ \beta_{f_l,1}=\beta_{f_l,2}=\cdots=\beta_{f_l,N} :=\lambda_{f_l} \eta_{f_l}. $$
(19)

The non-negative variable \(\lambda_{f_l}\in R\) is the commonality hyperparameter, which describes the degree of commonality of the emission probabilities \(P(f_{l,t} |q_t,b_{f_l})\) among different hidden states. The hyperparameter \(\eta_{f_l} :=(\eta_{f_l,1},\ldots,\eta_{f_l,M_{f_l}}) \in R^{M_{f_l}}\) is the common shape hyperparameter, which defines the average shape of the emission probabilities \(P(f_{l,t}|q_t,b_{f_l})\) over different hidden states, where \(\eta_{f_l,i}>0\) and \(\sum\nolimits_{i=1}^{M_{f_l}} \eta_{f_l,i}=1.\)

Here, let us take a closer look at how the commonality hyperparameter \(\lambda_{f_l}\) influences the emission probability parameter \(b_{f_l,i}\). Figure 2 shows the shapes of the prior distribution (13) for several settings of \(\lambda_{f_l}\). First, consider a case where the commonality hyperparameter \(\lambda_{f_l}\) is large, so that the probability mass is concentrated in a relatively narrow region, as shown in Fig. 2c, d. This corresponds to a relatively small amount of diversity among the parameter vectors \(b_{f_l,1}, \ldots, b_{f_l,N}\), i.e., \(b_{f_l,1}\approx b_{f_l,2} \approx \cdots \approx b_{f_l,N} \approx \eta_{f_l}\). Therefore, low dependency of \(f_{l,t}\) on the states is expected. If a component of the given data has little effect on the states, then that particular component may not carry useful information for prediction purposes, so one wants to suppress such a component. On the other hand, when the commonality hyperparameter \(\lambda_{f_l}\) is not so large (as shown in Fig. 2a, b), there is more diversity of \(b_{f_l,i}\) among the states, and hence dependency of \(f_{l,t}\) on the states is expected. By learning such reparameterized hyperparameters, the proposed hierarchical Bayesian HMM enables us to reduce negative influences from the redundant components.
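This concentration effect can be checked numerically. The sketch below (with illustrative values of \(\lambda_{f_l}\) and the \(\eta_{f_l}\) of Fig. 2) draws state-wise emission vectors from \({\mathcal D}ir(\lambda_{f_l}\eta_{f_l})\) and measures their spread around \(\eta_{f_l}\):

```python
# Effect of the commonality hyperparameter: draws of the emission vectors
# b_{f_l,i} from Dir(lambda * eta) concentrate around the common shape eta
# as lambda grows, i.e., the feature depends less on the hidden state.
# Values are illustrative; eta is the one used in Fig. 2.
import numpy as np

rng = np.random.default_rng(1)
eta = np.array([0.3, 0.3, 0.4])
for lam in (2.0, 200.0):                      # small vs. large commonality
    b = rng.dirichlet(lam * eta, size=10)     # one draw per hidden state
    print(f"lambda={lam:6.1f}  mean |b - eta| = {np.abs(b - eta).mean():.3f}")
# The deviation printed for lambda=200 is much smaller than for lambda=2.
```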

Fig. 2

The Dirichlet prior distribution for \(b_{f_l,i}\) with several settings of the commonality hyperparameter \(\lambda_{f_l}\). In this figure, the parameters \(b_{f_l,i}\) are three-dimensional variables, \(b_{f_l,i} = (b_{f_l,i1}, b_{f_l,i2}, b_{f_l,i3})\), and the common shape hyperparameter \(\eta_{f_l}\) is fixed at \(\eta_{f_l} = (0.3, 0.3, 0.4)\). Since the component \(b_{f_l,i3}\) can be determined automatically using \(b_{f_l,i3} = 1 - b_{f_l,i1} - b_{f_l,i2}\), the variable \(b_{f_l,i3}\) is omitted in this figure. It should be observed that the parameter \(b_{f_l,i}\) concentrates more around the average \(\eta_{f_l}\) when the commonality hyperparameter \(\lambda_{f_l}\) is larger

2.4.2 Prior distributions for \(\lambda_{f_l}\) and \(\eta_{f_l}\)

This section describes the prior distributions of the hyperparameters \(\lambda_{f_l}\) and \(\eta_{f_l}\), for learning these hyperparameters in the Bayesian framework described later. There is no well-known naturally conjugate prior distribution for the commonality hyperparameter \(\lambda_{f_l}\). Assuming that \(\lambda_{f_l} \in (0, \infty)\), there are several possible prior distributions for \(\lambda_{f_l}\). Among them, we assume the gamma distribution:

$$ P(\lambda_{f_l}) :={\mathcal{G}}amma(\lambda_{f_l}; \kappa, \omega), $$
(20)

where \(\mathcal{G}amma(\cdot;\kappa, \omega)\) stands for the gamma distribution with non-negative shape parameter κ and non-negative scale parameter ω. In the experiments of Sect. 4, these hyperparameters are set to κ = 0.5 and ω = 100, which allows \(\lambda_{f_l}\) to be distributed widely over its range.

A naturally conjugate prior distribution for \(\eta_{f_l}\) is not known either. However, because of the constraints on \(\eta_{f_l}\), the Dirichlet distribution is assumed in this paper as the prior distribution for \(\eta_{f_l}\), i.e.,

$$ P(\eta_{f_l}) :={\mathcal D}ir(\eta_{f_l}; \eta_0), $$
(21)

where \(\eta_0\) denotes the hyperparameter vector. In the experiments described in later sections, the vector \(\eta_0\) is set to \(\eta_0 = (1.0, \ldots, 1.0)\), which is a non-informative setting for \(\eta_{f_l}\).
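Putting (19)–(21) together, one ancestral draw through the hyperparameter priors and down to the emission parameters can be sketched as follows (κ = 0.5 and ω = 100 as above; the sizes are illustrative):

```python
# One ancestral draw through the hierarchy (19)-(21), with kappa = 0.5 and
# omega = 100 as in the text; N and M_fl are illustrative sizes.
import numpy as np

rng = np.random.default_rng(2)
N, M_fl = 30, 10                               # hidden states; symbols of f_l

lam = rng.gamma(shape=0.5, scale=100.0)        # commonality, eq. (20)
eta = rng.dirichlet(np.ones(M_fl))             # common shape, eq. (21)
b_fl = rng.dirichlet(lam * eta, size=N)        # one emission vector per state
```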

The model specifications described in this section can be summarized graphically as shown in Fig. 3.

Fig. 3

Graphical representation of the proposed model. In this figure, the double circles denote observable probabilistic variables, whereas the single circles are unobservable probabilistic variables. The squares denote the fixed variables, the arrows indicate probabilistic dependencies between variables, and the dashed lines show the groups of variables. For clarity, the hyperparameters and their dependencies are omitted here

3 Implementation of event prediction

The target problem of this paper can be formulated as follows.

  • Problem: Predict an event \(e_t^{\text{new}}\) in test data, when the feature variable sequence \(f_{1:t}^{\text{new}} :=(f_1^{\text{new}},\ldots, f_t^{\text{new}})\) in the test data and a training data set Y are given.

In this problem, the training data set Y is considered as the set of time-series data sequences \(\{y^d\}_{d=1}^D\), where D is the number of sequences, and d is the index of a sequence. Another data sequence \(y^{\text{new}} := (o_1^{\text{new}}, \ldots, o_T^{\text{new}})\) is considered as test data, which is not included in the training data set Y. The test data \(y^{\text{new}}\) consist of two sequences, the event sequence \(e_{1:T}^{\text{new}}\) and the feature variable sequence \(f_{1:T}^{\text{new}}\), i.e., \(y^{\text{new}}=(e_{1:T}^{\text{new}}, f_{1:T}^{\text{new}}).\)

3.1 Bayesian predictive probability for target event

Within the Bayesian framework with the model described in Sect. 2, a reasonable approach to the target problem is to evaluate the predictive probability of the target event under the condition that the training data Y and the feature variable sequence \(f^{\text{new}}\) are available. This predictive probability for the target event variable at time t is represented by

$$ P(e_t^{\text{new}}|f_{1:t}^{\text{new}},Y)=\int \int P(e_t^{\text{new}}|f_{1:t}^{\text{new}},\theta)P(\theta ,\phi |Y)\,{\rm d}\theta\,{\rm d}\phi, $$
(22)

via the evaluation of the joint posterior distribution:

$$ P(\theta ,\phi |Y)=\sum_Z P(\theta ,\phi ,Z|Y), $$
(23)

where

$$ P(\theta ,\phi ,Z|Y)=\frac{P(Y|Z,\theta )P(Z|\theta )P(\theta |\phi )P(\phi )} {\sum_Z \int \int P(Y|Z,\theta )P(Z|\theta )P(\theta |\phi )P(\phi )\,{\rm d}\theta\,{\rm d}\phi} , $$
(24)

and Z stands for the set of hidden variable sequences \(\{z^d\}_{d=1}^D\) corresponding to the data set Y.

3.2 Calculation of predictive probability

Because of their complexity, there is no closed-form analytical solution for the integrations in (22). Therefore, we use MCMC methods [22, 24, 25] to generate samples from the posterior distribution (23) of the hierarchical Bayesian HMM. Once the samples \(\{(\theta^{(r)},\phi^{(r)})\}_{r=1}^R\) are generated by the MCMC method, the predictive probability (22) can be easily approximated as

$$ P(e_t^{\text{new}}|f_{1:t}^{\text{new}},Y) \approx \frac{1}{R}\sum_{r=1}^R P(e_t^{\text{new}}|f_{1:t}^{\text{new}},\theta^{(r)} ), $$
(25)

using only the parameter samples \(\{\theta^{(r)}\}_{r=1}^R\). Here, r stands for the index of a sample, and R is the number of samples. The conditional predictive probability \(P(e_t^{\text{new}}|f_{1:t}^{\text{new}},\theta^{(r)})\) in (25) can be calculated analytically by

$$ P(e_t^{\text{new}}|f_{1:t}^{\text{new}},\theta^{(r)})= \sum_{q_t^{\text{new}}} P(e_t^{\text{new}}|q_t^{\text{new}}, b_e^{(r)})P(q_t^{\text{new}}| f_{1:t}^{\text{new}}, \theta^{(r)}), $$
(26)

where \(q_t^{\text{new}}\) is the hidden state at time t, \(b_e^{(r)}\) stands for the parameter \(b_e\) in the rth parameter sample \(\theta^{(r)}\), and

$$ P(q_t^{\text{new}}|f_{1:t}^{\text{new}},\theta^{(r)})= \frac{P(f_{t}^{\text{new}}|q_t^{\text{new}},b_{f}^{(r)}) P(f_{1:t-1}^{\text{new}},q_t^{\text{new}}|\theta^{(r)})}{\sum_{q_t^{\text{new}}} P(f_{t}^{\text{new}}|q_t^{\text{new}},b_{f}^{(r)})P (f_{1:t-1}^{\text{new}},q_t^{\text{new}}|\theta^{(r)})}. $$
(27)

By considering only the feature variable sequence \(f^{\text{new}}\), it is easy to compute the forward probability \(P(f_{1:t-1}^{\text{new}},q_t^{\text{new}}|\theta^{(r)})\) in (27) with the well-known forward procedure for HMMs, as follows:

$$ P(f_{1:t-1}^{\text{new}}, q_t^{\text{new}}|\theta^{(r)})= \sum_{q_{t-1}^{\text{new}}} P(f_{t-1}^{\text{new}}|q_{t-1}^{\text{new}},b_{f}^{(r)}) P(q_t^{\text{new}}|q_{t-1}^{\text{new}},a^{(r)}) P(f_{1:t-2}^{\text{new}}, q_{t-1}^{\text{new}}|\theta^{(r)}). $$
(28)
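A minimal sketch of the prediction step (25)–(28) is given below. The array names and shapes are our assumptions (they do not appear in the paper), and symbols are taken to be 0-indexed:

```python
# Sketch of eqs. (25)-(28): forward filtering over the feature sequence only,
# followed by the Monte Carlo average over parameter samples. Assumed shapes
# (our convention): a is (N, N), c is (N,), b_e is (N, M_e), b_f is
# (L, N, M_f); f_seq is a (t, L) integer array of 0-indexed feature symbols.
import numpy as np

def feature_likelihood(b_f, f_t):
    # P(f_t | q, b_f) for every state q: product over components, eq. (5)
    return np.prod([b_f[l, :, f_t[l]] for l in range(len(f_t))], axis=0)

def filtered_state_probs(a, c, b_f, f_seq):
    # P(q_t | f_{1:t}, theta) via the forward procedure, eqs. (27)-(28)
    alpha = c * feature_likelihood(b_f, f_seq[0])
    alpha /= alpha.sum()
    for f_t in f_seq[1:]:
        alpha = (alpha @ a) * feature_likelihood(b_f, f_t)
        alpha /= alpha.sum()            # normalization as in eq. (27)
    return alpha                        # shape (N,)

def predictive_event_probs(samples, f_seq):
    # eq. (25): average eq. (26) over the R parameter samples, where each
    # sample is a dict {"a": ..., "c": ..., "b_e": ..., "b_f": ...}
    probs = [filtered_state_probs(s["a"], s["c"], s["b_f"], f_seq) @ s["b_e"]
             for s in samples]
    return np.mean(probs, axis=0)       # shape (M_e,)
```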

The implementation described in this section is summarized in Fig. 4.

Fig. 4

Procedure for event prediction

4 Event detection experiment for soccer games

To evaluate the metadata generation system in Fig. 1 in real situations, the proposed method was tested on the problem of event detection in soccer games. The details are described below.

4.1 Target data sequences

In this subsection, we explain the target data sequences of the event detection problem. Following Ref. [7], the target events are defined as Kick Off (KO), Corner Kick (CK), Free Kick (FK), Throw In (TI), and Goal Kick (GK). The event occurrence period is defined as a 9-s range centered on the time corresponding to the referee’s instruction.

4.1.1 Target soccer games

For the data set in this experiment, we used videos of five half games in the J-League (a professional soccer league in Japan): the first four half games were used for training, and the last half game was used for testing. Table 1 shows basic information about the five target games, and Fig. 5 shows example scenes from the video data set.

Table 1 Basic information of target half games
Fig. 5

Example scene of the video data set. Two-directional cameras are used in this example. a Example scene of the video data set captured by the left video camera. b Example scene of the video data set captured by the right video camera

4.1.2 Players’ positions

For all five half games, the players’ position sequences were estimated from the corresponding video data sets using the player tracking method in [26]. Figure 6 shows an example of the estimated players’ positions. By considering symmetries in soccer games, reversed position sequences were also generated for the first four half games. More specifically, from each players’ position sequence, we generated a non-reversed sequence, a long-side-axis reversed sequence, a short-side-axis reversed sequence, and a long-and-short-side-axis reversed sequence for training. Thus, we used 16 players’ position sequences for training and one players’ position sequence for testing.

Fig. 6

Example of the estimated players’ positions

4.1.3 Feature variables

Feature variable sequences, consisting of 40 components, were extracted from the players’ position sequences. These 40 components were preliminarily selected from about five thousand candidates by applying a simple screening method with information-based criteria to the training data sets. The selected components included the average of the players’ positions on the long-side axis (the X-coordinate) and the variance of the players’ positions on the long-side axis. The features also included geometric information about the soccer field: the field was divided into a \(k_v \times k_h\) grid, and some of the features carry information about the grid square in which that particular feature is located. Figure 7 schematically illustrates the case with \(k_v = 3\) and \(k_h = 5\). All components were quantized to 10 levels for HMM modeling. The details of the candidates and the screening method are described in Appendix B.
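As a rough illustration of the grid-based features and the 10-level quantization (the exact procedure is in Appendix B; the field dimensions, equal-width grid, and equal-frequency bins below are our simplifications):

```python
# Illustrative sketch only (not Appendix B's exact procedure): count players
# per cell of a k_v x k_h grid, and quantize a continuous feature to 10
# levels via equal-frequency bins estimated on training data. The field
# dimensions (105 m x 68 m) are an assumption for illustration.
import numpy as np

def grid_counts(xy, field=(105.0, 68.0), k_h=5, k_v=3):
    # xy: (n_players, 2) positions in meters; returns (k_v, k_h) player counts
    y_edges = np.linspace(0.0, field[1], k_v + 1)
    x_edges = np.linspace(0.0, field[0], k_h + 1)
    counts, _, _ = np.histogram2d(xy[:, 1], xy[:, 0], bins=[y_edges, x_edges])
    return counts

def quantize_10(train_values, values):
    # map a continuous feature to symbols 1..10 using training-set deciles
    edges = np.quantile(train_values, np.linspace(0.1, 0.9, 9))
    return np.digitize(values, edges) + 1
```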

Fig. 7

The soccer field divided into a grid. This particular case uses a 3 × 5 grid

4.2 Settings

The proposed event detection method was evaluated using the feature variable sequences described above. The unit time for the HMM was 1 s, and the number of hidden states N was set to 30. For the computation described in Fig. 4, we generated 1,000 samples in the MCMC step (b) (G = 1,000), and we used the last 500 samples for the Monte Carlo approximation (R = 500) in this experiment. The approximated predictive probability (25) was averaged over five independent trials to obtain a more accurate approximation.

4.3 Experimental results

The third column of Table 2 summarizes the Area Over the ROC Curve (AOC) for each event, a detection error index defined from ROC curves drawn using the predictive probabilities for a specific event. Figure 8 shows three examples of predictive probabilities estimated by the proposed method: (a) CK, (b) FK, and (c) GK.

Table 2 Event AOC with test sequence
Fig. 8

Examples of event predictive probabilities estimated by the proposed method. The ground truth is superimposed as a gray rectangle. The horizontal axis covers a 500-s interval around typical event occurrences. The ranges of the vertical axes are adjusted to show the maximum and minimum values of the predictive probabilities. a CK, b FK, c GK

One reason for the low predictive capability for FK may be that an FK can take place without a game interruption. One possible improvement strategy could be to take into account the ball position, which is not associated with the players’ positions.

Let us show how the hyperparameters were learned in our experiments. As noted earlier, hyperparameter learning is one of the main points of this study. Figure 9 shows box plots of some of the commonality hyperparameter posteriors, associated with features 17, 4, 6, 36, 37, 27, 1, 2, and 3. Since the hyperparameter associated with feature 17 is relatively large, feature 17 could be less relevant to the prediction problem in question. Features 1, 2, and 3 could be more relevant than the others, because the associated commonality hyperparameters are smaller. Features 4, 6, 36, 37, and 27 are of intermediate relevance to the target prediction problem. Figure 10 shows posterior trajectories of the commonality hyperparameters associated with features 17, 36, and 2, where the horizontal axis is the MCMC iteration number and the vertical axis is in log scale.

The features that appear in Fig. 9 are described in the following list, where “player occlusion weight” is defined as a weight proportional to the number of occluded players, and “grid-based feature” refers to the grid defined in Sect. 4.1. In addition to a target feature quantity, each feature carries two descriptors: whether it is a statistical or a grid-based quantity, and whether or not a player occlusion weight is applied. The following list includes this information together with the meaning of each feature.

  a. feature 17: minimum velocity among all the players; statistical feature; no player occlusion weight; indicates whether the players are moving

  b. feature 4: variance of the X-coordinates among all the players; statistical feature; no player occlusion weight; indicates the player density

  c. feature 6: variance of the X-coordinates among all the players; statistical feature; with player occlusion weight; indicates the player density

  d. feature 37: number of players in a grid square; grid-based feature; no player occlusion weight; indicates the number of players in the center foreground area

  e. feature 36: number of players in the center background area; grid-based feature; no player occlusion weight; indicates the number of players in the center background area

  f. feature 27: sum of the players’ velocities in the center field; grid-based feature; with player occlusion weight; indicates the total momentum of the players in the center background area

  g. feature 3: number of players in the center background area; grid-based feature; with player occlusion weight; indicates the number of players in the center background area

  h. feature 1: number of players in the center field convolved with Gaussian kernels; grid-based feature; no player occlusion weight; indicates the density of players in the center field area

  i. feature 2: mean of the X-coordinates among all the players; statistical feature; with player occlusion weight; indicates the players’ average position

Fig. 9

Box plots of some of the commonality hyperparameter posteriors. The larger the posterior value of the hyperparameter, the less relevant the feature is to the predictions. Feature 17 appears relatively irrelevant, whereas features 3, 1, and 2 appear more relevant

Fig. 10

Trajectories of the posterior samples of some of the commonality hyperparameters associated with features 17 (top), 36 (middle), and 2 (bottom), where the horizontal axis is the MCMC iteration number. The vertical axis is in log scale

The number of players in the center field and the players’ mean X-coordinate appear important, whereas the minimum velocity of the players appears less relevant for the event prediction in question.

The average computational time for each time step is \(5.019 \times 10^{-3}\) s/sample in the following environment: CPU: Intel(R) Xeon(TM) (3.00 GHz); memory: 3.00 GB; OS: Microsoft Windows Server 2003 Enterprise Edition Service Pack 2; language: C++; compiler: Microsoft(R) 32-bit C/C++ Optimizing Compiler Version.

With parallel processors, each handling a certain number of samples, real-time computation is feasible.

The order of complexity for event detection is

$$ O(RN^2 + RNL), $$

where R is the number of samples, N is the number of states, and L is the dimension of the feature variable (the number of feature components).

4.4 Performance comparison

Of the many papers on sports event detection methods listed in the references, most use EM algorithms for parameter estimation, whereas Refs. [16, 17] use MCMC. The advantages and disadvantages of MCMC and EM are relatively well understood: EM algorithms are simpler to implement, but they sometimes suffer from the local maxima problem, giving rise to sensitivity to initial conditions; MCMC requires more implementation work than EM, but it is relatively robust.

As noted above, we believe there are two novelties in the present study: (a) the hyperparameter learning, and (b) the particular hyperparameter prior distributions. Recall that hyperparameters are one of the important ingredients of Bayesian learning, in that they control the properties of the target prior distributions, so that the shape of the target distribution at one set of hyperparameter values can be largely different from that at another. Thus, if the hyperparameters are learned automatically from the data, the prediction capability can be improved. This is what we attempt to demonstrate in this paper. Therefore, the performance comparison boils down to a comparison of HMMs with and without hyperparameter learning for soccer game event detection. Of the references [10–17, 19] on HMM-based sports event detection methods, none appears to perform hyperparameter learning, if we have understood them correctly.

Here, we describe performance comparisons of the HMM event detection method with and without hyperparameter learning. Both methods were trained and tested with the same feature variable sequences under the same settings. Our comparison was in terms of three measures. The first is the paired Wilcoxon test on the differences of the AOC values given in the second and third columns of Table 2. The result of this test is summarized as follows:

Paired Wilcoxon test

  1. significance level: 0.05

  2. one-sided alternative hypothesis: median of AOC differences > 0

  3. p-value: 0.03125.
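For reference, this test is reproducible with standard tools. The exact one-sided p-value of 0.03125 = 1/2⁵ for five pairs implies that all five AOC differences were positive; the differences below are hypothetical placeholders with that property:

```python
# Paired one-sided Wilcoxon signed-rank test on per-event AOC differences
# (AOC without hyperparameter learning minus AOC with it). The five values
# below are hypothetical placeholders; with n = 5 pairs, all positive, the
# exact one-sided p-value is 1 / 2**5 = 0.03125, matching the value above.
import numpy as np
from scipy.stats import wilcoxon

aoc_diff = np.array([0.04, 0.02, 0.05, 0.01, 0.03])   # hypothetical
stat, p = wilcoxon(aoc_diff, alternative="greater")
print(p)    # 0.03125
```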

The second and third measures are information based. One is the cross entropy (negative averaged log likelihood):

$$ H_{\text{test}}(q) := -\frac{1}{T} \sum_{t=1}^T \log_{2} q(e^{\text{new}}_t), $$

whereas the other is the perplexity:

$$ 2^{H_{\text{test}}(q)}, $$
(29)

where \(q(e_t^{\text{new}})\) stands for the approximated predictive probability. Table 3 shows these information-based indexes for the predictive probabilities estimated by both methods.

Table 3 Information-based error indexes with test sequence
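For completeness, these indexes can be computed directly from the approximated predictive probabilities; a minimal sketch with our own array names:

```python
# Cross entropy H_test(q) (negative averaged log likelihood, base 2) and the
# perplexity 2**H_test(q), eq. (29), computed from predictive probabilities.
# pred: (T, M_e) array of predictive distributions (rows from eq. (25));
# e_new: (T,) array of observed 0-indexed test events. Names are ours.
import numpy as np

def cross_entropy_and_perplexity(pred, e_new):
    q_obs = pred[np.arange(len(e_new)), e_new]   # q(e_t^new) for each t
    H = -np.mean(np.log2(q_obs))
    return H, 2.0 ** H
```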

These comparisons appear to indicate that the proposed method outperformed the method without hyperparameter learning.

5 Conclusion

In this paper, we have proposed a method for soccer game event detection with an HMM in a hierarchical Bayesian framework. The method is hierarchical in the sense that hyperparameter learning is performed in addition to the target HMM parameter learning. Furthermore, we have proposed particular hyperparameter prior distributions in a product form consisting of commonality hyperparameters and common shape hyperparameters. One of the main reasons for using such prior distributions is their ability to automatically suppress ineffective features. The method was implemented by MCMC instead of the popular EM.

The proposed method was applied to 40-dimensional data sequences extracted from real professional soccer games. The performance was compared with that of a method without hyperparameter learning, with respect to three measures: AOC, cross entropy, and perplexity. The proposed method appeared functional.

The following is a list of possible future research projects that are planned in our laboratory:

  • Model extensions: In many cases of sequential data modeling, model extensions with generalized HMMs, also known as hidden semi-Markov models, can improve the modeling performance, and they have been successfully applied to several problems (e.g., [27, 28]). Model extensions, including such a generalized HMM-based approach, are expected to be effective for event detection problems.

  • Incomplete information: There are cases where some of the data are missing. Such a situation is called “incomplete information”; it is common in the Bayesian HMM learning framework and is solvable. We would like to deal with this in a future research project.

  • Problems with wider scope: In this paper, we focused on the event detection problem based on a small number of professional soccer games. We are currently in the process of obtaining a larger data set of soccer game videos from a professional league. Applications of the proposed method, with modifications, to problems other than event detection, such as sports strategy/situation analysis, may be interesting.

  • Other sports: Although we applied the proposed method only to soccer games in this paper, it is not limited to soccer. Applications to other sports, e.g., rugby football, ice hockey, and basketball, can be considered.