Nonparametric Bayesian

In this chapter, we take a Bayesian nonparametric approach to defining a prior on the hidden Markov model that allows for flexibility in modeling the complex dynamics of robot manipulation tasks. We first consider underlying dynamics that can be well modeled as a hidden discrete Markov process, but for which the cardinality of the state space is uncertain. Through the hierarchical Dirichlet process (HDP), one can examine an HMM with an unbounded number of possible states. Subsequently, the sticky HDP-HMM is investigated, which allows more robust learning of complex dynamics through a learned bias that increases the probability of self-transitions. Although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption that observations are conditionally independent given the discrete HMM state. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. To address this issue, we consider extensions of the sticky HDP-HMM for learning switching dynamical processes with switching linear dynamical systems. In the later chapters of this book, we will verify their performance in modeling multimodal time series and present results on robot movement identification, anomaly monitoring, and anomaly diagnosis.


Introduction
Humans can perceive the state of the outside world, or of objects to be manipulated, through the five senses, drawing on past experience and internal models; they can also recognize their own strengths and weaknesses in order to learn and adjust those experiences and models. Robots, in contrast, can only understand the outside world through sensing data, so the modeling and analysis of multimodal time series is a difficult and active research topic for realizing robot perception. The multimodal information generated during robot execution, including motion encoders, vision, forces/torques, haptics, and sounds, forms a highly complex, nonlinear, and dynamic process. Modeling based on the traditional hidden Markov model suffers from uncertainty about the number of hidden states and from unrealistically rapid transitions between hidden states, making it difficult to accurately capture the latent modes of multimodal data. In this chapter, based on fundamental Bayesian theory and the theoretical derivation of the HMM, and combined with the inherent characteristics of the robot's multimodal sensor information, we propose a robust nonparametric Bayesian model that is more efficient in handling complex dynamic phenomena, uncertainty, and model parameter learning. In this model, the number of hidden states is determined by the complexity of the data, and the hidden states carry a high expected probability of self-transition. To a certain extent, the nonparametric Bayesian model provides a theoretical framework and application guidance for research on robot multimodal information sensing and fusion.

Related Works
In neuroscience, studies have shown that the different human senses are closely combined [1], enabling the complex manipulation tasks and environmental perception of daily life. In recent years, inspired by these results, scholars have studied the role of multimodal sensing information fusion in implementing practical manipulation tasks in robotics [2,3]. Fitzpatrick et al. [4] introduced a method of cross-modal verification, which enhances the robot's perception of multimodal events through repetition and redundancy of sensing information, so that the robot can learn the underlying correlation between vision and audition. Wu et al. [5] studied the combination of acceleration and sound measurements to detect structural defects in aircraft parts. Su et al. [6] introduced a multimodal event detection framework for a robotic peg-in-hole assembly task, based on the robot's end-effector pose and haptic sensing signals. From the aforementioned examples, multimodal fusion refers to the synthesis of sensory data with certain temporal and spatial relationships from multiple different sensors, modeling it at different levels and learning its latent patterns [7]. The essence of multimodal fusion, then, is the modeling and analysis of multivariate time series (MTS). For example, so that a robot can perform flexible operations in an uncertain environment, a force/torque sensor is mounted on the end-effector [8], a tactile sensor and an acoustic sensor are installed on the robot gripper, and a visual sensor is placed to monitor the robot workspace [9][10][11], as shown in Fig. 2.1. In this way, the multi-sensor fusion method can more comprehensively and accurately monitor environmental characteristics, eliminate signal variability and information uncertainty, and improve system reliability and stability [7].
Time series modeling is pervasive in fields as diverse as speech recognition, robotics, economics, medicine, multimedia, bioinformatics, and system control [12]. In recent years, related modeling methods have emerged continuously, among which the hidden Markov model (HMM) [13] is the most popular; it has been used in speech recognition [14][15][16], computational biology [17,18], machine translation [19,20], cryptanalysis [21,22], economics [23,24], and human behavior recognition [25,26], among others. In particular, the HMM has been widely used in robotics, mainly for process supervision [27,28], robot state estimation [29,30], decision making [31,32], and learning robot motion primitive transitions [33,34]. The literature [27] indicated that the HMM can be used for anomaly monitoring, providing a starting point for the follow-up work on anomaly detection in this book. Although there have been many successful applications of the HMM, the model has two inherent shortcomings [35]: (1) it is difficult to determine the size of the HMM hidden state space (the number of hidden states); and (2) the HMM follows the Markov chain assumption (in a given hidden state, the observations are independently distributed) and relies on the traditional expectation-maximization (EM) algorithm to infer unknown parameters [36,37]. These inherent disadvantages greatly weaken the accuracy and efficiency of HMM modeling for time series data.
To address these problems of the HMM, nonparametric Bayesian models [38] and HMMs with infinite hidden state spaces [39] were proposed. Teh et al. [40] proposed the hierarchical Dirichlet process hidden Markov model (HDP-HMM). The HDP-HMM places Bayesian priors with hyperparameters separately on the hidden-state transition distributions and the observation distributions of the HMM, so that the number of hidden states can be learned from the complexity of the training data, with the unknown parameters of the model inferred by Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling (GS) [41,42]. Because the HDP-HMM uses a fully Bayesian method to learn the number of hidden states by assuming an infinite state space, it effectively avoids overfitting the data and repeated trial-and-error model selection. From the implementation of the model and the experimental results, however, it is known that the HDP-HMM's modeling of hidden states still retains the properties of the original Markov chain, which causes rapid transitions between hidden states and a resulting lack of temporal continuity. Temporal continuity here means that the occurrence of events maintains a certain consistency over time; without it, analyzing events in the real world becomes difficult. To address this problem of the HDP-HMM, Fox et al. [43][44][45] proposed the sticky hierarchical Dirichlet process hidden Markov model (sHDP-HMM). The sHDP-HMM tackles the rapid transition of hidden states by adding a bias toward self-transition to each hidden state's transition distribution, so as to meet the requirement of temporal continuity. Experimental results show that the "sticky" hidden states enhance the consistency of time series modeling and can achieve more accurate results than the HDP-HMM in speech recognition and multi-speaker diarization experiments [43].

Multimodal Time Series
With the development of multimodal sensing fusion technology and the diversity of environments, robots often need, beyond the joint encoders, additional force/torque sensors and tactile sensors to sense the surrounding environment, so that the robot's observations are often multimodal. In general, a multidimensional time series with dimension D and length T is represented as

$$\{y_d(t)\}, \quad t = 1, \dots, T, \; d = 1, \dots, D, \tag{2.1}$$

where t indexes the time frame, d indexes the variable, and y_d(t) denotes the observation of the d-th variable at the t-th time step. For ease of writing, define the observation vector over all dimensions at time t as y_t ∈ R^D. Hence, Eq. 2.1 can be expressed as a T × D matrix, which we take as the canonical form of a multidimensional time series:

$$Y = \begin{bmatrix} y_1(1) & y_2(1) & \cdots & y_d(1) & \cdots & y_D(1) \\ y_1(2) & y_2(2) & \cdots & y_d(2) & \cdots & y_D(2) \\ \vdots & \vdots & & \vdots & & \vdots \\ y_1(T) & y_2(T) & \cdots & y_d(T) & \cdots & y_D(T) \end{bmatrix} \tag{2.2}$$
Each row in Eq. 2.2 represents the observation at time t, and each column represents the data of a signal. This book uses a non-parametric Bayesian hidden Markov model to learn and analyze the multi-dimensional time series [45,46].
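As a concrete illustration, the following minimal Python sketch arranges synthetic observations into the T × D matrix of Eq. 2.2; the values and the sensor channels named in the comments are illustrative assumptions, not data from this book.

```python
import numpy as np

# Stack D = 3 hypothetical sensor channels (e.g., a force axis, a torque
# axis, an encoder reading) over T = 5 time steps into the T x D matrix
# of Eq. 2.2. The values are synthetic.
T, D = 5, 3
rng = np.random.default_rng(0)
Y = rng.normal(size=(T, D))   # row t is the observation vector y_t in R^D

y_t = Y[2]         # observation vector at time step t = 3 (row index 2)
channel = Y[:, 0]  # the full time series of the first sensor channel

print(Y.shape)     # (5, 3)
```

Each row of `Y` is one observation vector y_t, and each column is the trajectory of a single signal, matching the reading of Eq. 2.2 given above.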

Bayes' Rules
Assume two random variables y and θ. Bayes' rule is defined as

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}, \tag{2.3}$$

where p(θ|y) is the posterior probability, p(y|θ) is the likelihood, p(θ) is the prior probability, and p(y) is the normalizing constant. From Eq. 2.3, the posterior probability is proportional to the product of the likelihood and the prior. Bayes' rule summarizes the process of Bayesian inference: given the likelihood and the normalizing constant, the posterior probability can be inferred from the prior. The Bayesian time series models discussed in this book are those in which the variables are time-dependent; because unbounded temporal correlation would make the model intractable, the variables are given conditional-independence constraints such as the Markov chain assumption, so that two adjacent variables y_{t+1} and y_t can be related through Eq. 2.3:

$$p(y_{t+1} \mid y_t) = \frac{p(y_t \mid y_{t+1})\, p(y_{t+1})}{p(y_t)}. \tag{2.4}$$

Therefore, if y in Eq. 2.3 is treated as the observed random variable and θ as the parameters of the corresponding probability model, where the unknown parameter is governed by a hyperparameter λ, Bayes' theorem can be extended to

$$p(\theta \mid y, \lambda) = \frac{p(y \mid \theta)\, p(\theta \mid \lambda)}{p(y \mid \lambda)} \propto p(y \mid \theta)\, p(\theta \mid \lambda). \tag{2.5}$$

From Eq. 2.5, the parameters of the model can be estimated by maximizing the posterior probability, i.e., the maximum a posteriori (MAP) estimate:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta \mid \lambda). \tag{2.6}$$
If the prior probability p(θ|λ) is a constant, then the maximum posterior estimate is equivalent to the maximum likelihood (ML) estimate,

$$\hat{\theta}_{ML} = \arg\max_{\theta}\, p(y \mid \theta). \tag{2.7}$$

Equations 2.6 and 2.7 are both point-estimation methods for model parameters and will be widely used in the later chapters.
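To make the difference between the MAP and ML point estimates of Eqs. 2.6 and 2.7 concrete, here is a small sketch for a Bernoulli likelihood with a Beta prior; the counts and hyperparameter values are illustrative assumptions, and the Beta(a, b) prior plays the role of p(θ|λ).

```python
# MAP vs. ML estimation for a Bernoulli parameter theta.
heads, n = 7, 10          # observed data: 7 successes in 10 trials (assumed)
a, b = 2.0, 2.0           # Beta prior hyperparameters (the lambda of Eq. 2.5)

theta_ml = heads / n                           # maximum likelihood (Eq. 2.7)
theta_map = (heads + a - 1) / (n + a + b - 2)  # posterior mode (Eq. 2.6)

print(theta_ml, theta_map)  # 0.7 0.666...
```

With a flat prior (a = b = 1), the MAP estimate collapses to the ML estimate, exactly as stated above.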

Conjugate Prior
In Bayes' rule, when the posterior distribution belongs to the same family as the prior distribution, the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function. The use of conjugate priors is often motivated by the practical consideration that the parameters θ cannot be learned directly from the observations: conjugate priors provide a computationally tractable mechanism for incorporating new observations into the posterior distribution of θ. Within the Bayesian framework, we are interested in placing a prior distribution on the latent model parameter θ in order to make predictions about new observations. To this end, assuming the associated conditional distributions exist, and given N i.i.d. observations, the predictive likelihood is written as

$$p(y_{N+1} \mid y_1, \dots, y_N) = \int_{\Theta} p(y_{N+1} \mid \theta)\, p(\theta \mid y_1, \dots, y_N)\, d\theta. \tag{2.8}$$

To limit computational complexity, we discuss only the conjugate priors of the three likelihood functions required in subsequent chapters. In particular, three common observation models are considered for the hidden Markov model: the multinomial distribution, the multivariate Gaussian distribution, and the linear Gaussian regression model.

• Multinomial Observations
The multinomial distribution [130,131] considers a random variable y on a finite sample space Y = {1, . . . , K}; its probability mass function can be denoted by π = [π_1, . . . , π_K], and the probability of a string of N observations of y taking on values y_1, . . . , y_N is

$$p(y_1, \dots, y_N \mid \pi) = \prod_{k=1}^{K} \pi_k^{N_k}, \qquad N_k = \sum_{n=1}^{N} \delta(y_n, k), \tag{2.9}$$

where the notation δ(j, k) indicates the discrete Kronecker delta. When K = 2, this distribution is referred to as the binomial distribution. The Dirichlet distribution is used to formulate the conjugate prior of the multinomial distribution. A K-dimensional Dirichlet distribution is the conjugate prior for the class of K-dimensional multinomial distributions and is uniquely defined by a set of hyperparameters α = [α_1, . . . , α_K]; it has the form

$$\mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \tag{2.10}$$

where Γ(·) represents the standard Gamma function. When K = 2, this distribution is referred to as the Beta distribution, denoted Beta(α_1, α_2). The mean of the distribution in Eq. 2.10 is given by

$$\mathbb{E}[\pi_k \mid \alpha] = \frac{\alpha_k}{\sum_{j=1}^{K} \alpha_j}. \tag{2.11}$$

Then, by the conjugacy of the Dirichlet distribution, conditioned on N multinomial observations y_1, . . . , y_N, the posterior distribution of π is also Dirichlet:

$$p(\pi \mid y_1, \dots, y_N, \alpha) \propto p(\pi \mid \alpha)\, p(y_1, \dots, y_N \mid \pi) \propto \mathrm{Dir}(\alpha_1 + N_1, \dots, \alpha_K + N_K), \tag{2.12}$$

where N_k represents the number of observations with y_n = k. Subsequently, using the normalizing constant of the Dirichlet distribution and substituting into Eq. 2.8, one can derive the predictive likelihood

$$p(y_{N+1} = k \mid y_1, \dots, y_N, \alpha) = \frac{\alpha_k + N_k}{\sum_{j=1}^{K} (\alpha_j + N_j)}. \tag{2.13}$$
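The Dirichlet-multinomial conjugate update described above amounts to adding category counts to the prior hyperparameters. A minimal sketch, with illustrative counts and hyperparameters:

```python
import numpy as np

# Conjugate update of Eq. 2.12: the Dirichlet posterior adds the observed
# category counts N_k to the prior hyperparameters alpha_k.
alpha = np.array([1.0, 1.0, 1.0])      # Dir(alpha) prior, K = 3 (assumed)
y = np.array([0, 2, 2, 1, 2, 0, 2])    # N = 7 multinomial observations (assumed)
counts = np.bincount(y, minlength=3)   # N_k = [2, 1, 4]

alpha_post = alpha + counts            # Dir(alpha_1 + N_1, ..., alpha_K + N_K)
pred = alpha_post / alpha_post.sum()   # predictive probability of each k (Eq. 2.13)

print(alpha_post)  # [3. 2. 5.]
```

The predictive vector `pred` is exactly the ratio in Eq. 2.13: prior pseudo-counts plus observed counts, normalized.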

• Multivariate Gaussian Observations
The multivariate Gaussian distribution is parameterized by a mean vector μ and covariance matrix Σ, and provides a useful description of continuous-valued random variables that concentrate about a given value with constrained variability [130]. This distribution over a sample space y ∈ Y = R^D, where each observation y is a D-dimensional vector, is given by

$$N(y; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(y - \mu)^{T} \Sigma^{-1} (y - \mu)\right\}. \tag{2.14}$$

We conveniently denote this multivariate Gaussian distribution by N(μ, Σ). Three categories of priors on its parameters are considered in this book: known covariance, known mean, and the case where both the mean and covariance are uncertain.

Known Covariance
For known covariance Σ, we use the normal distribution as the conjugate prior on the mean vector μ, represented by N(μ_0, Σ_0).

Known Mean
For known mean, when only the covariance Σ is uncertain, the conjugate prior is the inverse-Wishart distribution [130]. The D-dimensional inverse-Wishart distribution has two parameters, a scale matrix Δ and ν degrees of freedom, and is given by

$$IW(\Sigma; \nu, \Delta) = \frac{|\Delta|^{\nu/2}\, |\Sigma|^{-(\nu + D + 1)/2}}{2^{\nu D/2}\, \Gamma_D(\nu/2)} \exp\left\{-\frac{1}{2}\mathrm{tr}\!\left(\Delta \Sigma^{-1}\right)\right\}, \tag{2.15}$$

where tr(·) denotes the trace of a matrix and Γ_D(·) is the multivariate Gamma function. We write this distribution as IW(ν, Δ); its first moment is given by the expectation

$$\mathbb{E}[\Sigma] = \frac{\Delta}{\nu - D - 1}, \qquad \nu > D + 1. \tag{2.16}$$

Both Covariance and Mean are Uncertain
We use the normal-inverse-Wishart distribution as the conjugate prior when both the mean and the covariance are uncertain. This distribution defines a conditionally normal prior on the mean, μ|Σ ∼ N(ϑ, Σ/κ), and an inverse-Wishart prior on the covariance, Σ ∼ IW(ν, Δ). The joint prior for the uncertain mean and covariance is then given by

$$p(\mu, \Sigma) = N\!\left(\mu;\, \vartheta,\, \Sigma/\kappa\right) IW(\Sigma;\, \nu, \Delta), \tag{2.17}$$

where κ represents the degree of trust in the mean generated by this prior distribution, ϑ represents the mean of this prior distribution, ν represents the degree of trust in the variance generated by this prior distribution, and Δ determines the mean of the covariance matrix Σ. We use the notation NIW(κ, ϑ, ν, Δ) to represent the normal-inverse-Wishart distribution with its four parameters.
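A draw from the NIW prior described above is generated in two stages: first the covariance from the inverse-Wishart, then the mean conditionally on it. A minimal sketch, with illustrative hyperparameter values and using SciPy's `invwishart`:

```python
import numpy as np
from scipy.stats import invwishart

# Draw (mu, Sigma) ~ NIW(kappa, theta, nu, Delta) as in Eq. 2.17:
# Sigma ~ IW(nu, Delta), then mu | Sigma ~ N(theta, Sigma / kappa).
# All hyperparameter values below are assumptions for illustration.
rng = np.random.default_rng(1)
D = 2
kappa, theta = 5.0, np.zeros(D)
nu, Delta = D + 2, np.eye(D)

Sigma = invwishart(df=nu, scale=Delta).rvs(random_state=rng)
mu = rng.multivariate_normal(theta, Sigma / kappa)

print(Sigma.shape, mu.shape)  # (2, 2) (2,)
```

Larger κ concentrates the conditional prior of μ around ϑ, matching the interpretation of κ as the degree of trust in the prior mean.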

• Multivariate Linear Gaussian Regression Observations
The normal multivariate linear regression model is one in which the observations y_i ∈ R^d can be described as a linear combination of a set of known regressors x_i ∈ R^m, with errors accounted for by additive Gaussian noise:

$$y_i = A x_i + e_i, \qquad e_i \sim N(0, \Sigma). \tag{2.18}$$

To account for the temporal correlation in a multimodal system, we may combine a set of N observation vectors into a matrix Y = [y_1 · · · y_N], the regressors into a matrix X = [x_1 · · · x_N], and the noise terms into E = [e_1 · · · e_N], yielding

$$Y = A X + E, \tag{2.19}$$

where the matrix A ∈ R^{d×m} is referred to as the design matrix. Assuming the noise covariance Σ is known, the conjugate prior on the design matrix A is the matrix-normal distribution [132]. A matrix A ∈ R^{d×m} has a matrix-normal distribution MN(A; M, V, K) given by

$$MN(A; M, V, K) = \frac{|K|^{d/2}}{(2\pi)^{dm/2}\, |V|^{m/2}} \exp\left\{-\frac{1}{2} \mathrm{tr}\!\left((A - M)^{T} V^{-1} (A - M)\, K\right)\right\}, \tag{2.20}$$

where M is the mean matrix, and V and K^{-1} are the covariances along the rows and the columns of A, respectively. Combining Eqs. 2.15 and 2.20, the conjugate prior on the set of parameters A and Σ is the matrix-normal inverse-Wishart prior, which places a conditionally matrix-normal prior on A given Σ:

$$p(A, \Sigma) = MN(A;\, M, \Sigma, K)\, IW(\Sigma;\, \nu, \Delta). \tag{2.21}$$
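A draw from the matrix-normal distribution of Eq. 2.20 can be sketched with the standard construction A = M + L_V Z L_C^T, where V = L_V L_V^T is the row covariance, C = K^{-1} = L_C L_C^T is the column covariance, and Z has i.i.d. standard-normal entries. The sizes and covariance values below are illustrative assumptions.

```python
import numpy as np

# Sample A ~ MN(M, V, K^{-1}) via Cholesky factors of the row and
# column covariances. Values are illustrative.
rng = np.random.default_rng(2)
d, m = 2, 3
M = np.zeros((d, m))
V = np.eye(d)            # covariance along the rows of A
C = 0.5 * np.eye(m)      # column covariance K^{-1}

Z = rng.standard_normal((d, m))
A = M + np.linalg.cholesky(V) @ Z @ np.linalg.cholesky(C).T

print(A.shape)  # (2, 3)
```

Setting V = Σ in this construction recovers the conditionally matrix-normal prior of the MNIW in Eq. 2.21.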

Hidden Markov Model
The HMM has been a workhorse in pattern recognition, able to encode a probabilistic state-space model. The HMM is a stochastic process in which a finite number of latent states have Markovian state transitions. Conditioned on the mode sequence, the model assumes (discrete or continuous) observations that are conditionally independent given the latent state. Let z_t denote the latent state of the Markov chain at time t ∈ {1, . . . , T}, and π_j the state-specific transition distribution for state j ∈ {1, . . . , K}. Given the state z_t, the observation y_t is conditionally independent of the observations and states at other time steps. The Markovian structure on the state sequence and the observations is then simply described as

$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad y_t \mid z_t \sim F(\theta_{z_t}), \tag{2.22}$$

where the state at the first time step is distributed according to an initial transition distribution π_0; F(·) represents a family of distributions (e.g., the multinomial for discrete data, or the multivariate Gaussian for real-vector-valued data); θ_{z_t} are the state-specific emission parameters; and z_{1:T} is a state path over the hidden state space. Therefore, the resulting joint density for T observations is given by

$$p(y_{1:T}, z_{1:T}) = \pi_0(z_1)\, p(y_1 \mid \theta_{z_1}) \prod_{t=2}^{T} \pi_{z_{t-1}}(z_t)\, p(y_t \mid \theta_{z_t}). \tag{2.23}$$

As illustrated in Eq. 2.23, if the state path z_{1:T} is estimated, the maximum probability of the observation sequence can be obtained. However, the true state path is hidden from the observations. There are different approaches to computing it: (i) the maximum likelihood state at any given moment, which does not account for uncertainty; and (ii) a marginal probabilistic representation over hidden states at each time step. Here, the log-probability at time t is derived by computing the natural logarithm of the sum of exponentials over all hidden states:

$$\log p(y_{1:t}) = \log \sum_{k=1}^{K} \exp\left(\log \alpha_t(k)\right), \tag{2.24}$$

where α_t(k) = p(y_{1:t}, z_t = k) and β_t(k) = p(y_{t+1:T} | z_t = k) represent the forward and backward messages, respectively, in the standard forward-backward algorithm [37,47].
The filtered estimate of the state at time t is obtained recursively by first predicting the next state and then incorporating the new observation:

$$p(z_t = k \mid y_{1:t-1}) = \sum_j p(z_t = k \mid z_{t-1} = j)\, p(z_{t-1} = j \mid y_{1:t-1}), \tag{2.25}$$

$$\alpha_t(k) = p(z_t = k \mid y_{1:t}) = \frac{p(y_t \mid z_t = k)\, p(z_t = k \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}. \tag{2.26}$$

The denominator represents the probability of the sequence of observations, which can be calculated via Eq. 2.24.
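The predict-and-update recursion of Eqs. 2.25 and 2.26 can be sketched directly in code; the 2-state model below (transition matrix, emission matrix, and observation sequence) is a toy assumption, not taken from the chapter, and each step is normalized for numerical stability.

```python
import numpy as np

# Forward filtering for a discrete-output HMM.
pi0 = np.array([0.6, 0.4])                # initial distribution (assumed)
Pi = np.array([[0.9, 0.1],                # Pi[j, k] = p(z_t = k | z_{t-1} = j)
               [0.2, 0.8]])
B = np.array([[0.7, 0.3],                 # B[k, y] = p(y_t = y | z_t = k)
              [0.1, 0.9]])
obs = [0, 0, 1, 1]                        # observation sequence (assumed)

alpha = pi0 * B[:, obs[0]]
alpha /= alpha.sum()
for y in obs[1:]:
    alpha = (alpha @ Pi) * B[:, y]        # predict (Eq. 2.25), then weight (Eq. 2.26)
    alpha /= alpha.sum()                  # normalize: p(z_t | y_{1:t})

print(alpha)  # filtered posterior over the 2 states at the last step
```

After two observations of symbol 1, the filtered mass has shifted toward state 2, which emits that symbol with high probability.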

Forward-Backward Algorithm
To reason about the unknown states, the forward-backward algorithm [48] provides an efficient message-passing procedure for computing the node marginals of interest, including filtering p(z_n | y_1, . . . , y_n), prediction p(z_{n+m} | y_1, . . . , y_n), and smoothing p(z_n | y_1, . . . , y_N), N > n, where z_n denotes the hidden state and y_n the observation. Following the implementation of the belief propagation algorithm in Bayesian graphical models, we define a set of forward and backward messages:

$$\text{forward:}\; \alpha_n(z_n) \triangleq p(y_1, \dots, y_n, z_n), \qquad \text{backward:}\; \beta_n(z_n) \triangleq p(y_{n+1}, \dots, y_N \mid z_n). \tag{2.27}$$

The filtering problem is solved directly with the forward messages defined in Eq. 2.27:

$$\text{filtering:}\; p(z_n \mid y_1, \dots, y_n) = \frac{\alpha_n(z_n)}{\sum_{z} \alpha_n(z)}. \tag{2.28}$$

For prediction, we can utilize the Markov structure of the underlying chain to derive

$$p(z_{n+m} \mid y_1, \dots, y_n) = \sum_{z_{n+m-1}} p(z_{n+m} \mid z_{n+m-1})\, p(z_{n+m-1} \mid y_1, \dots, y_n), \tag{2.29}$$

which shows that prediction is equivalent to propagating the forward message without incorporating the missing observations y_{n+1}, . . . , y_{n+m}. For smoothing, however, we utilize both the forward and backward messages:

$$p(z_n \mid y_1, \dots, y_N) = \frac{p(y_1, \dots, y_N \mid z_n)\, p(z_n)}{p(y_1, \dots, y_N)} = \frac{p(y_1, \dots, y_n, z_n)\, p(y_{n+1}, \dots, y_N \mid z_n)}{p(y_1, \dots, y_N)} = \frac{\alpha_n(z_n)\, \beta_n(z_n)}{p(y_1, \dots, y_N)}. \tag{2.30}$$

Hence, we can refer to the belief propagation algorithm to address the filtering, prediction, and smoothing problems of the HMM; in addition, the forward messages can be used to calculate the marginal likelihood of the observations.
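The smoothing marginal p(z_n | y_{1:N}) ∝ α_n(z_n) β_n(z_n) described above can be sketched with unnormalized messages and a final per-step normalization; the 2-state HMM and the short observation sequence are toy assumptions.

```python
import numpy as np

# Forward-backward smoothing for a discrete-output HMM (toy model).
pi0 = np.array([0.5, 0.5])
Pi = np.array([[0.8, 0.2], [0.3, 0.7]])   # Pi[j, k] = p(z_n = k | z_{n-1} = j)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[k, y] = p(y_n = y | z_n = k)
obs = [0, 1, 1]
N, K = len(obs), 2

alpha = np.zeros((N, K))                  # alpha_n(k) = p(y_1..y_n, z_n = k)
alpha[0] = pi0 * B[:, obs[0]]
for n in range(1, N):
    alpha[n] = (alpha[n - 1] @ Pi) * B[:, obs[n]]

beta = np.ones((N, K))                    # beta_n(k) = p(y_{n+1}..y_N | z_n = k)
for n in range(N - 2, -1, -1):
    beta[n] = Pi @ (B[:, obs[n + 1]] * beta[n + 1])

gamma = alpha * beta                      # proportional to p(z_n | y_1..y_N)
gamma /= gamma.sum(axis=1, keepdims=True)

print(gamma.sum(axis=1))  # each row sums to 1
```

Note that the row sums of `alpha * beta` before normalization all equal the marginal likelihood p(y_{1:N}), which is exactly the denominator in Eq. 2.30.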

Viterbi Algorithm
The Viterbi algorithm addresses the problem of finding the most likely state sequence to have generated an observation sequence y_1, . . . , y_N, given a set of HMM parameters. It provides an efficient dynamic programming approach to computing this maximum a posteriori (MAP) hidden state sequence,

$$\hat{z}_{1:N} = \arg\max_{z_{1:N}}\, p(z_1, \dots, z_N \mid y_1, \dots, y_N).$$

Note that choosing the MAP sequence is not necessarily equivalent to independently choosing the maximum marginal at each node, ẑ_n = arg max_{z_n} p(z_n | y_1, . . . , y_N); the sequence of marginal maximizers may not even be a feasible state sequence for the HMM.
The Viterbi algorithm computes the most probable sequence of states in an HMM based on the dynamic programming principle that the minimum cost path to z_n = k is equivalent to the minimum cost path to node z_{n-1} plus the cost of a transition from z_{n-1} to z_n = k. Therefore, the MAP hidden state sequence ẑ_1, . . . , ẑ_N can be computed in four steps, originally presented in [45]. First, the minimum path cost to state z_1 = k is initialized for each k ∈ {1, . . . , K} as

$$\delta_1(k) = -\log p(z_1 = k) - \log p(y_1 \mid z_1 = k).$$

We will propose methods that have straightforward connections with the Viterbi algorithm for anomaly monitoring during robot manipulation in Chap. 4.
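The dynamic programming recursion described above can be sketched in log space (maximizing log-probabilities is equivalent to minimizing costs); the 2-state model and observation sequence are illustrative assumptions.

```python
import numpy as np

# Viterbi decoding for a discrete-output HMM (toy model).
pi0 = np.array([0.6, 0.4])
Pi = np.array([[0.7, 0.3], [0.4, 0.6]])   # Pi[j, k] = p(z_n = k | z_{n-1} = j)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[k, y] = p(y_n = y | z_n = k)
obs = [0, 0, 1]
N, K = len(obs), 2

log_pi0, log_Pi, log_B = np.log(pi0), np.log(Pi), np.log(B)
delta = np.zeros((N, K))       # best log path score ending in each state
psi = np.zeros((N, K), int)    # back-pointers to the best predecessor

delta[0] = log_pi0 + log_B[:, obs[0]]
for n in range(1, N):
    scores = delta[n - 1][:, None] + log_Pi   # scores[j, k]: via predecessor j
    psi[n] = scores.argmax(axis=0)
    delta[n] = scores.max(axis=0) + log_B[:, obs[n]]

path = [int(delta[-1].argmax())]              # backtrack from the best final state
for n in range(N - 1, 0, -1):
    path.append(int(psi[n][path[-1]]))
path.reverse()

print(path)  # → [0, 0, 1]
```

Here the decoder stays in state 1 while symbol 0 is observed and switches to state 2 for the final symbol, the jointly most probable explanation under this toy model.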

Nonparametric Bayesian Hidden Markov Model
Bayesian methods are widely applied in fields such as biomedicine, clinical trials, social science, and economics, and the data collected in these fields have become increasingly complicated with the rapid development of modern science and technology, including high/ultrahigh-dimensional data, big data, and mixed discrete and continuous data. Newly developed Bayesian tools for such data include Bayesian variable selection, Bayesian influence analysis (e.g., case deletion and local influence analysis), Bayesian estimation and clustering methods (including Bayesian networks and Bayesian clustering for big data), Bayesian hypothesis testing for discrete and continuous random variables, variational Bayesian analysis, and Bayesian clinical trial design. Moreover, Bayesian nonparametric methods avoid the often restrictive assumptions of parametric models by defining distributions on function spaces, especially for the prior in the Bayesian formulation. If effectively designed, these methods allow for data-driven posterior inference. For overviews of Bayesian nonparametric inference in recent decades, see [49][50][51]. Here we briefly describe two building blocks of Bayesian nonparametric methods for the hidden Markov model, the Dirichlet process and its hierarchical extension, which together formulate the hierarchical Dirichlet process hidden Markov model (HDP-HMM) for modeling multimodal observations in robotics.

Bayesian Nonparametric Priors
HMMs restrictively assume a fixed model complexity. In process monitoring, latent states can represent robot primitives. Clearly, not all robot skills have the same number of primitives, and even the same skill may vary when conducted under different conditions. To introduce flexibility into the number of inferred hidden states, we leverage priors defined as probability measures.
HMMs can also be represented through a set of transition probability measures G_j. Probability measures yield strictly positive probabilities that sum to 1. Consider using, instead of a transition distribution on latent state indices, a distribution across the emission parameters θ_j ∈ Θ. Then,

$$G_j = \sum_{k=1}^{K} \pi_{jk}\, \delta_{\theta_k},$$

where δ_{θ_k} is the unit mass for mode k at θ_k. The emission parameter at time t can then be drawn given the parameter in effect at time index t − 1, such that

$$\theta'_t \mid \theta'_{t-1} = \theta_j \sim G_j.$$

So, given θ_j, different probability weights are assigned to the possible successor candidates θ_k. We can also assign a prior to the categorical probability measure G_j. The Dirichlet distribution is a natural selection due to conjugacy. Thus, the transition probabilities π_j = [π_{j1} · · · π_{jK}] are independent draws from a K-dimensional Dirichlet distribution:

$$\pi_j \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K).$$

Additionally, the emission parameters are drawn from a base measure H, i.e., θ_j ∼ H. To remove the fixed bound on the number of modes, the Dirichlet process (DP) is used as the prior instead of the finite Dirichlet distribution. The DP is a distribution over countably infinite probability measures G_0 : Θ → R^+, where G_0(θ) ≥ 0 and ∫_Θ G_0(θ) dθ = 1. Any finite partition of Θ has a joint Dirichlet distribution:

$$p(G_0(\theta_1), \dots, G_0(\theta_K)) = \mathrm{Dir}(\gamma H(\theta_1), \dots, \gamma H(\theta_K)). \tag{2.40}$$

We can further summarize the probability measure p(G_0) = DP(γ, H) as

$$G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\theta_k}, \qquad \theta_k \sim H, \tag{2.41}$$

where γ is the concentration parameter and H is the base measure over the parameter space Θ. The weights β_k are sampled via a stick-breaking construction:

$$\beta_k = \nu_k \prod_{l=1}^{k-1} (1 - \nu_l), \qquad \nu_k \sim \mathrm{Beta}(1, \gamma). \tag{2.42}$$

For succinctness, the stick-breaking process is denoted β ∼ GEM(γ). The DP is used to define a prior on the set of HMM transition probability measures G_j. However, if each transition measure G_j is an independent draw from DP(γ, H), where H is continuous, like a Gaussian distribution, the transition measures have non-overlapping support. This means that previously seen modes (robotic primitives) cannot be selected again. To deal with this limitation, a hierarchical Dirichlet process (HDP) is used.
The latter constructs the transition measures G_j on the same support points (θ_1, θ_2, . . . , θ_K). This can be done when each G_j is only a variation on a global discrete measure G_0, such that

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_k}, \qquad p(\pi_j \mid \alpha, \beta) = DP(\alpha, \beta). \tag{2.43}$$
This HDP is used as a prior on the HMM. The implications are a mode complexity learned from the data and a sparse state representation. The DP(α, β) distribution encourages modes with similar transition distributions, but does not distinguish between self- and cross-mode transitions. This is problematic for dynamical data: the HDP-HMM yields large posterior probabilities for mode sequences with unrealistically fast dynamics. Fox et al. [52] introduced the sticky parameter κ into the HDP, yielding the sHDP-HMM, in which the transition probability π_j is defined as

$$\pi_j \mid \alpha, \beta, \kappa \sim DP\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}\right).$$

The sticky HDP increases the expected probability of self-transitions by an amount proportional to κ and leads to posterior distributions with smoothly varying dynamics. Finally, priors are placed on the concentration parameters α and γ, and on the sticky parameter κ. Latent state creation is influenced by α and γ, while self-transition probabilities are biased by κ. These priors allow us to integrate expert knowledge for better modeling than the expectation-maximization (EM) algorithm traditionally used with HMMs.

The sHDP-VAR-HMM
Many complex dynamical phenomena are not adequately described by observations that are conditionally independent given the state z_t. Instead, the observations may be a noisy linear combination of some finite set of past observations plus additive white noise. In such cases, the dynamical process can be modeled by an autoregressive (AR) process, in particular an order-r vector AR process, denoted VAR(r), with observations y_t ∈ R^d. The VAR(r) system can be considered an extension of the HMM: instead of having conditionally independent observations given the hidden state sequence, the system has conditionally linear dynamics. Figure 4.9 shows a graphical model representation and compares it to the sHDP-HMM described in the previous section. The VAR has simplifying assumptions that make it a practical choice in applications [52]. The switching regime can be combined with the sHDP-HMM from Sect. 2.5.1 to leverage the expressiveness of the VAR system together with the ability of nonparametric priors to learn the mode complexity of the model. The VAR(r) process with autoregressive order r consists of a latent mode z_t whose mode-specific linear dynamics are observed through y_t ∈ R^d. The observations have mode-specific coefficients and process noise:

$$y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t), \qquad e_t(z_t) \sim N\!\left(0, \Sigma^{(z_t)}\right).$$
This is a generative model for a time series {y_1, y_2, . . . , y_T} of observed multimodal data, with mode-specific matrices of regression coefficients A^{(k)} and measurement noise with symmetric positive-definite covariance matrices Σ^{(k)}. Given the observation data, we are interested in learning an order-r model, for which we need to infer {A^{(k)}, Σ^{(k)}}. We take the Bayesian approach by placing conjugate priors on both parameters for posterior inference. As both the mean and the covariance are uncertain, the matrix-normal inverse-Wishart (MNIW) distribution serves as an appropriate prior on the multivariate AR parameters. The MNIW places a conditionally matrix-normal prior on A^{(k)} given Σ^{(k)}:

$$p\!\left(A^{(k)}, \Sigma^{(k)}\right) = MN\!\left(A^{(k)};\, M, \Sigma^{(k)}, K\right) IW\!\left(\Sigma^{(k)};\, \nu, \Delta\right).$$
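To build intuition for the switching VAR generative model above, the following toy simulation draws a 2-mode switching VAR(1) process; the dynamics matrices, noise level, and fixed mode sequence are illustrative assumptions, not learned quantities.

```python
import numpy as np

# Simulate y_t = A^(z_t) y_{t-1} + e_t, e_t ~ N(0, Sigma), with one
# mode switch halfway through the sequence.
rng = np.random.default_rng(4)
A = {0: np.array([[0.9, 0.0], [0.0, 0.9]]),   # mode 1: slow, stable decay
     1: np.array([[0.0, -0.8], [0.8, 0.0]])}  # mode 2: rotational dynamics
Sigma = 0.01 * np.eye(2)
z = [0] * 20 + [1] * 20                        # assumed mode sequence z_1..z_40

y = np.zeros((len(z) + 1, 2))
y[0] = [1.0, 0.0]                              # initial observation
for t, mode in enumerate(z):
    y[t + 1] = A[mode] @ y[t] + rng.multivariate_normal(np.zeros(2), Sigma)

print(y.shape)  # (41, 2)
```

In inference the mode sequence z and the per-mode parameters {A^(k), Σ^(k)} are of course unknown; the sHDP prior of the previous section supplies the distribution over z while the MNIW prior above supplies the distribution over the dynamics parameters.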

Summary
In this chapter, a Bayesian nonparametric approach to defining a prior on the hidden Markov model is proposed, which allows for flexibility in modeling the complex dynamics of robot manipulation tasks. We consider underlying dynamics that can be well modeled as a hidden discrete Markov process, but for which the cardinality of the state space is uncertain. We use the hierarchical Dirichlet process (HDP) to examine an HMM with an unbounded number of possible states. Subsequently, the sticky HDP-HMM is investigated, which allows more robust learning of complex dynamics through a learned bias that increases the probability of self-transitions. Although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption that observations are conditionally independent given the discrete HMM state. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. Consequently, an extension of the sticky HDP-HMM that learns switching dynamical processes with a switching linear dynamical system is investigated. As a foundation, the detailed formulations and theoretical development of these models are explained step by step in each section and are widely used in the later chapters of this book.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.