Nonparametric Bayesian Method for Robot Anomaly Monitoring

In this chapter, we introduce an anomaly monitoring pipeline using the Bayesian nonparametric hidden Markov models after the task representation and skill identiﬁcation in previous chapter, which divided into three categories according to different thresholds deﬁnition, including (i) log-likelihood-based threshold, (ii) threshold based on the gradient of log-likelihood, and (iii) computing the threshold by mapping latent state to log-likelihood. Those method are effectively implement the anomaly monitoring during robot manipulation task. We also evaluate and analyse the performance and results for each method, respectively.


Related Work
The premise of performing anomaly monitoring is to endow robots with longerterm autonomy. To this end, we would like to discriminate between nominal and anomalous behavior while the robot executes a task. Researchers usually use two different methods to solve the problem of robot multimodal anomaly monitoring [1,2]. On the one hand, a probability statistical model is established for each modal signal such as Rodriguez et al. [3] classify the success and failure of the robot in the assembly task by extracting the magnitude of the single-modal output force signal [4]. Ando et al. [5] used the K-Nearest Neighbor (KNN) algorithm to independently measure the differences between modalities to monitor the abnormal behavior of a mobile robot. Pastor et al. [6] used multi-modal sensing signals to monitor the failure of the robot to flip the box with chopsticks. This method first uses DMP to describe the motion of each stage, and then models each modality independently and combines normal statistical features (e.g. mean, variance, etc.) of the data for establishing ASM of each movement primitive to realize abnormality monitoring. The occurrence of abnormality is defined as if the signal value at least five times within three consecutive seconds is lower than a given threshold. Chu et al. [7] proposed an anomaly monitoring approach by training multiple HMM models for each modality given the observed values, and then the log-likelihood of each HMM was calculated as the feature vector for support vector machine model SVM (Support Vector Machine, SVM).
On the other hand, a probabilistic statistical model is established for all modality signals. Clifton et al. [8] used Gaussian Mixture Model (GMM) to model multi-modal signals, and realized anomaly monitoring through extreme value theory. Qiao et al. [9] proposed the use of a sequence of recessive states in the HMM model to monitor abnormal instructions. Zhang et al. [10] proposed a hierarchical HMM (Hierarchical HMM, HHMM) for system intrusion anomaly monitoring under the premise of the HMM method. This method uses HHMM to model normal data and is reflected by anomalies in hidden state sequences. Out the system intrusion event. Park et al. [11][12][13] carried out an online anomaly monitoring method based on multi-modal sensory signals fusion during robot manipulation, where a parametric HMM (Left-to-right HMM) models a multi-modal time series of a robot performing normal motions, and proposes abnormal thresholds that change by the execution progress to achieve abnormal monitoring. We extend the HMM for anomaly monitoring in the subsequent part of this chapter that inspired by the HMM cases. In addition, the proposed method uses a variety of complementary modal information to detect a variety of potential anomalies in the human-robot interaction process on the system where the robot feeds the disabled, such as human abnormal contact and environmental collision.
Hu et al. [14] proposed an abnormal event monitoring method based on HDP-HMM. This method combines the Fisher Kernel into the OCSVM (One-Class Support Vector Machine) model, which has the advantages of generating models and discriminative models. Dilello et al. [15,16] first proposed the application of the nonparameterized Bayesian time series model (sHDP-HMM) for abnormal monitoring and fault diagnoses during robot operation, using the model to transmit force/torque of the robot to perform normal alignment tasks [17,18]. Sensing data is used for modeling, and the threshold value for abnormality monitoring is calculated from the validation data set. Then, the log-likelihood function value of the online observation data is calculated based on the learned model. Finally, the robot online is compared with the abnormality threshold value for anomaly monitoring.
In this chapter, we define an anomaly whenever a data stream is structurally different from prior learned models [19]. As Pimentel et al. review literature shows, many anomaly monitoring approaches have been proposed in the last decade [1]. However, few works focused on multivariate time series with spatial and temporal dependencies as in our case. In Milacski et al. [20], the method requires the full data sequence for processing, thus precluding the online detection capability. Besides our previous works [21,22], others have used non-parametric Bayesian approaches to learn HMMs in contact tasks for abnormal detection and recognition. In [16,23], DiLello et al. used the sticky hierarchical Dirichlet process prior to learn the model parameters of an HMM based on wrench signatures for an industrial robot alignment task. The work learned specific failure modes online. The approach is robust as it is models not just individual contact events but rather the evolution of signals, it does so while incorporating domain knowledge, and adjusting its model complexity according to the data patterns presented. Nonetheless, the approach is limited by the assumptions of the HMM in which observations are conditionally independent given the state. This is often an insufficient condition to capture the temporal dependencies of complex dynamical behaviors. In [24], Niekum et al. used a beta-process prior on the vector auto-regressive (BP-AR-HMM) to extract the state complexity from the data. The BP-AR-HMM has the ability to discover and model primitives from multiply related time-series. Robot skill discrimination was reliable and consistent; however, this work only modeled robot joint angles in a primarily noncontact oriented task. In [25], Hu et al. implemented anomaly monitoring based on HDP-HMM models and analyzed the efficiency and effectiveness of beam sampling in the HDP-HMM model. This implementation however suffers of low computational efficiency. Instead, we explore efficient online inference algorithms instead of the conventional sampling algorithms.

Problem Statement
Anomaly monitoring is used to continuously monitor the motion status of the robot to identify unexpected events during execution. Based on the recognition of robot motion in Chap. 3, this chapter describes in detail the abnormal event monitoring method for robots based on non-parametric Bayesian models. The proposed method is also applicable to the subsequent anomaly diagnoses. The main implementation ideas are: first, modeling the normally executed multi-modal data; then, test the normal movement behavior data based on this model. In order to obtain the definition of the boundary between normal and abnormal, that is, the calculation of the abnormal threshold; finally, monitoring the testing samples based on the model and the corre- sponding threshold. In order to better study the theory and methods of multivariate time series anomaly monitoring, this chapter proposes the non-parametric Bayesian models HDP-HMM and sHDP-HMM in different observation models based on the parametric HMM model. An anomaly monitor implemented under the combination of modal information and different anomaly threshold methods. The anomaly decision of the proposed anomaly monitor on robot motion behavior includes: normal, single anomaly, and multiple anomalies.
An ideal online anomaly monitoring system often requires the use of multimodal fusion methods to respond to different types of anomalous events, such as force/torque sensing information that can directly detect robotic collision anomalies and tactile sensing information that can directly sense objects. And the ability to respond to abnormal event occurrences in a timely manner (high response rate) and low false touch rate. In order to meet such requirements, this chapter addresses the problem of multi-modal fusion robot anomaly monitoring, and proposes the performance of anomaly monitors with three different probability models, different sensor information fusion, and different anomaly thresholds. The overview framework proposed after skill identification in Chap. 3 as shown in Fig. 4.1.

Problem Formulation
The robot introspection model uses non-parametric Bayesian priors along with a Hidden Markov model (HMM) and either Gaussian or Autoregressive emission models. First, HMMs are a doubly stochastic and generative process used to make inference on temporal data. The underlying stochastic process contain a finite and fixed number of latent states or modes z t which generate observations Y = {y 1 , . . . , y t } through mode-specific emission distributions b(y t ). These modes are not directly observable and represents sub-skills or actions in a given node of a task. Transition distributions, encoded in transition matrix A ji , control the probability of transitioning across modes over time. The model assumes conditionally independent observations given the generating latent state. Given a set of observations, the Baum-Welch algorithm is used to infer model parameters Π = (A, b). The fixed-modes assumption in HMMs limits the model's expressive power as it is unable to derive natural groupings. Bayesian nonparametric priors extend HMM models to learn latent complexity from data [26,27]. We use Fox's et al. sticky-Hierarchical Dirichlet Process (sHDP) prior with an auto-regressive switching system [26] to model nominal skills as in our previous work [21] and derive more robust failure identification methods in manipulation tasks, specially in recovery scenarios.
Bayesian statistics are combined with the sHDP prior to both learn model complexity k from the data but also to model the transition distribution of an HMM. The sticky parameter in the prior discourages fast-mode-switching otherwise present. Consider a set of training exemplars t = {x 1 , . . . , x T } of observed multi-modal data τ consisting of Cartesian pose and wrench values. Then, a mode-dependent matrix of regression coefficients with an r th autoregressive order and d dimensions is used along with a measurement noise Σ, with a symmetric positive-definite covariance matrix. The generative model for the sHDP-AR-HMM is summarized as: The probability measure G j , which models the transition distribution of the modes π j determines the weights (probabilities) of transitioning between modes δ θ k . To avoid fixing the mode complexity, G j uses a prior G 0 that is unbounded and can grow with the complexity of the data. While G j uses the same set of modes as G 0 , G j introduces variations over those points. G 0 provides support for a possibly infinite space, but due to the Dirichlet's process properties (the Chinese Restaurant Process), a finite set of modes is selected. In fact, we can understand the hierarchical specification as G j = D P(α, G 0 ). Different observation models can be included into the HMM. A Gaussian distribution with different covariance models (full, diagonal, and spherical) are considered. For Gaussian models, mode specific means and standard deviations are used θ z t = N (μ, σ 2 ). Additionally, the sHDP-HMM can be used to learn VAR processes, which can model complex phenomena. The transition distribution is defined as in the sHDP-HMM case, however, instead of independent observations, each mode now has conditionally linear dynamics, where the observations are a linear combination of the past r mode-dependent observations with additive white noise. A prior on the dynamic parameters {A (k) , Σ (k) } is necessary. A conjugate matrix-normal inverse Wishart (MNIW) was used to this end. The generative process for the resulting HDP-AR-HMM is then found in Eq. (4.2).
Observation Dynamics: . By using the model over a set of multi-modal exemplar data t , the sHDP-AR-HMM can discover and model shared behaviors in the data across exemplars. Scalable incremental or "memoized" variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM family of algorithms along with mean values for the model parameters θ of a given skill, hence

Skill Identification
The robot introspection system simultaneously detects nominal skills and anomaly events. Models are trained for individual skills to capture dynamics from multi-modal observations through vector τ m . τ m consists of end-effector pose and wrench values. Scalable incremental or "memoized" variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM along with mean values for the model parameters Π of a given skill s. Hence Π s = {π, A}: the transition matrix and regressor coefficients. Given S trained models for M robot skills, scoring is used to compute the expected cumulative likelihood of a sequence of observations E [log P(Y |Π s )] for each trained model s ∈ S. Given a test trial r , the cumulative log-likelihood is computed for test trial observations conditioned on all available trained skill model parameters log P(y r 1 :r t |Π) S s at a rate of 200 Hz (see Fig. 4.2 for an illustration). In Fig. 4.2, note the 4 probabilistic models. Given an indexed position in the graph, an anomaly threshold corresponding to that skill's expected cumulative log likelihood. The figure also illustrates how at the end of a skill, all data is reset and restarted. The process is repeated when a new skill m is started. Given the position in the graph s c , we can index the correct log-likelihood I(Π s = s c ) and see if its probability density of the test trial given the correct model is greater than the rest: Illustration of cumulative log-likelihood (L-) curves for four skills s in a task. As such, 4 probabilistic models Π s are derived. Given an indexed position in graph N i , one can generate four L-curves. The curve for the correct model, has a value increasing likelihood, while the other graphs have value decreasing likelihoods. One can also see how at the termination of a skill, all data is reset and restarted. No anomalies are shown in this plot log P(y r 1 :r t |Π correct ) > log P(y r 1 : If so, the identification is deemed correct, and the time required to achieve the correct diagnoses recorded. At the end of the cross-validation period, a diagnoses accuracy matrix is derived as well as the mean time threshold value (these results were also reported in [21]).

Anomaly Monitoring
Due to the characteristics of abnormality, diversity, and unpredictability in the robot's motion behavior, unsupervised learning is the only way to achieve abnormality monitoring. In particular, this section proposes the use of log-likelihood function values based on non-parameterized Bayesian probability statistical models to implement anomaly monitoring. The main reason is that the probability model learned from normal data will behave when testing anomalous values The cumulative loglikelihood function value decreases sharply. Additionally, the curve of the cumulative log-likelihood function value that calculated from normal movement behavior will show an increasing trend, and the curve calculated with abnormal motion behavior will drop significantly when anomaly occurred. Therefore, a method based on the log-likelihood function of the model is proposed to realize the abnormal monitoring of the robot's motion behavior by setting the abnormal lower limit threshold.
Anomaly monitoring assumes that the cumulative log-likelihood L of a set of nominal skill exemplars share similar cumulative log-likelihood patterns. If so, the expected cumulative log-likelihood derived in training can be used to implement an anomaly threshold F1. Initially, we consider a likelihood curve generated from training data for a given skill s. Then, for each time step in an indexed skill s c , the anomaly threshold is set to where c is a real-valued constant that is multiplied by the standard deviation to change the threshold. Here, we are only interested in the lower (negative) bound. Then, an anomaly is flagged if the cumulative likelihood crosses the threshold at any time: if log P(y r 1 :r t | Π correct ) < F1 s c : anomaly, else nominal. However, it is found that the threshold obtained by Eq. (4.3) makes the abnormal monitor frequently trigger falsely (False Positive) during actual application, which means that the normal execution condition monitoring becomes abnormal. In addition, the determination of constant variables requires tedious interactive verification. Aiming at this problem, this section refers to the experience of robot motion recognition, and determines the threshold of abnormality by solving the gradient information of each log-likelihood function value vector. Upon our initial exploration of recovery schemes we noticed that after resetting the cumulative log-likelihood observations in the HMM model, false-positive anomalies were triggered at the beginning of the skill. Further examination revealed that the standard deviation of cumulative log-likelihood graphs during training began with small variances but grew over time as shown in Fig. 4.3. Given that variances are small at the beginning of the task, small variations in observations can trigger failure flags.
A second threshold definition was designed to overcome this situation and used in our work instead of F1. As the difference between L and F1 is minimal at first the new anomaly threshold F2 (for an indexed skill) is focused on computing the derivative of the difference:

Experiments and Results
Three manipulation experiments are designed to test the robustness of the online decision making system for anomaly recovery. We use accidental human collisions as disturbances for a pick-and-place and open-and-close-drawer tasks, and tool col-   lisions for a pick-and-place task. For each of the three experiments, we test four separate conditions: (i) no anomalies (ii) anomalies without recovery (iii) one anomaly caused at each executed skill and (iv) multiple anomalies caused at each skill.
The pick-and-place task consists of 5 basic nodes (not counting the home node and the recovery node): pre-pick, pick, pre-pick (returns to an offset position), preplace, and place. The open-and-close-drawer task consists of 5 nodes: pre-grip, grip, pull-to-open, push-to-close, and go-back-to-start. The direction and intensity of the human collisions was random but all executed under the sense that these are accidental contacts as a human user reaches into the workspace of the robot without noticing the robot's motion. We note that while the tasks are simple, the main focus of the work is placed on the robustness of the recovery systems given different external disturbance scenarios. A dual-armed humanoid robot -Baxter-was used. All code was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with an Intel Xeon processor, 16 GB RAM, and 8 cores.
For motion generation, two techniques were used for the different tasks. For the pick-and-place task, we used Baxter's internal joint trajectory action server which uses cubic splines for interpolation. The goal target is identified online through imageprocessing routines. For the open-and-close-a-drawer task, we trained dynamic movement primitives using kinesthetic teaching.
In terms of robot introspection, the sHDP-HMM code with "memoization" variational coordinate ascent, with birth and merge moves was implemented using BNPY [28] and wrapped with ROS. Training used 10-trial batches. Observations used 13 dimensional vectors composed of 7 DoF pose values (position and quaternion) and 6 DoF wrench values. A baseline HMM algorithm was implemented through hmmlearn [28] and wrapped with ROS. The anomaly threshold for each skill was computed through leave-one-out cross-validation. .5 shows a representative image of the Baxter robot attempting a pick operation. A human collaborator accidentally collides with robot before a pick action. Note that collisions were strong enough to move the current pose significantly from the intended path and sometimes collide with other parts of the environment. The robot introspection system identifies an anomaly and triggers a recovery behavior. The lower left part of the image shows the anomaly F2-metric flagging the anomaly. The system then begins recovery as seen in the directed graph on the right (implemented in ROS-SMACH). Video and auxiliary data for the three experiments under the four conditions are available in [21].
For results reporting, two markers are provided: (i) A success recovery rate for the recovery policy and (ii) an F-score, precision, and recall numbers for assessing anomaly identification. In particular, these markers are used under experimental conditions (iii) and (iv) conditions described at the beginning of the experiment and results section. The first two conditions are use as a baseline and compare with the generic recovery behavior success rates.
(1) Human Collisions One anomaly per node: 27 test trials were used for the pick-and-place task and 24 test trials were used for the open-and-close drawer task. Given that both of the tasks have 5 nodes, a total of 5 anomalies were induced in the task, 1 per node. Table 4.1, shows the recovery success rate of the task (represented by the F-score) and the robustness of the identification system through the precision-and-recall quantities for both micro and macro settings. The pick-and-place recovery had an average Table 4.1 Results for accidental human and tool collisions. Recovery strategy success rates are presented by the F-score, and robustness of the robot introspection system for anomaly is shown in the recall and precision settings. We also compare the performance of the more expressive sHDP-HMM model with an HMM that serves as a baseline. Generally, the sHDP-HMM model allowed for better introspection and thus better recovery rates and better recall success rate of 85% with a maximum of 88%. The precision was 100% (for both macro/micro) indicating strong resistance to false positives. The recall was ∼82% (for macro/micro) indicating the existence of some false-negatives. This might indicate that there might have been some collisions that were of lower magnitudes than the ones we might have trained with and were not detected by the system. For the drawer task, the recovery success rate was 91.67%. The precision was 95% (macro/micro) and recall was ∼84.5%. As with the human collision, weaker contact signals in tests compared to training might be the reason for the presence of false-negatives in our system. Multiple anomalies per node: Under this scenario the pick-and-place task ran 42 test trials and the drawer task ran 30 test trials. Five anomalies were induced repeatedly one-after-the other for each node. The pick-and-place recovered 85% of the time with a precision of ∼95% and a recall of ∼84.5% (micro/macro). The drawer task recovered 93.3% of the time with a precision of ∼100%, and a recall of ∼93% (micro/macro).
(2) Tool Collisions For the tool collision experiment, we only tested recovery in the pick-and-place task. The results for this scenario were similar to that of human collision. The recovery success rate was ∼88.89%, the precision 100%, and the recall 88.89%.

Discussion
This work showed the ability of a system to recover from unmodeled and accidental external disturbances that can't be anticipated. Such disturbances will be more common in shared human-robot workspaces. Our work demonstrated that our recovery strategy in connection with our previous introspection framework recovered 88% of the time from accidental human and tool collisions under single-anomaly and multiple-anomaly scenarios per node. The results indicate the system can recover at any part of the task, even when it is abused and multiple anomalies occur consecutively. From the videos in [21], we can see that even when in cases where the robot is in constant duress, the robot recovers consistently. Such robustness will play a role in enabling robots have increasing levels of long-term autonomy.
We by using a more expressive model to do robot introspection, our recovery ability also increased. We will continue to explore improved models that can better capture spatio-temporal relationships of high-dimensional multi-modal data. As well as looking for representations that scale over time in order to acquire a useful and practical library of skill identification and motion generation.
Yet there is much work to be done. Manual annotations for state-dependency are an important weakness. Crucially we wish to move towards modeling human understanding for decision making in the midst of robot-object-environment interactions. Scalability is an important factor in this domain as the system must scale to ever growing number of learned tasks. Learning how anomalies and recovery decisions are made and re-used across similar scenarios will be important. The manual approach will not scale. Adaptability and not only reverting is also important. Incremental learning is also key. The current work is limited to reverting. By simply revering we don't model recovery behaviors and also do not learn how to adapt. We also cannot handle new scenarios. We must develop action-confidence metrics that let us learn new scenarios on demand.

Gradient of Log-Likelihood for Anomaly Monitoring
Based on the implementation in Sect. 4.4, it is difficult to determine the constant parameters c in actual applications, which leads to frequent false positives. To address the problem, this section explores an anomaly monitoring method based on the loglikelihood function gradient value and mathematically verifies its implementation process to prove its feasibility and effectiveness. After identifying the motion behavior s, this method assumes the parameters of the specific motion behavior and its established probability model are known as Θ. From Sect. 2.4.1, we can know that the log-likelihood function value for the observation sequence can be determined by the probability in all credible states, and the simplified description is Therefore, given an HMM model Θ and an incoming multimodal time series Y 1:t , the natural logarithm of the filtered belief state associated with the forward model for latent state i ∈ {1, 2, . . . , N } can be represented according to Eq. (4.5). To compute L t , we first compute log α i (t). According to the forward-algorithm, we derive: From Eqs. (4.5) and (4.6), we know that log α i (t) can be computed recursively through log α * (t − 1). Expanding the log in Eq. (4.5) we have:

The Hidden Markov Model Forward Gradient • Viterbi Path in HMMs
The Viterbi algorithm, expanded in Eq. (4.8), attempts to estimate the most likely state sequence. Viterbi uses dynamic programming to estimate the underlying state sequenceẑ 1:t through MAP computations given a sequence of observations y 1:t : arg max First, we introduce simplified notation for the Viterbi algorithm, where a node j has a maximal belief state δ t ( j) = max z 1:t P(z 1:t = j | y 1:t ) with associated tracebackẑ t . Good models are those that predict their data as accurately as possible and can be achieve through two steps: (i) HMM parameters optimization: the Baum-Welch algorithm given a proper initialization or similar algorithm incrementally optimize HMM parameters until a local maximum of likelihood is reached; minimizing the perplexity of the model. (ii) Model selection optimization: this consists in selecting the number of hidden states and values for observation models. Many techniques exist including BIC, MCMC, Variational Bayes, or non-parametric HMMs [29].
Proof Consider, without loss of generality, a Viterbi graph where we examine two consecutive time-steps (t − 1, t) along two possible latent states l, k (the analysis generalizes to HMMs with more states). We also assume ∀i, i = arg max j A i j ; that is, all hidden states states tend to self-transition. Also at time t − 1: (4.9) We also define the following symbol: (4.10) Due to our first assumption, we have: max j w ji (t) = w ii (t). Then, at time t, the δ values are: According to (4.9) and our max weight formulation, the max function in (4.11) is: So, the max state l at time t − 1 will contribute to itself instead of k at time t. Therefore, there is only one condition under which the Viterbi sequence breaks: In other words, given our original assumption, the Viterbi sequence breaks if the following inequalities are met: In ratio form: and If an observation is emitted by state l and it is not undergoing a state switch, , inequality (4.14) fails and the Viterbi sequence does not break.
When we transition from state l to k and begin emitting w kk (t) > w ll (t). However, the momentum in δ t (k) and δ t (l) prevent their inequality relationship to switch. Nonetheless, after p time steps, the inequality The latter is reasonable when state k has been emitting for some time. It is before this time t − 1 + p that inequalities (4.13) and (4.14) are met. To see why, one will notice the left hand side inequalities in (4.13) and (4.14) are larger than 1 while the right hand side ratios become smaller than 1 at time t − 1 + p, thus their inequality relationships must've swapped before this time. Note that the order in which inequalities (4.13) and (4.14) are met matters. If (4.14) is met but (4.13) is not, a clean cut occurs since we have: and Equation (4.16) asserts a switch from l to k. Equation (4.15) states that δ t−1 (l) contributes to δ t (k), implying the previous max state contributes to the next max state. And since the max state has changed, the roles of l and k swap and inequality (4.14) begins to fail and renders sequence break unattainable. This yields a clean transition cut. If (4.13) and (4.14) are met simultaneously, a sequence break occurs. But since (4.14) is met, the states switch and the roles of l and k swap and preclude a further sequence break. If, let's say at time t a , inequality (4.13) is met but (4.14) is not, a future sequence break is destined to happen at the future moment when (4.14) is met, let's say at time t b . When that sequence break occurs, the history between t a and t b flips and after the sequence breaks the state transitions from l to k. Again, roles switch and no further sequence breaks occur. Above all, we can safely conclude that the during the execution of the stable period of a hidden state, no sequence break will occur. It is only during state transitions that a sequence could break and even if it does, it only last for a single time step-the time step when inequality (4.14) is met. The analysis extends to HMM models with more than 2 hidden states, so long as we apply the analysis to pairs of hidden states composed of the current max state and another non-max state. Proof According to the log-exp-sum trick [29], the approximation log N i=1 exp(y i ) ≈ max i∈{1,...,N } y i is best approached for larger values of y i . Applying this approximation to Eqs. (4.5) and (4.7), which is supported by Theorem 4.1, we have: Substitute Eq. (4.18) into Eq. (4.17), rename i and j to z t and z t−1 , then recursively decompose log α, we have: Equation (4.19) is the log version of Eq. (4.8). This suggests that, after approximations by Eqs. (4.17) and (4.18), the computation of the log-likelihood is the same as the computation of the Viterbi path using the Viterbi algorithm. Now, since Theorem 4.1 shows that in general the Viterbi path at time t,ẑ 1:t , expands on the Viterbi path at time t − 1,ẑ 1:t−1 , we have: (4.20) Then, the forward gradient can be derived from Eq. (4.20) as: Equation (4.21) supports Corollary 4.1, where the forward gradient depends on the latest emission probability bẑ t (y t ) and transition probability from hidden stateẑ t−1 toẑ t . Also, given that good HMM models have strong inertia (high probabilities of self-transitions), state-switching should be sparse and thenẑ t−1 will equal toẑ t most of the time.

• Normal Skill Identification Based on the Forward Gradient
Corollary 4.1 led us to design a new method for skill identification. If we use n HMM models to represent n robot skills, with observations coming from a certain skill, the HMM model corresponding to that skillm, should output a value-increasing forward log-likelihood curve that is greater than the rest of the HMM models. This also means modelm will output a larger forward gradient value compared to other models. The forward gradient depends on the latest emission probabilities, which in turn depend on the latest observation. The largest probabilities and thus gradients will belong to the HMM model of a currently executing robot skill. The key insight however is the inverse relationship: the use of the forward gradients to infer the currently executing skill, and then it can validate the strong correlation between the forward gradient and skill observations. This is contrasted with the log-likelihoods of the observations log P(Y 1:t |Π) do not. The forward-gradient measure for skill identification is defined as follows: given p skills s 1 : s p , we have HMM models m s for s ∈ {1, . . . , p} and an input time series Y , then the most probable skillŝ generating Y [t] is inferred as: where, ∇ L m s t (Y ) is the forward gradient output by model m s at time t computed using time series Y .

Anomaly Monitoring Based on the Forward Gradient
We now build on the premise established in Eq. (4.22). Furthermore, consider a set of nominal observations for an executing skill, we know that the corresponding skill HMM model will output a value-increasing forward log-likelihood curve, and hence, a stable positive forward gradient. So, when an anomaly occurs, the forward gradient decreases significantly as illustrated in Fig. 4.6. Given that anomalies influence the forward gradient value, we propose a gradient-based metric for HMM anomaly monitoring.
Consider an HMM model m representing a skill s with n time-series trials Y i for i ∈ {1, . . . , n} collected from successful executions of skills s ∈ S. To detect anomalies in a new time series Y we can first derive: where T i is the time length of trial Y i and ∇ L m t (Y i ) is the forward gradient output by model m at time t computed using time series Y i . Then, we use an empirically-derived test to trigger an anomaly for Y : This test detects if the gradient is an outlier compared with gradients of successful skill executions.

Experiments and Results
As for experimental setup, a dual armed humanoid Baxter robot was used to perform a pick-and-place operation. The robot consisted of a Robotiq force-torque sensor and standard Baxter fingers. 5 nominal trials were used for training the HMM model. In testing, 5 trials were used for skill identification and anomaly monitoring respectively. The pick-and-place task consists of 5 skills: (i) hover over the picking position, (ii) grasp the object, (iii) lift the object, (iv) hover to the placing position, and (v) place the object. For training, the observation vector concatenates a 7-dimensional Cartesian endeffector pose and a 6-dimensional wrench. For each skill, we train corresponding HMM models using the Baum-Welch algorithm. The number of hidden states is selected such that emission probabilities are maximized leading to distinct and uniquely grouped hidden states.
For results reporting, we use the three factors identified by Pettersson's survey on Event Detection [30]. Namely: diagnoses accuracy, robustness (false-positive rate), and reaction time (the time it takes to identify a skill from the beginning of a skill execution). Note that for anomaly identification, internal and external perturbations are used including: unexpected movement of target object, object absence, slippery picks, and human collisions.

• (B) Reaction Time Performance
In terms of reaction time, a percentage is computed to assess the time it takes for the identification to execute from the beginning of a skill. The reaction percentage, using t as "true", is computed as: offset = predicted-t_beginning, length=t_end-t_beginning, and reaction=offset/length. The closer the reaction percentage is to 0% the better the identification method. A negative reaction percentage means the predicted start occurs earlier than ground truth, while a positive percentage implies a delayed identification. We assess two forms to determine the beginning of a skill (i) use the "first skill" occurrence, or (ii) use the "first 10 successive skill" occurrences. The reaction percentage for these two formats is found in Table 4.2. The average reaction time for absolute values (i.e. looking at the average time difference of the prediction, whether early or late) the "first skill" is 2.70% across all skills and the average reaction time for the "first 10 skills" is 0.97%. Between the two measures, we have a total average of 1.84%.

• (C) Anomaly Monitoring Performance
For anomaly monitoring performance of our gradient-based method, we use two environments: (i) anomaly identification as it occurs and any false-positives before an actual anomaly occurs, and (ii) an external collision is given to the robot to trigger a recovery. Then, when the robot completes its recovery behavior, we count how many false-positives are triggered before moving to the next skill execution. The robot recovery behavior is detailed in [31]. Five nominal and five anomalous trials are used for the analysis. The results are compared with two other baseline methods: the magnitude-based metric from the description in normal skill identification based on the forward gradient, and the derivative-of-difference metric from the definition in anomaly monitoring based on the forward gradient.
The five anomalous trials contain a total of 14 anomalies, consisting of: (i) one anomaly caused by the displacement of the target object (ii) one anomaly caused by no target object (iii) two anomalies caused by slippery picks (iv) five anomalies caused by human collisions to the robot gripper during each skill execution (v) five anomalies caused by human collisions to the robot arm during each skill execution.

• (D) Pre-recovery Performance
For pre-recovery performance computation, during each trial, we record the anomalies triggered by the testing metric and count its true positives, false positives and false negatives. The results are shown in Table 4.3. The result shows our proposed forward gradient detected all anomalies and triggered no false positives or false negatives. The other two baseline methods suffer from false positives though they deliver high true positives.

• (E) Post-recovery Performance
For post-recovery performance metrics, we trigger an intentional anomaly, after recovery is completed, we count any false-positives before next skill execution. Both the forward gradient method and the derivative-of-difference method had not falsepositives. Whilst the magnitude-based metric had more than 10 and prevented the system from continuing its task execution.

• Comparison with Related Works
Comparisons across works is challenging as results use different formats across experiments. Table 4.4 is an effort to harmonize results across related literatures. The comparison should be done loosely as different tasks (small levels of contact vs. large levels of contacts, structured environment vs. unstructured environment) present different challenges to event detection. For skill identification, our current approach ranks 2nd behind the tool breakage work that identified anomalies in structured milling processes. Our work did better than [16,22], albeit these works modeled more complex dynamical phenomena. Similar statements can be made about anomaly identification. As for reaction times, our approach offers about double the speed-up compared to the only other work that reported this number. In conclusion, based on internal and external evidence, we hold that our measure is the most robust, stable, and fastest measure reported to date.

Discussion
This work presented a theoretically derived event detection measure useful for nominal and anomalous behavior identification, even in post-recovery actions. Our results Table 4.4 Skill and anomaly identification, and reaction time comparison for state-of-the-art eventdetection methods Technique ID accuracy (%) AFF/DCC/CSM/SVM [32] 84.66 a sHDP-HMM [16] 89.50 RCBHT w/multiclass SVM [22] 97.00 HMM w/GradientBased measure (current) 98.40 Tool breakage SVM [33] 99.38 Technique anomalyID (%) Accuracy HMM, varying threshold [11] ∼80.00 MLP [12] 83.27 sHDP-VAR-HMM, mag metric [21] ∼85.00 sHDP-HMM [16] 87.50 RCBHT w/multiclass SVM [22] 97.00 HMM, gradient metric (current) 100.00 Technique Reaction time (%) sHDP-VAR-HMM, mag metric [21] 3.70 b HMM, gradient metric (current) 1.84 a Average of Table 3 is cited from [32], k-fold: gesture classification performance (novice, inter, and expert and both macro/micro levels) b Average reaction time for their best multimodal setup showed very strong performance compared with state-of-the-art results across the board. More experimental validation is certainly necessary: both in number of trials and robotic tasks. This work also remains to be tested in the area of anomaly diagnoses. The latter is concerned not only with the identification problem with the grouping of anomaly types which is more challenging. We anticipate working in conjunction with machine or deep learning models for the diagnoses of this signals. Some works [11,12,16,23] already provide some characterizations.

Hierarchical Dirichlet Process Vector Auto-regressive Hidden Markov Models
HMMs segment sequential data into interpretable hidden states that are assumed Markovian. They are limited by its a priori determination of hidden states (and fixed throughout the modeling process) as well as the HMMs limited ability to model complex dynamics due to the assumption that observations are conditionally independent given the state. Instead, this section uses Bayesian non-parametric techniques that model HMMs through priors that can model an unbounded number of hidden states. In particular, we use HDP priors to learn HMMs with infinite hidden states. We also use the switching vector auto-regressive (HDP-VAR-HMM) to model multimodal observation sequences and its element consists of features extracted from an F/T sensor, joint encoders, and a tactile sensor (for a discussion on the used features see experimental setup). Figure 4.9 illustrates graphical model representations for the sHDP-HMM and the sHDP-VAR-HMM. Assume that we record N sequential executions for each skill and expect to learn the common dynamics among them. We utilize the HDP-VAR-HMM to jointly model those sequences, where sequence n ∈ N has multidimensional observation data y n = [y 1 , y 2 , . . . , y T ] at time t. For instance, y t ∈ R d could be an instant of wrench signatures and end effector velocity, where d is the dimensionality of observed features. The HDP-VAR-HMM interprets each observation y t by assigning it a single hidden state z t . The hidden state value is derived from: (i) a countably infinite set of consecutive integers z t = k ∈ {1, 2, 3, . . .}, (ii) an initial probabilistic state distributions π 0 generated with Markovian dynamics, and (iii) a transition distribution {π } ∞ k=1 . Then, the Markovian structure on the state sequence dictates that for all t > 1: (4.24) Since z t ∼ π z t−1 , we see that z t−1 indexes all observations y t are assigned with z t−1 . We can draw the observation y t given specific hidden state z t = k from its corresponding observation likelihood functions. In our case, we consider the first-order auto-regressive Gaussian likelihood. As such, observations y t are considered to be a noisy linear combination of the previous observation plus additive white noise ε and can be expressed as: where, each state k has two dynamic parameters {A k , Σ k }: A k , a d × d matrix of regression coefficients that defines the expected value of each successive observation as E[y t |y t−1 ] = A k y t−1 , and Σ k , a d × d symmetric positive definite matrix that defines the covariance matrix of state k. Therefore, the Gaussian regression observation model explains a observed data pairs of input y and output z. Each input is a vector of length y ∈ R d , while each output is a scalar. In the paper, we assume that the input data y are fixed dependence on previous observation and focus on a generative model for learning the output data z which depends on y. Various generative models, such as full-covariance Gaussian, diagonal-covariance Gaussian, are possible for the observed input data. These can be directly define the multivariate Gaussian log-likelihood function given specific state z k ,

(4.26)
However, it is often the case that we don't know the parameters of this distribution, but instead to estimate them. We need to learn the approximate values θ k = {A k , Σ k } for each state by defining a conjugate prior distribution on it. Particularly, the Matrix Normal Inverse Wishart (MNIW) distribution is conjugate for both the A k and Σ k are uncertain. This distribution assume that when only the covariance Σ k is uncertain, the conjugate prior is defined as d-dimensional Inverse Wishart (IW) distribution with covariance parameter Δ, a symmetric, positive definite scale matrix, and ν, the degrees of freedom, i.e.
Σ k ∼ I W (ν, Δ), (4.27) where the first moment is given by the mean After that, we place a conditionally Matrix-Normal (MN) prior on A k given Σ k , which is given by where M is the mean matrix of A k and V is a symmetric, positive definite matrix that determine the covariance of A k relative to Σ k . Therefore, we derive the useful expectations The resulting log-likelihood function of joint distribution of A k , Λ k is defined as where, (4.32) where Γ () is represented the gamma function. Therefore, all the optional hyperparameters of HDP-VAR-HMM are specified by ν, Δ, M, V . Apparently this straightforwardly method is computationally feasible when those four parameters are known. We will specify the detailed options in experimental section through all our experiments.
With reference to the Eq. (4.24), for the probabilistic representation of transition distribution π j under the HDP-VAR-HMM prior, the number of states is unbounded and every observation assigned a unique state. The HDP prior encourages sharing states over time through a latent root probability vector β over the infinite set of states (see Fig. 4.9). Consider the probability ∞ j=1 β j = 1 on a countably infinite set, where defines the sticking-breaking construction of the prior on β and first draws independent variable μ j ∼ Beta(1, γ ) for each state j, and then sets (1 − μ ), where the concentration parameter γ controls the relative proportion of the weights β j . Here, we interpret μ j as the conditional probability of choosing state j amongs states { j, j + 1, j + 2, . . .} In expectation, we always describe the first K orders in stick-breaking as the most common states and represent their probabilities with the vector [β 1 , β 2 , . . . , β K , β >K ], where β >K = ∞ j=K +1 . Given this (K+1) dimensional probability vector β, we simplify the HDP-VAR-HMM transition distributions π j for each state j from a Dirichlet distribution with mean equal to β and variance governed by concentration parameter α > 0 Generally, we define the starting probability vector π 0 from the same prior, but with much smaller variance, π 0 ∼ Dir(α 0 β) with α 0 α, which denotes that few starting states are observed.
To capture the sufficient dynamics of given time series n with T n observations. Our goal is to interpret these observation with the hidden state z j . However, instead of assigning variable states in a short time period, we should treat each observation in the context of its predecessors for improving the temporal consistency. A "sticky" parameter κ of favors self-transition by assigning extra prior mass on the transition probability π j j was introduced. Thus, the transition representation will in a temporally consistent way, that is (αβ 1 , . . . , αβ j + κ, . . . , αβ >K ). (4.34) where κ > 0 governs the degree of self-transition bias. Informally, what we are doing is increasing the expected probability of self-transition for keeping the dynamical phenomenon consistency for many time-steps. After constructing the sHDP-VAR-HMM model, we utilize the state-of-the-art variational inference algorithm for learning the parameters, which named as Split-Merge Monte Carlo method by Michael C. hughes et al. More information can be found in [28].

Anomaly Monitoring Based on Hidden State
The anomaly detector identifies whether a skill execution contains: no anomalies, one anomaly, or multiple anomalies. In this section, we introduce the mathematical modelling that makes up our detector.

• (A) Hidden State Representation
Hidden state space modeling of multivariate time-series is one of the most important tasks in representation learning by dimensional reduction. In this work, we propose the anomaly detector is a thresholding marginal log-likelihood with Bayesian nonparametric hidden Markov model, which leads to tackle the anomaly monitoring problem in a more natural way to meet the need of real-world applications.
As we discussed above, the sHDP-VAR-HMM interpreted each observation y t by assigning to a single hidden state z t , where the hidden state value is derived from a countably infinite set of consecutive integers z t = k ∈ {1, 2, 3, . . . , K }. Figure 4.10 shows the low-dimensional hidden state space of the multi-modal time series generated by the robot during manipulation, which contains 5 hidden states.
We denote Θ s to represent all the parameters of the trained sHDP-VAR-HMM from nominal demonstrations, including the hyperparameters for learning the posterior and the parameters for the observation model definition. Our anomaly detector performs anomaly monitoring by comparing the marginal log-likelihood log p(y t |Θ s ) based on current executing skill of the observation y t with a learned threshold ρ(z), that derives from the most likely estimated hidden state given current observations. Here, the z would be included variable values, that is, we should synchronously update the threshold in the same skill period. Let's say the definition of an anomaly data point at any time during the manipulation is one that logp(y t |Θ s ) < ρ(z), otherwise it deems the manipulation to be nominal. The proposed method generates the mapping pairs of marginal log-likelihood and the most likely hidden state at any time. As specific hidden state z, we calculated the fixed log-likelihood threshold from testing the nominal demonstrations by subtracting half of range between maximum and minimum from the minimum. Unlike the anomalous log-likelihood was deviate by a certain standard deviation from the mean in [11,12,21], that is ρ(z) = μ − cσ , where the symbol c is a constant value for adjusting the sensitivity of the detector, μ and σ are the estimated mean and standard deviation of the log-likelihood. Our implementation don't need to take the constant c into consideration for each skill or hidden state.

• (B) Joint Probability of the Observed Sequence
The threshold ρ(z) is associated with the joint probability of the observed sequence. Firstly, in order to derive the forward-backward procedure of first-order autoregressive HMM for the joint state sequence z 1:T given observation y 1:T , plus the initial observation y 0 , (T ≥ 1). We have this problem when dealing with conventional HMM, here the auto-regressive structure of ARHMM makes this situation more complicated, a modified forward-backward method could solve the problem efficiently. Let's first write the joint distribution with the Markov property p(y t |y t−1 , . . . , y 1 ) = p(y t |y t−1 ), that is p(y 0:T ) =  Where θ z t represents the observation parameters for the trained HMM and the z 1:T is a state path over the hidden state space. As such, if the state path is estimated, the maximum probability of the observation sequence can be obtained. However, since the true state path is hidden from the observations. For the computational convenience, the general approach would be to use the maximum likelihood state at any given moment, but this would neglect uncertainty. Instead, we use a marginal probabilistic representation over hidden states at each time step (see details in hidden state estimation of auto-regressive model section). Thus, the log-probability at time t is derived by computing the logarithm of the sum of exponentials over all hidden states where α t (k) = p(z t = k|y 0:t ), β t (k) = p(y t+1:T |z t ), are presented the forward message passing and backward message in the standard forward-backward algorithm, respectively.
α t (k) = 1 p(y t |y 0:t−1 ) p(y t |z t = k) p(z t = k|y 1:t−1 ), (4.38) The denominator represents the probability of a sequence observations, which can be calculated by Eq. (4.36). Assume that we recorded N sequential data for each skill and the length of sequence n ∈ N is T n . Thus, we can get a set of log-probabilities vector of each skill The above implies that the forward-backward procedure can be used to find the most likely state for any moment t. It cannot, however, be used to find the most likely sequence of states.

• (C) Hidden State Estimation of Auto-regressive Model
For offline constructing the state and log-likelihood pairs, we are interested in the most likely state sequence upon an observation sequence y 0:T . One might be derived from copulating the MAP sequence by maximizing the marginal probability at each observation iẑ i = arg max z i p(z i |y 0:T ). where, p(z 1:T , y 0:T ) was computed in Eq. (4.36). Actually, we note the most likely state sequence that minimum the cost path to z t is equivalent to the minimum cost path to node z t−1 plus the cost of a transition from z t−1 to z t , and the cost incurred by observation y t . For convenience, we abbreviate the joint log-likelihood function L (z 1:T ) log p(z 1:T , y 0:T ), and the transition probability π z t−1 z t−1 , then where s 1 (z 1 ) = log (π z 1 p(y 1 |θ z 1 , y 0 )), t = 1 s t (z t , z t−1 ) = log ( p(z t |z t−1 ) p(y t |θ z t , y t−1 )), t ≥ 2 (4.43) Obviously, L t (z 1:t ) = L t−1 (z 1:t−1 ) + s t (z t , z t−1 ), and set From the above, The maximizing state sequence is iteratively obtained, for n ∈ {T − 1, T − 2, . . . , 1}ẑ Assume that we recorded N sequential data for each skill and the length of sequence n ∈ N is T n . Thus, we can get a set of the most probable states vector of each skill (4.46)

• (D) Threshold Calculation
Having computed the joint probability of the observations and the most probable states for each skill s. we can concatenate them to compute the threshold of specific state z = k, that is ρ(z k ). Each state consists of its maximum L max z k and minimum L min z k . We cluster the log-likelihoods with the same state, which represents the clustering data by HMM belong to the same statistical distribution.
where the notation 1(η) to represent an indicator random variable that is 1 if the event η occurs and 0, otherwise. Using Eq. (4.47), we can set the anomaly monitoring threshold of each active state z k at any moment according to: In testing implementation, we preprocessed the incoming multimodal data in the same scheme as the training process and then get the most possible state z test by recursively compute the filtered marginals and the corresponding log-likelihood L test using Eqs. Here, this approach detects an anomaly when the log-likelihood is lower than the state dependent threshold ρ(z test ) This approach also extends to online detection. Given the trained sHDP-VAR-HMM and streaming sensor data, the detector can recursively compute the z t and L t at time t, and then perform the comparison, if L t < ρ(z test ), then anomaly, else nominal.

• Finite State Machine Based Kitting Experiment
In this section, we describe the kitting experimental setup, external disturbances, data collection, and anomaly monitoring and diagnoses model parameters.
A kitting experiment is setup, where a robot and a human co-worker closely collaborate to place a set of goods in a packaging box. Specifically, a human coworker is tasked to place a set of 6 objects on the robot's "collection bin" (located in front of the robot) in a one-at-a-time fashion. The objects may accumulate in a queue in front of the robot. As soon as the first object is placed on the table, the robot identifies the object and places it in a packaging box located to the right of the robot. Thus, the robot picks an object and transports it towards the box, after which, the robot appropriately places each of the six objects in different parts of the box. The whole implementation is shown in Fig. 4.11.
The  Objects that need to be packaged are placed by a human collaborator before the robot in a collection bin. The shared workspace affords possibilities for accidental contact and unexpected alteration of the environment. The robot is tasked to pick each one of the objects and place them in the packaging box to its right. The visualization module uses the ALVAR tags to provide a consistent global pose with respect to the base of the robot. Ten nominal executions and one snapshot of each skill are shown modeled by the dynamic movement primitives and the ROS-SMACH. 1 One-shot kinesthetic demonstrations were used to train DMP models for each of the skills of the kitting experiment. Note that a skill's starting and ending pose can be adapted if necessary thus providing a flexible and generalizable skill encoding procedure. The trajectory adaptations however increase the difficulty of assessing the sensory data for both anomaly monitoring and diagnoses-due to the variability in the data collection even from nominal executions.

• The Robot, Ojects, and Visual System
A Baxter humanoid robot's right arm is used to pick commonplace objects set before him. The robot is equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard electric pinching fingers, as well as Baxter's left hand camera. Each finger is further equipped with a multimodal tactile sensor composed of: (i) a 4 × 7 taxel matrix that yields absolute pressure values, (ii) a dynamic sensor which provides a single capacitive reading in millivolts (mV) useful to detect tactile events, and (iii) an IMU and gyroscope [34]. Baxter's left hand camera is placed flexibly in a region that can capture objects in the collection bin with a resolution of 1280 × 800 at 1 fps (we are optimizing pose accuracy and lower computational complexity in the system). The use of the left hand camera facilitated calibration and object tracking accuracy.
A set of 6 common household objects consisting of box-like shapes and bottles were used in our work. The objects ranged in weight from 0.0308 to 1.055 kg and in volume from 3.2 × 10 −04 m 3 to 1 × 10 −03 m 3 . The object's surfaces also varied slightly: some heavier objects had sleeker surfaces to incite object slips-we believe not an unreasonable determination as warehouses contain a wide variety of objectswhilst other objects had rougher surfaces. Across executions, object locations and order was varied to promote generalization.
Alvar tags, 2 with 0.06 m sides, were placed around the circumference of the objects for robust visual recognition (ALVAR can handle change in lighting conditions, optical flow-based tracking, and good performance for multitag scenarios) regardless of orientation.

• External Disturbances
External disturbances may be introduced into a system for a variety of reasons. We list possible scenarios in which disturbances may occur in real-systems, as shown in Fig. 4.12: (1) For collaborative jobs (such as kitting in warehouses, see the description in experiment setup section), humans may experience monotony leading to boredom and loss-of-attention. In such cases, it is possible for a human co-worker to accidentally collide with the robot manipulator or alter the environment in unexpected ways. (2) The user may also accidentally collide or unintentional move packaging objects in ways a robot may not anticipate. Such variations in the world may lead to tool-collisions. (3) Picked objects may slip from a robot's gripper if the grasp is not optimal; or if upon motion, inertial forces acting on the object cause slippage that breaks the grasp. (4) Object slips may also be caused due to human collisions. Such phenomena depicts a chain of anomaly reactions where one anomaly leads to another. (5) The robot may also collide with packaging box during kitting. (6) Missed-grasps (where the robot pinches air) are possible when the target object moved without the robot's notice. (7) Additionally, it is possible to suffer from false-positives from the anomaly detector. False-positives may occur for numerous reasons: system errors, unreachable objects, unfeasible inverse kinematic solutions, unidentifiable objects from the visual system, to name a few.
The seven types of anomalies we just described will be declared more succinctly in the rest of the paper as: Human-Collisions (HC), Tool-Collisions (TC), Object Slips (OS), Human-Collision-with-Object (HCO), Wall-Collision (WC), No-Object (NO), and Other (OTHER) respectively. In Sect. 4.6.1, the introspection methodology used to model robot skills, continued by a presentation of anomaly monitoring in Sect. 4.6.2 and anomaly diagnoses in Chap. 5.

• Dataset Collection and Sensory Preprocessing
It is undeniable that bias exists on the side of robot researchers as to which anomalies may exist in the task. In this work, we first attempt to discover likely anomalies in this task by emulating a collaborative kitting task in which the human collaborator experiences tedium and monotony resulting in undesired events that lead to anomalous events. We tasked 5 participants-whose ages range from 23 to 28-to act as a collaborator in the kitting task under the monotonous conditions mentioned in experiment setup section. One expert user and four novice users participated in the experiments. Novice users learned from the expert to induce anomalies by observing a round of executions. Each participant performed 1 nominal and 6 anomalous executions by placing the set of six household objects in a one-by-one fashion. For training, when an anomaly occurs, the robot execution is halted and the data stream is labelled with an anomaly class (namely the labels with HC, TC, OS, NO, etc.) for supervised classification during testing. Subsequently, anomalous time-stamps are also extracted. In total, we collected a data set with 18 nominal executions and 180 anomalous executions (each anomalous trial has at least one anomaly across the entire task), and 38 failure trails are ignored.
As for our multimodal observation vector, it is comprised of 6 DoF force-torque signals, 6 DoF Cartesian velocity signals (from Baxter's right end-effector), and 56 DoF tactile signals from both the left and right tactile sensors. Empirical feature extraction is performed on this observation vector to improve diagnoses performance. We now describe how features for each modality at time t were engineered. Wrench modality: Assume the force and torque sequence is a time-series vector and that each element represents the magnitude of each dimension ( f x , f y , f z , t x , t y , t z ). We wish to extract structural information from the wrench instead of simply using raw values. In this way, we can find signal patterns that may occur across the different DoFs. To this end, we computed the norm of both the force n f and the torque n t as features respectively: Velocity modality: Similarly, we take the Cartesian linear (l x , l y , l z ) and angular (a x , a y , a z ) velocities and compute their norm n l and n a respectively: Tactile modality: Due to the computational cost of processing the tactile sensor's high dimensionality, we empirically tested a number of features for each tactile sensor, they include: the maximum taxel value, the largest five taxel values, the mean of all taxel values, and the standard deviation for all taxel values. It was the standard deviation, which proved to be the most useful feature for anomaly monitoring and diagnoses. The standard deviation for each tactile sensor s l and s r are defined as: where μ l = 1 28 28 i=1 l i and μ r = 1 28 28 i=1 r i are the mean of each tactile sensor respectively. Equations (4.50)-(4.52) facilitate the identification of diverging signals during anomalous situations. We empirically concatenate all the valuable features for representing the robot executions both in nominal and anomalous cases. If raw signals are used directly, they result in high false-positive detection rates (FPR) in our experimentation. For instance, when a robot carries objects of varying weights, the F/T signals for such operations vary significantly making it difficult to use raw values for detection. Hence, a standardization method is used to scale the extracted features through a normalization constant. We ultimately end with an feature vector of length 18 as shown below: , l x , l y , l z , a x , a y , a z , n l 0.4 , n a , s l 250 , s r 250 .
(4.53) Note, that all training and testing data are pre-processed in the same manner. We will also use the same pre-processing approach when comparing with other baseline approaches.

• Basic Parameter Settings of Each Model
For anomaly monitoring and diagnoses using the sHDP-VAR-HMM with a first-order autoregressive Gaussian likelihood, each state k has two parameters to be defined: the regression coefficient matrix A k and the precision matrix Λ k . The conjugate prior is the MNIW for which four parameters ν, Δ, V, M are assigned values in a fashion similar to the Motion-Capture-Dataset in [28]. In order to guarantee the prior has a valid mean, the degrees-of-freedom variable is set with ν = d + 2 and we set Δ by constraining the mean or expectation of the covariance matrix E[Λ −1 k ] under the Wishart prior in Eq. (4.30).
Assume that we record N sequential data for each skill and the length of sequence n ∈ N is T n . Thus, we can easily define the parameter Δ accordingly as Specifically, we placed a nominal prior on the mean parameter with mean equal to the empirical mean and expected covariance equal to a scalar s F times the empirical covariance, here s F = 1.0. This setting is motivated by the fact that the covariance is computed from polling all of the data and it tends to overestimate mode-specific covariances. A value slightly less than or equal to 1 of the constant in the scale matrix mitigates the overestimation. Also, setting the prior from the data can move the distribution mass to reasonable parameter space values. The mean matrix M and V are set such that the mass of the Matrix-Normal distribution is centered around stable dynamic matrices while allowing variability in the matrix values (see Sect. 4.6.1 for details).
where 0 d and I d are the matrix of all zeros and the identity matrix, respectively, each of size d × d. For the concentration parameters, a Gamma(a, b) prior is set on HDP concentration parameters α. A Beta(c, d) prior is set on the self-transition parameter μ and the degree of self-transition bias κ is set to 50. We choose a weekly informative setting and choose: a = 0.5, b = 5, c = 1, d = 5. Specifically, the initial transition proportion parameter is set μ ∼ Beta (1,10). The Split-Merge Monte Carlo method sampling parameter maximum iterations is set to 1000. The truncation states number is set to K = 5 for anomaly monitoring, and K = 10 for anomaly diagnoses. Then, the HDPHMM model of each anomaly class or nominal skill is trained. Note, that as with pre-processing, we also use the same model parameters in training and testing as well as in other baseline approaches. •

Results and Analysis of Anomaly Monitoring in Kitting Experiment
According to Sect. 4.6.2, the anomaly detector implementation consists of representing multimodal observations through latent states, concatenating the derived loglikelihood values of a latent state, and calculating the anomaly monitoring threshold. We first generate the latent state partitions for nominal demonstrations with varying speeds and goals. Notice how the non-parametric method yields varying dynamics across executions reflecting statistical variations in the robot motion trajectory due to changes in the desired trajectory completion time and varying goal distances.
Then, latent state log-likelihood values are concatenated and the corresponding thresholds computed. To this end, we visualized the performance of the proposed method as well as a baseline by assessing the anomaly time reaction of anomaly flags; that is, which algorithm can make a correct detection at a time closer to the ground truth. For training, ten nominal executions were used along with 3-fold cross validation to obtain nominal sHDP-VAR-HMM models for each skill. The other 20 nominal executions were used for testing.
The execution varying threshold for anomaly monitoring was set according to Eq. (4.48). The results for skill 3 are visualized in Fig. 4.13. As seen in Fig. 4.13, latent state zhat = 2 is the most prominent latent state (the majority of the observations in the skill are ascribed to this state). We now analyze the time sensitivity of our system in detecting anomalies. Anomaly identification sensitivity is intimately connected to threshold setting. We compare the results of our execution varying threshold with a fixed threshold presented in [35]. The comparison is seen in Fig. 4.14. Under nominal situations both methods have the same robustness in avoiding falsepositives. In anomalous situations, there are multiple occasions in which our varying  Fig. 4.14, we see that for executions 2 and 4, our varying threshold detector identified anomalies that the fixed threshold method could not. Notice that the distance between the anomaly_t_by_human and the anomaly_t_by_detector is larger for the fixed threshold (recall from the dataset collection section that ground truth time stamps were obtained during training). This improved sensitivity reduces false-negatives and helps us flag anomalies at times closer to the ground truth.
Finally, Table 4.5 shows the anomaly monitoring performance for our hiddenstate detector (HSD) and the baseline gradient-based detector (GD) for the six skills used in the kitting task for 368 nominal executions and 136 anomalous executions.
Our results indicate that while the GD system had the HSD resulted in higher 16.6% better precision, our system had 76.0% better recall. The difference is significant in reducing false-negatives and significantly raising the sensitivity of our system. The sensitivity improvement is almost four times bigger than the difference in precision. For this reason, our system reflects an F1-score that is 25.0% larger and an accuracy that is F1-score, and an accuracy that is 5.7% higher standing at 91.0%. Fig. 4.14 The comparison of two anomaly detectors for five robot anomalous executions. Each row represents the same testing execution. Left column: results of our hidden state-based anomaly detector. Right column: results of the gradient-based anomaly detector [21]. The ground truth (anomaly_t_by_human) are represented by green dashed lines and the detected anomalies are represented by yellow solid lines

Discussion
Comprehensive experimental results showed robust, sensitive, and better timed anomaly monitoring for a wide range of anomalous situations in an unstructured co-bot scenario where a human and a robot collaborated to complete kitting tasks. The improved anomaly detector exploits the mapping relationship between latent states and the marginal log-likelihood values from the sHDP-VAR-HMM for monitoring anomalies. Our evaluations showed that we could detect anomalies reliably (overall accuracy of 91.0%). With regards to anomaly monitoring, Table 4.5 indicated that our proposed detector achieved unexpectedly superior recall measurements compared to the baseline (the gradient-based method), an improvement of 61.7%. By adjusting the anomaly monitoring threshold during the task's progress execution task as a function of the maximum and minimum of the log-likelihood of observations for latent states, we gained enhanced sensitivity. Human-centric safety approaches dictate that increased sensitivity is highly desirable in human-robot collaboration tasks so as to provide higher safety guarantees for a human. Robotic workcells on the other hand may work better with approaches like the gradient-based detector that was less sensitive but had higher precision. Here the reduction in the false-positive rate may support longer-term autonomy in a workcell. The anomaly detector exhibited not only a better accuracy but also more accurately timed anomaly flags. That is, our anomaly flags occurred closer to the ground truth time. This would be useful. Timeliness may prove a critical role in cases where anomalies are triggered as a chain reaction. That is anomalies that cause more anomalies. By acquiring more timely detection, we may be able to also provide more timely recovery techniques or even prevention so as to minimize the possibility of a cascade of anomalies.

Summary
In this chapter, there are three kinds of anomaly monitoring methods in various threshold are presented based on the Bayesian nonparametric HMMs, namely: loglikelihood-based threshold, threshold based on the gradient of log-likelihood, and computing the threshold by mapping latent state to log-likelihood. All proposed methods and implementation processes were validated on the experimental platforms of "HIRO-NX robot performs electronic component assembly tasks" and "Baxter robot performs object packing tasks" using the ROC curve or accuracy-precisionrecall rate-F1 score as a performance evaluation indicator. The results are as follows: (1) The anomaly monitoring method using log-likelihood-based threshold can effectively realize the online monitoring, but there are a large number of false pos-itive (false triggering) and has an uncertain constant coefficient c. For the task of HIRO-NX robot performing the electronic components assembly, sHDP-VAR(2)-HMM obtains the optimal performance where the constant value of the threshold is c = 3 and the observation model is the second-order linear Gaussian regression model, resulting in a false positive rate of about 5.0% and a true positive rate of about 95.0%. (2) The method of threshold based on the gradient of log-likelihood can effectively solve the problem of false triggering, but resulting in low sensitivity and monitoring success rate with a fixed lower threshold. For the task of Baxter robot to performing object kitting experiment, we obtain the average monitoring accuracy of a total of 67 anomaly events is 93.8%, far better than the 59.2% of the log-likelihood-based threshold. (3) The anomaly monitoring method based on computing the threshold by mapping latent state to log-likelihood uses the dynamic threshold to solve the problem of false triggering, low sensitivity and success rate, but the dynamic threshold depends on the inference of the belief state, there will be a rapid transition phenomenon, resulting in a higher risk of false positive rate. For the task of Baxter robot to performing object kitting experiment, resulting in the average monitoring accuracy of a total of 136 abnormal events in the is 91.0%, which is much better than the 86.0% of the threshold based on the gradient of loglikelihood.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.