4.1 Introduction

Anomalies, also known as outliers, are observations that deviate too much from other observations to cause suspicion that they are caused by different mechanisms. Anomaly monitoring is a widely studied problem in machine learning, and has important significance in the fields of network intrusion monitoring, credit card fraud monitoring, and medical health monitoring. In robot manipulation, abnormality refers to abnormal behaviors, events, or situations. Robotic abnormality indicates that different from normal execution or observed data that is different from normal, especially in the dynamic environment of human-computer interaction, abnormalities Usually stems from changes in the environment and human beings. Assuming that the robot can monitor, classify, and repair these anomalies, it will reduce or prevent the robot from being potentially injured and improve the safety of human-robot collaboration. Due to the diversity and complexity of anomaly types, it is not possible to directly enumerate and model all anomalies, and to implement anomaly monitoring through supervised learning. In general, research on robot abnormality monitoring is often implemented using unsupervised learning methods. In other words, it is abnormal to identify abnormal phenomena from the normal behavior model of the robot.

4.2 Related Work

The premise of performing anomaly monitoring is to endow robots with longer-term autonomy. To this end, we would like to discriminate between nominal and anomalous behavior while the robot executes a task. Researchers usually use two different methods to solve the problem of robot multimodal anomaly monitoring [1, 2]. On the one hand, a probability statistical model is established for each modal signal such as Rodriguez et al. [3] classify the success and failure of the robot in the assembly task by extracting the magnitude of the single-modal output force signal [4]. Ando et al. [5] used the K-Nearest Neighbor (KNN) algorithm to independently measure the differences between modalities to monitor the abnormal behavior of a mobile robot. Pastor et al. [6] used multi-modal sensing signals to monitor the failure of the robot to flip the box with chopsticks. This method first uses DMP to describe the motion of each stage, and then models each modality independently and combines normal statistical features (e.g. mean, variance, etc.) of the data for establishing ASM of each movement primitive to realize abnormality monitoring. The occurrence of abnormality is defined as if the signal value at least five times within three consecutive seconds is lower than a given threshold. Chu et al. [7] proposed an anomaly monitoring approach by training multiple HMM models for each modality given the observed values, and then the log-likelihood of each HMM was calculated as the feature vector for support vector machine model SVM (Support Vector Machine, SVM).

On the other hand, a probabilistic statistical model is established for all modality signals. Clifton et al. [8] used Gaussian Mixture Model (GMM) to model multi-modal signals, and realized anomaly monitoring through extreme value theory. Qiao et al. [9] proposed the use of a sequence of recessive states in the HMM model to monitor abnormal instructions. Zhang et al. [10] proposed a hierarchical HMM (Hierarchical HMM, HHMM) for system intrusion anomaly monitoring under the premise of the HMM method. This method uses HHMM to model normal data and is reflected by anomalies in hidden state sequences. Out the system intrusion event. Park et al. [11,12,13] carried out an online anomaly monitoring method based on multi-modal sensory signals fusion during robot manipulation, where a parametric HMM (Left-to-right HMM) models a multi-modal time series of a robot performing normal motions, and proposes abnormal thresholds that change by the execution progress to achieve abnormal monitoring. We extend the HMM for anomaly monitoring in the subsequent part of this chapter that inspired by the HMM cases. In addition, the proposed method uses a variety of complementary modal information to detect a variety of potential anomalies in the human-robot interaction process on the system where the robot feeds the disabled, such as human abnormal contact and environmental collision.

Hu et al. [14] proposed an abnormal event monitoring method based on HDP-HMM. This method combines the Fisher Kernel into the OCSVM (One-Class Support Vector Machine) model, which has the advantages of generating models and discriminative models. Dilello et al. [15, 16] first proposed the application of the non-parameterized Bayesian time series model (sHDP-HMM) for abnormal monitoring and fault diagnoses during robot operation, using the model to transmit force/torque of the robot to perform normal alignment tasks [17, 18]. Sensing data is used for modeling, and the threshold value for abnormality monitoring is calculated from the validation data set. Then, the log-likelihood function value of the online observation data is calculated based on the learned model. Finally, the robot online is compared with the abnormality threshold value for anomaly monitoring.

In this chapter, we define an anomaly whenever a data stream is structurally different from prior learned models [19]. As Pimentel et al. review literature shows, many anomaly monitoring approaches have been proposed in the last decade [1]. However, few works focused on multivariate time series with spatial and temporal dependencies as in our case. In Milacski et al. [20], the method requires the full data sequence for processing, thus precluding the online detection capability. Besides our previous works [21, 22], others have used non-parametric Bayesian approaches to learn HMMs in contact tasks for abnormal detection and recognition. In [16, 23], DiLello et al. used the sticky hierarchical Dirichlet process prior to learn the model parameters of an HMM based on wrench signatures for an industrial robot alignment task. The work learned specific failure modes online. The approach is robust as it is models not just individual contact events but rather the evolution of signals, it does so while incorporating domain knowledge, and adjusting its model complexity according to the data patterns presented. Nonetheless, the approach is limited by the assumptions of the HMM in which observations are conditionally independent given the state. This is often an insufficient condition to capture the temporal dependencies of complex dynamical behaviors. In [24], Niekum et al. used a beta-process prior on the vector auto-regressive (BP-AR-HMM) to extract the state complexity from the data. The BP-AR-HMM has the ability to discover and model primitives from multiply related time-series. Robot skill discrimination was reliable and consistent; however, this work only modeled robot joint angles in a primarily non-contact oriented task. In [25], Hu et al. implemented anomaly monitoring based on HDP-HMM models and analyzed the efficiency and effectiveness of beam sampling in the HDP-HMM model. This implementation however suffers of low computational efficiency. Instead, we explore efficient online inference algorithms instead of the conventional sampling algorithms.

4.3 Problem Statement

Anomaly monitoring is used to continuously monitor the motion status of the robot to identify unexpected events during execution. Based on the recognition of robot motion in Chap. 3, this chapter describes in detail the abnormal event monitoring method for robots based on non-parametric Bayesian models. The proposed method is also applicable to the subsequent anomaly diagnoses. The main implementation ideas are: first, modeling the normally executed multi-modal data; then, test the normal movement behavior data based on this model. In order to obtain the definition of the boundary between normal and abnormal, that is, the calculation of the abnormal threshold; finally, monitoring the testing samples based on the model and the corresponding threshold. In order to better study the theory and methods of multivariate time series anomaly monitoring, this chapter proposes the non-parametric Bayesian models HDP-HMM and sHDP-HMM in different observation models based on the parametric HMM model. An anomaly monitor implemented under the combination of modal information and different anomaly threshold methods. The anomaly decision of the proposed anomaly monitor on robot motion behavior includes: normal, single anomaly, and multiple anomalies.

An ideal online anomaly monitoring system often requires the use of multimodal fusion methods to respond to different types of anomalous events, such as force/torque sensing information that can directly detect robotic collision anomalies and tactile sensing information that can directly sense objects. And the ability to respond to abnormal event occurrences in a timely manner (high response rate) and low false touch rate. In order to meet such requirements, this chapter addresses the problem of multi-modal fusion robot anomaly monitoring, and proposes the performance of anomaly monitors with three different probability models, different sensor information fusion, and different anomaly thresholds. The overview framework proposed after skill identification in Chap. 3 as shown in Fig. 4.1.

Fig. 4.1
figure 1

Illustration of the anomaly monitoring procedure after skill identification

4.4 Log-Likelihood for Anomaly Monitoring

4.4.1 Problem Formulation

The robot introspection model uses non-parametric Bayesian priors along with a Hidden Markov model (HMM) and either Gaussian or Autoregressive emission models. First, HMMs are a doubly stochastic and generative process used to make inference on temporal data. The underlying stochastic process contain a finite and fixed number of latent states or modes \(z_t\) which generate observations \(Y=\{ y_1,\ldots ,y_t \}\) through mode-specific emission distributions \(b(y_t)\). These modes are not directly observable and represents sub-skills or actions in a given node of a task. Transition distributions, encoded in transition matrix \(A_{ji}\), control the probability of transitioning across modes over time. The model assumes conditionally independent observations given the generating latent state. Given a set of observations, the Baum–Welch algorithm is used to infer model parameters \(\varPi =(A,b)\). The fixed-modes assumption in HMMs limits the model’s expressive power as it is unable to derive natural groupings. Bayesian nonparametric priors extend HMM models to learn latent complexity from data [26, 27]. We use Fox’s et al. sticky-Hierarchical Dirichlet Process (sHDP) prior with an auto-regressive switching system [26] to model nominal skills as in our previous work [21] and derive more robust failure identification methods in manipulation tasks, specially in recovery scenarios.

Bayesian statistics are combined with the sHDP prior to both learn model complexity k from the data but also to model the transition distribution of an HMM. The sticky parameter in the prior discourages fast-mode-switching otherwise present. Consider a set of training exemplars \(_t=\{ x_1,\ldots , x_T\}\) of observed multi-modal data \(\tau \) consisting of Cartesian pose and wrench values. Then, a mode-dependent matrix of regression coefficients \(\mathbf {A}^{(k)}=[A_1^{(k)} \ldots A_r^{(k)}] \in \mathbb {R}^{d\times (d*r)}]\) with an rth autoregressive order and d dimensions is used along with a measurement noise \(\varSigma \), with a symmetric positive-definite covariance matrix. The generative model for the sHDP-AR-HMM is summarized as:

$$\begin{aligned} G_0=\sum _{k-1}^\infty \beta _k \delta _{\theta _k}&\beta | \gamma \sim GEM(\gamma ).&\nonumber \\&\theta _k | G_0 \sim G_0.&\\ G_j = \sum _{k-1}^\infty \pi _{jk} \delta _{\theta _k}&\pi _j | \alpha ,\beta \sim DP(\alpha ,\beta ).&\nonumber \end{aligned}$$
(4.1)

The probability measure \(G_j\), which models the transition distribution of the modes \(\pi _j\) determines the weights (probabilities) of transitioning between modes \(\delta _{\theta _k}\). To avoid fixing the mode complexity, \(G_j\) uses a prior \(G_0\) that is unbounded and can grow with the complexity of the data. While \(G_j\) uses the same set of modes as \(G_0\), \(G_j\) introduces variations over those points. \(G_0\) provides support for a possibly infinite space, but due to the Dirichlet’s process properties (the Chinese Restaurant Process), a finite set of modes is selected. In fact, we can understand the hierarchical specification as \(G_j=DP(\alpha ,G_0)\).

Different observation models can be included into the HMM. A Gaussian distribution with different covariance models (full, diagonal, and spherical) are considered. For Gaussian models, mode specific means and standard deviations are used \(\theta _{z_t}=\mathscr {N}(\mu ,\sigma ^2)\). Additionally, the sHDP-HMM can be used to learn VAR processes, which can model complex phenomena. The transition distribution is defined as in the sHDP-HMM case, however, instead of independent observations, each mode now has conditionally linear dynamics, where the observations are a linear combination of the past r mode-dependent observations with additive white noise. A prior on the dynamic parameters \(\{ A^{(k)}, \varSigma ^{(k)} \}\) is necessary. A conjugate matrix-normal inverse Wishart (MNIW) was used to this end. The generative process for the resulting HDP-AR-HMM is then found in Eq. (4.2).

$$\begin{aligned} \text{ Observation } \text{ Dynamics: } y_t=\sum _{i=1}^r A_i^{(z_t)} y_{t-i} + e_t(z_t). \nonumber \end{aligned}$$
$$\begin{aligned} e_t \sim \mathscr {N}(0,\varSigma ). \end{aligned}$$
(4.2)
$$\begin{aligned} \text{ Mode } \text{ Dynamics: } z_t^{(i)} \sim \pi _{z_{t-1}^{(i)}}^{(i)}. \nonumber \end{aligned}$$

By using the model over a set of multi-modal exemplar data \(_t\), the sHDP-AR-HMM can discover and model shared behaviors in the data across exemplars. Scalable incremental or “memoized” variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM family of algorithms along with mean values for the model parameters \(\theta \) of a given skill, hence \(\theta _{S_m}=\{\varPi ,\mathbf{A} \}_{S_m}\).

4.4.2 Skill Identification

The robot introspection system simultaneously detects nominal skills and anomaly events. Models are trained for individual skills to capture dynamics from multi-modal observations through vector \(\tau _m\). \(\tau _m\) consists of end-effector pose and wrench values. Scalable incremental or “memoized” variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM along with mean values for the model parameters \(\varPi \) of a given skill s. Hence \(\varPi _s=\{\pi ,\mathbf{A} \}\): the transition matrix and regressor coefficients.

Given S trained models for M robot skills, scoring is used to compute the expected cumulative likelihood of a sequence of observations \(\mathbb {E} \left[ log P(Y | \varPi _s) \right] \) for each trained model \(s \in S\). Given a test trial r, the cumulative log-likelihood is computed for test trial observations conditioned on all available trained skill model parameters \(log P(y_{r_1:r_t} | \varPi )_s^S\) at a rate of 200 Hz (see Fig. 4.2 for an illustration). In Fig. 4.2, note the 4 probabilistic models. Given an indexed position in the graph, an anomaly threshold corresponding to that skill’s expected cumulative log likelihood. The figure also illustrates how at the end of a skill, all data is reset and restarted. The process is repeated when a new skill m is started. Given the position in the graph \(s_c\), we can index the correct log-likelihood \(\mathbb {I}(\varPi _s=s_c)\) and see if its probability density of the test trial given the correct model is greater than the rest:

$$\begin{aligned} log P(y_{r_1:r_t} | \varPi _{correct}) > log P(y_{r_1:r_t} \,\vert \,\varPi _s)v\nonumber \\ \quad \forall s(s \in S \wedge s \ne s_c). \nonumber \end{aligned}$$

If so, the identification is deemed correct, and the time required to achieve the correct diagnoses recorded. At the end of the cross-validation period, a diagnoses accuracy matrix is derived as well as the mean time threshold value (these results were also reported in [21]).

Fig. 4.2
figure 2

Illustration of cumulative log-likelihood (L-) curves for four skills s in a task. As such, 4 probabilistic models \(\varPi _s\) are derived. Given an indexed position in graph \(N_i\), one can generate four L-curves. The curve for the correct model, has a value increasing likelihood, while the other graphs have value decreasing likelihoods. One can also see how at the termination of a skill, all data is reset and restarted. No anomalies are shown in this plot

4.4.3 Anomaly Monitoring

Due to the characteristics of abnormality, diversity, and unpredictability in the robot’s motion behavior, unsupervised learning is the only way to achieve abnormality monitoring. In particular, this section proposes the use of log-likelihood function values based on non-parameterized Bayesian probability statistical models to implement anomaly monitoring. The main reason is that the probability model learned from normal data will behave when testing anomalous values The cumulative log-likelihood function value decreases sharply. Additionally, the curve of the cumulative log-likelihood function value that calculated from normal movement behavior will show an increasing trend, and the curve calculated with abnormal motion behavior will drop significantly when anomaly occurred. Therefore, a method based on the log-likelihood function of the model is proposed to realize the abnormal monitoring of the robot’s motion behavior by setting the abnormal lower limit threshold.

Anomaly monitoring assumes that the cumulative log-likelihood L of a set of nominal skill exemplars \(_{}\) share similar cumulative log-likelihood patterns. If so, the expected cumulative log-likelihood derived in training can be used to implement an anomaly threshold F1. Initially, we consider a likelihood curve generated from training data for a given skill s. Then, for each time step in an indexed skill \(s_c\), the anomaly threshold is set to

$$\begin{aligned} F1_{s_c}=\mu (L)-c*\sigma (L) \end{aligned}$$
(4.3)

where c is a real-valued constant that is multiplied by the standard deviation to change the threshold. Here, we are only interested in the lower (negative) bound. Then, an anomaly is flagged if the cumulative likelihood crosses the threshold at any time: \(\text{ if } log P(y_{r_1:r_t} \,\vert \,\varPi _{correct}) < F1_{s_c} \text{: } \text{ anomaly, } \text{ else } \text{ nominal }\). However, it is found that the threshold obtained by Eq. (4.3) makes the abnormal monitor frequently trigger falsely (False Positive) during actual application, which means that the normal execution condition monitoring becomes abnormal. In addition, the determination of constant variables requires tedious interactive verification. Aiming at this problem, this section refers to the experience of robot motion recognition, and determines the threshold of abnormality by solving the gradient information of each log-likelihood function value vector.

Upon our initial exploration of recovery schemes we noticed that after resetting the cumulative log-likelihood observations in the HMM model, false-positive anomalies were triggered at the beginning of the skill. Further examination revealed that the standard deviation of cumulative log-likelihood graphs during training began with small variances but grew over time as shown in Fig. 4.3. Given that variances are small at the beginning of the task, small variations in observations can trigger failure flags.

A second threshold definition was designed to overcome this situation and used in our work instead of F1. As the difference between L and F1 is minimal at first the new anomaly threshold F2 (for an indexed skill) is focused on computing the derivative of the difference:

$$\begin{aligned} F2_{s_c}=\frac{d\mid L-F1_{s_c} \mid }{dt} \end{aligned}$$
(4.4)

Figure 4.4, illustrates the derivative signal crossing the empirical anomaly threshold as anomalies are triggered by external disturbances.

Fig. 4.3
figure 3

This plot shows how the standard deviation of the cumulative log-likelihoods computed during training grows over time. The standard deviation grows as observations show greater variance due to the accumulation of error

Fig. 4.4
figure 4

Example of how the anomaly identification metric crosses the anomaly threshold as anomalies are triggered by external perturbations during a skill execution

4.4.4 Experiments and Results

Three manipulation experiments are designed to test the robustness of the online decision making system for anomaly recovery. We use accidental human collisions as disturbances for a pick-and-place and open-and-close-drawer tasks, and tool collisions for a pick-and-place task. For each of the three experiments, we test four separate conditions:

  1. (i)

    no anomalies

  2. (ii)

    anomalies without recovery

  3. (iii)

    one anomaly caused at each executed skill and

  4. (iv)

    multiple anomalies caused at each skill.

The pick-and-place task consists of 5 basic nodes (not counting the home node and the recovery node): pre-pick, pick, pre-pick (returns to an offset position), pre-place, and place. The open-and-close-drawer task consists of 5 nodes: pre-grip, grip, pull-to-open, push-to-close, and go-back-to-start. The direction and intensity of the human collisions was random but all executed under the sense that these are accidental contacts as a human user reaches into the workspace of the robot without noticing the robot’s motion. We note that while the tasks are simple, the main focus of the work is placed on the robustness of the recovery systems given different external disturbance scenarios. A dual-armed humanoid robot -Baxter- was used. All code was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with an Intel Xeon processor, 16 GB RAM, and 8 cores.

For motion generation, two techniques were used for the different tasks. For the pick-and-place task, we used Baxter’s internal joint trajectory action server which uses cubic splines for interpolation. The goal target is identified online through image-processing routines. For the open-and-close-a-drawer task, we trained dynamic movement primitives using kinesthetic teaching.

Fig. 4.5
figure 5

Two examples (pick-and-place (UPPER) and open-and-close-drawer) in which a human collaborator accidentally collides the robot. The introspection system identifies an anomaly (see bottom left plots) and triggers a recovery behavior (see the fluorescent node in the graph on the right)

In terms of robot introspection, the sHDP-HMM code with “memoization” variational coordinate ascent, with birth and merge moves was implemented using BNPY [28] and wrapped with ROS. Training used 10-trial batches. Observations used 13 dimensional vectors composed of 7 DoF pose values (position and quaternion) and 6 DoF wrench values. A baseline HMM algorithm was implemented through hmmlearn [28] and wrapped with ROS. The anomaly threshold for each skill was computed through leave-one-out cross-validation.

Figure 4.5 shows a representative image of the Baxter robot attempting a pick operation. A human collaborator accidentally collides with robot before a pick action. Note that collisions were strong enough to move the current pose significantly from the intended path and sometimes collide with other parts of the environment. The robot introspection system identifies an anomaly and triggers a recovery behavior. The lower left part of the image shows the anomaly F2-metric flagging the anomaly. The system then begins recovery as seen in the directed graph on the right (implemented in ROS-SMACH). Video and auxiliary data for the three experiments under the four conditions are available in [21].

For results reporting, two markers are provided: (i) A success recovery rate for the recovery policy and (ii) an F-score, precision, and recall numbers for assessing anomaly identification. In particular, these markers are used under experimental conditions (iii) and (iv) conditions described at the beginning of the experiment and results section. The first two conditions are use as a baseline and compare with the generic recovery behavior success rates.

Table 4.1 Results for accidental human and tool collisions. Recovery strategy success rates are presented by the F-score, and robustness of the robot introspection system for anomaly is shown in the recall and precision settings. We also compare the performance of the more expressive sHDP-HMM model with an HMM that serves as a baseline. Generally, the sHDP-HMM model allowed for better introspection and thus better recovery rates and better recall

(1) Human Collisions

One anomaly per node: 27 test trials were used for the pick-and-place task and 24 test trials were used for the open-and-close drawer task. Given that both of the tasks have 5 nodes, a total of 5 anomalies were induced in the task, 1 per node. Table 4.1, shows the recovery success rate of the task (represented by the F-score) and the robustness of the identification system through the precision-and-recall quantities for both micro and macro settings. The pick-and-place recovery had an average success rate of 85% with a maximum of 88%. The precision was 100% (for both macro/micro) indicating strong resistance to false positives. The recall was \(\sim \)82% (for macro/micro) indicating the existence of some false-negatives. This might indicate that there might have been some collisions that were of lower magnitudes than the ones we might have trained with and were not detected by the system. For the drawer task, the recovery success rate was 91.67%. The precision was  95% (macro/micro) and recall was \(\sim \)84.5%. As with the human collision, weaker contact signals in tests compared to training might be the reason for the presence of false-negatives in our system.

Multiple anomalies per node: Under this scenario the pick-and-place task ran 42 test trials and the drawer task ran 30 test trials. Five anomalies were induced repeatedly one-after-the other for each node. The pick-and-place recovered 85% of the time with a precision of \(\sim \)95% and a recall of \(\sim \)84.5% (micro/macro). The drawer task recovered \(93.3\%\) of the time with a precision of \(\sim \)100%, and a recall of \(\sim \)93% (micro/macro).

(2) Tool Collisions

For the tool collision experiment, we only tested recovery in the pick-and-place task. The results for this scenario were similar to that of human collision. The recovery success rate was \(\sim \)88.89%, the precision 100%, and the recall 88.89%.

4.4.5 Discussion

This work showed the ability of a system to recover from unmodeled and accidental external disturbances that can’t be anticipated. Such disturbances will be more common in shared human-robot workspaces. Our work demonstrated that our recovery strategy in connection with our previous introspection framework recovered 88% of the time from accidental human and tool collisions under single-anomaly and multiple-anomaly scenarios per node. The results indicate the system can recover at any part of the task, even when it is abused and multiple anomalies occur consecutively. From the videos in [21], we can see that even when in cases where the robot is in constant duress, the robot recovers consistently. Such robustness will play a role in enabling robots have increasing levels of long-term autonomy.

We by using a more expressive model to do robot introspection, our recovery ability also increased. We will continue to explore improved models that can better capture spatio-temporal relationships of high-dimensional multi-modal data. As well as looking for representations that scale over time in order to acquire a useful and practical library of skill identification and motion generation.

Yet there is much work to be done. Manual annotations for state-dependency are an important weakness. Crucially we wish to move towards modeling human understanding for decision making in the midst of robot-object-environment interactions. Scalability is an important factor in this domain as the system must scale to ever growing number of learned tasks. Learning how anomalies and recovery decisions are made and re-used across similar scenarios will be important. The manual approach will not scale. Adaptability and not only reverting is also important. Incremental learning is also key. The current work is limited to reverting. By simply revering we don’t model recovery behaviors and also do not learn how to adapt. We also cannot handle new scenarios. We must develop action-confidence metrics that let us learn new scenarios on demand.

4.5 Gradient of Log-Likelihood for Anomaly Monitoring

Based on the implementation in Sect. 4.4, it is difficult to determine the constant parameters c in actual applications, which leads to frequent false positives. To address the problem, this section explores an anomaly monitoring method based on the log-likelihood function gradient value and mathematically verifies its implementation process to prove its feasibility and effectiveness. After identifying the motion behavior s, this method assumes the parameters of the specific motion behavior and its established probability model are known as \(\varTheta \). From Sect. , we can know that the log-likelihood function value for the observation sequence can be determined by the probability in all credible states, and the simplified description is

$$\begin{aligned} \mathscr {L}\left( t \right) = log\;\sum \limits _{i = 1}^N {{\alpha _i}} (t) = log\;\sum \limits _{i = 1}^N e xp\;(log\;{\alpha _i}(t)) \end{aligned}$$
(4.5)

Therefore, given an HMM model \(\varTheta \) and an incoming multimodal time series \(Y_{1:t}\), the natural logarithm of the filtered belief state associated with the forward model for latent state \(i\in \{1,2,\ldots ,N\}\) can be represented according to Eq. (4.5). To compute \(\mathscr {L}_t\), we first compute \(\log {{\alpha _i}(t)}\). According to the forward-algorithm, we derive:

$$\begin{aligned} {\alpha _i}(1)=\pi _i {b_i}(y_1), \nonumber \\ {\alpha _i}(t+1)={b_i}(y_{t+1})\sum _{j=1}^{N}{{\alpha _j}(t)A_{ji}}. \end{aligned}$$
(4.6)

From Eqs. (4.5) and (4.6), we know that \(\log {{\alpha _i}(t)}\) can be computed recursively through \(\log {{\alpha _*}(t-1)}\). Expanding the log in Eq. (4.5) we have:

$$\begin{aligned} \log {{\alpha _i}(t)}= \log {{b_i}(y_{t})}+\log {\sum _{j=1}^{N}{\exp ({\log {{{\alpha _j}(t-1)}+\log {A_{ji}}}})}} \end{aligned}$$
(4.7)

4.5.1 The Hidden Markov Model Forward Gradient

  • Viterbi Path in HMMs

The Viterbi algorithm, expanded in Eq. (4.8), attempts to estimate the most likely state sequence. Viterbi uses dynamic programming to estimate the underlying state sequence \(\hat{z}_{1:t}\) through MAP computations given a sequence of observations \(y_{1:t}\):

$$\begin{aligned} \hat{z}_{1:t}&=\arg \max _{z_{1:t}} P(z_{1:t} \mid y_{1:t}) \nonumber \\&=\begin{aligned}&\arg \max _{z_{t}} ( {b_{z_t}}(y_{t})\\&\quad \arg \max _{z_{1:t-1}}(A_{z_{t-1}z_t} P(z_{1:t-1}, y_{1:t-1})) ) \end{aligned} \nonumber \\&=\begin{aligned}&\arg \max _{z_{t}}({b_{z_t}}(y_{t})\\&\quad \arg \max _{z_{t-1}}(A_{z_{t-1}z_t}{b_{z_{t-1}}}(y_{t-1})\\&\quad \quad \ddots \\&\quad \quad \quad \arg \max _{z_{1}}(A_{z_{1}z_{2}}\pi _{z_1}{b_{z_1}(y_1)})\dots )) \end{aligned} \end{aligned}$$
(4.8)

First, we introduce simplified notation for the Viterbi algorithm, where a node j has a maximal belief state \(\delta _{t}(j)=\max _{z_{1:t}} P(z_{1:t}=j \mid y_{1:t})\) with associated traceback \(\hat{z_t}\).

Theorem 4.1

For an incremental time series Y, a good HMM model outputs an incremental Viterbi path that stably expands on the previous one. The stable expansion of the Viterbi path is as follows: given a Viterbi path “11223” for an input \(Y[1\text{: }t]\), then the path at \(Y[1\text{: }t+1]\) becomes “11223*”, where * is the newly appended hidden state.

Good models are those that predict their data as accurately as possible and can be achieve through two steps: (i) HMM parameters optimization: the Baum–Welch algorithm given a proper initialization or similar algorithm incrementally optimize HMM parameters until a local maximum of likelihood is reached; minimizing the perplexity of the model. (ii) Model selection optimization: this consists in selecting the number of hidden states and values for observation models. Many techniques exist including BIC, MCMC, Variational Bayes, or non-parametric HMMs [29].

Proof

Consider, without loss of generality, a Viterbi graph where we examine two consecutive time-steps \((t-1, t)\) along two possible latent states lk (the analysis generalizes to HMMs with more states). We also assume \(\forall i, i = \arg \max _{j}A_{ij}\); that is, all hidden states states tend to self-transition. Also at time \(t-1\):

$$\begin{aligned} \delta _{t-1}(l) > \delta _{t-1}(k). \end{aligned}$$
(4.9)

We also define the following symbol:

$$\begin{aligned} w_{ji}(t) = A_{ji}*b_{i}(y_t). \end{aligned}$$
(4.10)

Due to our first assumption, we have: \(\max _{j}w_{ji}(t) = w_{ii}(t)\). Then, at time t, the \(\delta \) values are:

$$\begin{aligned} \delta _{t}(l)&= max(\delta _{t-1}(l)*w_{ll}(t), \delta _{t-1}(k)*w_{kl} (t))\end{aligned}$$
(4.11)
$$\begin{aligned} \delta _{t}(k)&= max(\delta _{t-1}(l)*w_{lk}(t), \delta _{t-1}(k)*w_{kk}(t)) \end{aligned}$$
(4.12)

According to (4.9) and our max weight formulation, the max function in (4.11) is: \(\delta _{t}(l) = \delta _{t-1}(l)*w_{ll}(t)\). So, the max state l at time \(t-1\) will contribute to itself instead of k at time t. Therefore, there is only one condition under which the Viterbi sequence breaks:

$$\begin{aligned} \delta _{t}(k)=\delta _{t-1}(k)*w_{kk}(t) \quad \text {and} \quad \delta _{t}(k)>\delta _{t}(l) \end{aligned}$$

In other words, given our original assumption, the Viterbi sequence breaks if the following inequalities are met:

$$\begin{aligned} \delta _{t-1}(k)*w_{kk}(t)&>\delta _{t-1}(l)*w_{lk}(t) \\ \quad&\text {and} \quad \\ \delta _{t-1}(k)*w_{kk}(t)&>\delta _{t-1}(l)*w_{ll}(t) \end{aligned}$$

In ratio form:

$$\begin{aligned} \frac{w_{kk}(t)}{w_{lk}(t)}&> \frac{\delta _{t-1}(l)}{\delta _{t-1}(k)} \end{aligned}$$
(4.13)
$$\begin{aligned} \quad&\text {and} \quad \nonumber \\ \frac{w_{kk}(t)}{w_{ll}(t)}&>\frac{\delta _{t-1}(l)}{\delta _{t-1}(k)} \end{aligned}$$
(4.14)

If an observation is emitted by state l and it is not undergoing a state switch, (\(\delta _{t-1}(l)>\delta _{t-1}(k)\) and \(w_{ll}(t)>w_{kk}(t)\)), inequality (4.14) fails and the Viterbi sequence does not break.

When we transition from state l to k and begin emitting \(w_{kk}(t)>w_{ll}(t)\). However, the momentum in \(\delta _{t}(k)\) and \(\delta _{t}(l)\) prevent their inequality relationship to switch. Nonetheless, after p time steps, the inequality \(\delta _{t-1}(l) > \delta _{t-1}(k)\) becomes \(\delta _{t-1+p}(l) < \delta _{t-1+p}(k)\). The latter is reasonable when state k has been emitting for some time. It is before this time \(t-1+p\) that inequalities (4.13) and (4.14) are met. To see why, one will notice the left hand side inequalities in (4.13) and (4.14) are larger than 1 while the right hand side ratios become smaller than 1 at time \(t-1+p\), thus their inequality relationships must’ve swapped before this time. Note that the order in which inequalities (4.13) and (4.14) are met matters. If (4.14) is met but (4.13) is not, a clean cut occurs since we have:

$$\begin{aligned} \delta _{t-1}(k)*w_{kk}(t)&<\delta _{t-1}(l)*w_{lk}(t)\end{aligned}$$
(4.15)
$$\begin{aligned} \quad&\text {and} \quad \nonumber \\ \delta _{t-1}(k)*w_{kk}(t)&>\delta _{t-1}(l)*w_{ll}(t) \end{aligned}$$
(4.16)

Equation (4.16) asserts a switch from l to k. Equation (4.15) states that \(\delta _{t-1}(l)\) contributes to \(\delta _{t}(k)\), implying the previous max state contributes to the next max state. And since the max state has changed, the roles of l and k swap and inequality (4.14) begins to fail and renders sequence break unattainable. This yields a clean transition cut. If (4.13) and (4.14) are met simultaneously, a sequence break occurs. But since (4.14) is met, the states switch and the roles of l and k swap and preclude a further sequence break.

If, let’s say at time \(t_a\), inequality (4.13) is met but (4.14) is not, a future sequence break is destined to happen at the future moment when (4.14) is met, let’s say at time \(t_b\). When that sequence break occurs, the history between \(t_a\) and \(t_b\) flips and after the sequence breaks the state transitions from l to k. Again, roles switch and no further sequence breaks occur. Above all, we can safely conclude that the during the execution of the stable period of a hidden state, no sequence break will occur. It is only during state transitions that a sequence could break and even if it does, it only last for a single time step–the time step when inequality (4.14) is met. The analysis extends to HMM models with more than 2 hidden states, so long as we apply the analysis to pairs of hidden states composed of the current max state and another non-max state.

Corollary 4.1

Given Theorem 4.1 the gradient of the log-likelihood of the forward algorithm (from now on referred to as the forward gradient) will depend solely on the latest emission probabilities and the transition matrix.

Proof

According to the log-exp-sum trick [29], the approximation \(\log {\sum _{i=1}^{N}{\exp (y_i)}}\approx \max _{i\in \{1, \ldots , N\}}{y_i}\) is best approached for larger values of \(y_i\). Applying this approximation to Eqs. (4.5) and (4.7), which is supported by Theorem 4.1, we have:

$$\begin{aligned} \mathscr {L}_t \approx \max _{i\in \{1, \ldots , N\}} ( {\log {{\alpha _i}(t)}} ) \end{aligned}$$
(4.17)
$$\begin{aligned} \begin{aligned} \log { {\alpha _i}(t) }&\approx \log { {b_i}(y_{t}) } +\\&\quad \max _{j \in \{1, \ldots , N\} } ( {\log { {{\alpha _j}(t-1) } + \log {A_{ji}}}} ) \end{aligned} \end{aligned}$$
(4.18)

Substitute Eq. (4.18) into Eq. (4.17), rename \(i\) and \(j\) to \(z_t\) and \(z_{t-1}\), then recursively decompose \(\log \alpha \), we have:

$$\begin{aligned} \mathscr {L}_t&=\max _{{z_t}\in \{1, \ldots , N\}}({\log {{\alpha _{z_t}}(t)}}) \nonumber \\&=\begin{aligned}&\max _{{z_t}\in \{1, \ldots , N\}}(\log {{b_{z_t}}(y_{t})}+\\&\quad \max _{{z_{t-1}}\in \{1, \ldots , N\}}(\log {A_{{z_{t-1}}{z_t}}}+ {\log {{{\alpha _{z_{t-1}}}(t-1)}}})) \end{aligned} \nonumber \\&=\begin{aligned}&\max _{{z_t}\in \{1, \ldots , N\}}(\log {{b_{z_t}}(y_{t})}+\\&\quad \max _{{z_{t-1}}\in \{1, \ldots , N\}}(\log {A_{{z_{t-1}}{z_t}}}+\log {b_{z_{t-1}}(y_{t-1})}+\\&\quad \quad \ddots \\&\quad \quad \quad \max _{{z_{1}}\in \{1, \ldots , N\}}(\log {A_{{z_{1}}{z_{2}}}}+{\log {{\pi _{z_1}{b_{z_1}(y_1)}}}})\ldots )) \end{aligned} \end{aligned}$$
(4.19)

Equation (4.19) is the log version of Eq. (4.8). This suggests that, after approximations by Eqs. (4.17) and (4.18), the computation of the log-likelihood is the same as the computation of the Viterbi path using the Viterbi algorithm.

Now, since Theorem 4.1 shows that in general the Viterbi path at time \(t\), \(\hat{z}_{1:t}\), expands on the Viterbi path at time \(t-1\), \(\hat{z}_{1:t-1}\), we have:

(4.20)

Then, the forward gradient can be derived from Eq. (4.20) as:

$$\begin{aligned} \nabla \mathscr {L}_t&=\mathscr {L}_t-\mathscr {L}_{t-1} =\log {{b_{\hat{z}_t}}(y_{t})}+\log {A_{{\hat{z}_{t-1}}{\hat{z}_t}}} . \end{aligned}$$
(4.21)

Equation (4.21) supports Corollary 4.1, where the forward gradient depends on the latest emission probability \({b_{\hat{z}_t}}(y_{t})\) and transition probability from hidden state \(\hat{z}_{t-1}\) to \({\hat{z}_t}\). Also, given that good HMM models have strong inertia (high probabilities of self-transitions), state-switching should be sparse and then \(\hat{z}_{t-1}\) will equal to \({\hat{z}_t}\) most of the time.

  • Normal Skill Identification Based on the Forward Gradient

Corollary 4.1 led us to design a new method for skill identification. If we use n HMM models to represent n robot skills, with observations coming from a certain skill, the HMM model corresponding to that skill \(\hat{m}\), should output a value-increasing forward log-likelihood curve that is greater than the rest of the HMM models. This also means model \(\hat{m}\) will output a larger forward gradient value compared to other models.

The forward gradient depends on the latest emission probabilities, which in turn depend on the latest observation. The largest probabilities and thus gradients will belong to the HMM model of a currently executing robot skill. The key insight however is the inverse relationship: the use of the forward gradients to infer the currently executing skill, and then it can validate the strong correlation between the forward gradient and skill observations. This is contrasted with the log-likelihoods of the observations \(log P(Y_{1:t} | \varPi )\) do not. The forward-gradient measure for skill identification is defined as follows: given p skills \(s_1:s_p\), we have HMM models \(m_s\) for \(s \in \{1, \ldots , p\}\) and an input time series Y, then the most probable skill \(\hat{s}\) generating Y[t] is inferred as:

$$\begin{aligned} \hat{s}=\arg \max _{s \in \{ 1, \ldots , n \}} ( \nabla \mathscr {L}_t^{m_s}(Y) ) \end{aligned}$$
(4.22)

where, \(\nabla L_t^{m_s}(Y)\) is the forward gradient output by model \(m_s\) at time t computed using time series Y.

4.5.2 Anomaly Monitoring Based on the Forward Gradient

We now build on the premise established in Eq. (4.22). Furthermore, consider a set of nominal observations for an executing skill, we know that the corresponding skill HMM model will output a value-increasing forward log-likelihood curve, and hence, a stable positive forward gradient. So, when an anomaly occurs, the forward gradient decreases significantly as illustrated in Fig. 4.6. Given that anomalies influence the forward gradient value, we propose a gradient-based metric for HMM anomaly monitoring.

Fig. 4.6
figure 6

For each skill (denoted in different background color), the log-likelihood gradient can be used for skill identification

Consider an HMM model m representing a skill s with n time-series trials \(Y_i\) for \(i \in \{1, \ldots , n\}\) collected from successful executions of skills \(s \in S\). To detect anomalies in a new time series Y we can first derive:

$$\begin{aligned} \nabla \mathscr {L}_{max} = \max _{i\in \{1, \ldots , n\}}(\max _{t\in \{1, \ldots , T_i\}}(\nabla \mathscr {L}^{m}_{t}(Y_i))), \end{aligned}$$
$$\begin{aligned} \nabla \mathscr {L}_{min} = \min _{i\in \{1, \ldots , n\}}(\min _{t\in \{1, \ldots , T_i\}}(\nabla \mathscr {L}^{m}_{t}(Y_i))), \end{aligned}$$
$$\begin{aligned} \nabla \mathscr {L}_{range} = \nabla \mathscr {L}_{max}-\nabla \mathscr {L}_{min}, \end{aligned}$$

where \(T_i\) is the time length of trial \(Y_i\) and \(\nabla L^{m}_{t}(Y_i)\) is the forward gradient output by model m at time t computed using time series \(Y_i\). Then, we use an empirically-derived test to trigger an anomaly for Y:

$$\begin{aligned} \nabla \mathscr {L}^{m}_{t}(Y)<\nabla \mathscr {L}_{min}-\frac{\nabla \mathscr {L}_{range}}{2}. \end{aligned}$$
(4.23)

This test detects if the gradient is an outlier compared with gradients of successful skill executions.

4.5.3 Experiments and Results

As for experimental setup, a dual armed humanoid Baxter robot was used to perform a pick-and-place operation. The robot consisted of a Robotiq force-torque sensor and standard Baxter fingers. 5 nominal trials were used for training the HMM model. In testing, 5 trials were used for skill identification and anomaly monitoring respectively. The pick-and-place task consists of 5 skills:

  1. (i)

    hover over the picking position,

  2. (ii)

    grasp the object,

  3. (iii)

    lift the object,

  4. (iv)

    hover to the placing position, and

  5. (v)

    place the object.

Figure 4.7, shows the experimental setup and the execution of the five skills.

Fig. 4.7
figure 7

The Baxter humanoid robot performing a pick-and-place task. 5 independent skills used to perform the task. Executed skill motions are sketched with red arrows

For training, the observation vector concatenates a 7-dimensional Cartesian end-effector pose and a 6-dimensional wrench. For each skill, we train corresponding HMM models using the Baum–Welch algorithm. The number of hidden states is selected such that emission probabilities are maximized leading to distinct and uniquely grouped hidden states.

For results reporting, we use the three factors identified by Pettersson’s survey on Event Detection [30]. Namely: diagnoses accuracy, robustness (false-positive rate), and reaction time (the time it takes to identify a skill from the beginning of a skill execution). Note that for anomaly identification, internal and external perturbations are used including: unexpected movement of target object, object absence, slippery picks, and human collisions.

  • (A) Skill Identification Performance

Figure 4.8 presents The skill identification confusion matrix. Skills 1 and 3 were recognized with 100% accuracy, 2 and 4 with 99% accuracy, and skill 5 with the largest surface contacts with 94% accuracy. Overall accuracy was 98.4%.

Fig. 4.8
figure 8

Skill identification confusion matrix for pick-and-place

  • (B) Reaction Time Performance

In terms of reaction time, a percentage is computed to assess the time it takes for the identification to execute from the beginning of a skill. The reaction percentage, using t as “true”, is computed as: offset \(=\) predicted-t_beginning, length\(=\)t_end-t_beginning, and reaction\(=\)offset/length. The closer the reaction percentage is to 0% the better the identification method. A negative reaction percentage means the predicted start occurs earlier than ground truth, while a positive percentage implies a delayed identification. We assess two forms to determine the beginning of a skill (i) use the “first skill” occurrence, or (ii) use the “first 10 successive skill” occurrences. The reaction percentage for these two formats is found in Table 4.2. The average reaction time for absolute values (i.e. looking at the average time difference of the prediction, whether early or late) the “first skill” is 2.70% across all skills and the average reaction time for the “first 10 skills” is 0.97%. Between the two measures, we have a total average of 1.84%.

Table 4.2 Reaction time as a duration percentage of a skill
  • (C) Anomaly Monitoring Performance

For anomaly monitoring performance of our gradient-based method, we use two environments: (i) anomaly identification as it occurs and any false-positives before an actual anomaly occurs, and (ii) an external collision is given to the robot to trigger a recovery. Then, when the robot completes its recovery behavior, we count how many false-positives are triggered before moving to the next skill execution. The robot recovery behavior is detailed in [31]. Five nominal and five anomalous trials are used for the analysis. The results are compared with two other baseline methods: the magnitude-based metric from the description in normal skill identification based on the forward gradient, and the derivative-of-difference metric from the definition in anomaly monitoring based on the forward gradient.

The five anomalous trials contain a total of 14 anomalies, consisting of:

  1. (i)

    one anomaly caused by the displacement of the target object

  2. (ii)

    one anomaly caused by no target object

  3. (iii)

    two anomalies caused by slippery picks

  4. (iv)

    five anomalies caused by human collisions to the robot gripper during each skill execution

  5. (v)

    five anomalies caused by human collisions to the robot arm during each skill execution.

  • (D) Pre-recovery Performance

For pre-recovery performance computation, during each trial, we record the anomalies triggered by the testing metric and count its true positives, false positives and false negatives. The results are shown in Table 4.3. The result shows our proposed forward gradient detected all anomalies and triggered no false positives or false negatives. The other two baseline methods suffer from false positives though they deliver high true positives.

Table 4.3 Anomaly monitoring metrics performance comparison
  • (E) Post-recovery Performance

For post-recovery performance metrics, we trigger an intentional anomaly, after recovery is completed, we count any false-positives before next skill execution. Both the forward gradient method and the derivative-of-difference method had not false-positives. Whilst the magnitude-based metric had more than 10 and prevented the system from continuing its task execution.

  • Comparison with Related Works

Comparisons across works is challenging as results use different formats across experiments. Table 4.4 is an effort to harmonize results across related literatures. The comparison should be done loosely as different tasks (small levels of contact vs. large levels of contacts, structured environment vs. unstructured environment) present different challenges to event detection. For skill identification, our current approach ranks 2nd behind the tool breakage work that identified anomalies in structured milling processes. Our work did better than [16, 22], albeit these works modeled more complex dynamical phenomena. Similar statements can be made about anomaly identification. As for reaction times, our approach offers about double the speed-up compared to the only other work that reported this number. In conclusion, based on internal and external evidence, we hold that our measure is the most robust, stable, and fastest measure reported to date.

Table 4.4 Skill and anomaly identification, and reaction time comparison for state-of-the-art event-detection methods

4.5.4 Discussion

This work presented a theoretically derived event detection measure useful for nominal and anomalous behavior identification, even in post-recovery actions. Our results showed very strong performance compared with state-of-the-art results across the board. More experimental validation is certainly necessary: both in number of trials and robotic tasks. This work also remains to be tested in the area of anomaly diagnoses. The latter is concerned not only with the identification problem with the grouping of anomaly types which is more challenging. We anticipate working in conjunction with machine or deep learning models for the diagnoses of this signals. Some works [11, 12, 16, 23] already provide some characterizations.

4.6 Mapping Latent State to Log-Likelihood for Anomaly Monitoring

4.6.1 Hierarchical Dirichlet Process Vector Auto-regressive Hidden Markov Models

HMMs segment sequential data into interpretable hidden states that are assumed Markovian. They are limited by its a priori determination of hidden states (and fixed throughout the modeling process) as well as the HMMs limited ability to model complex dynamics due to the assumption that observations are conditionally independent given the state. Instead, this section uses Bayesian non-parametric techniques that model HMMs through priors that can model an unbounded number of hidden states. In particular, we use HDP priors to learn HMMs with infinite hidden states. We also use the switching vector auto-regressive (HDP-VAR-HMM) to model multimodal observation sequences and its element consists of features extracted from an F/T sensor, joint encoders, and a tactile sensor (for a discussion on the used features see experimental setup). Figure 4.9 illustrates graphical model representations for the sHDP-HMM and the sHDP-VAR-HMM.

Fig. 4.9
figure 9

Graphical representation of HDP-HMM and HDP-VAR-HMM. a The graphical representation of a sticky HDP-HMM over T time steps. The latent, discrete-valued Markov process \(z_t\) captures the temporal dependencies in the observations \(y_t\). b The graphical model for a sticky HDP-VAR (r)-HMM over T time steps. The latent, discrete-valued Markov process \(z_t\) dictates the linear dynamical model at time t. For an order r switching VAR process (\(r = 1\) in this paper), the conditionally linear dynamics are completely determined by previous observations \(y_{t-1}\)

Assume that we record N sequential executions for each skill and expect to learn the common dynamics among them. We utilize the HDP-VAR-HMM to jointly model those sequences, where sequence \(n \in N\) has multidimensional observation data \(y_n = [y_{1}, y_{2}, \ldots , y_{T}]\) at time t. For instance, \(y_{t} \in \mathbb {R}^d\) could be an instant of wrench signatures and end effector velocity, where d is the dimensionality of observed features. The HDP-VAR-HMM interprets each observation \(y_{t}\) by assigning it a single hidden state \(z_{t}\). The hidden state value is derived from: (i) a countably infinite set of consecutive integers \(z_{t} = k \in \{1, 2, 3, \ldots \}\), (ii) an initial probabilistic state distributions \(\pi _0\) generated with Markovian dynamics, and (iii) a transition distribution \(\{\pi \}_{k=1}^{\infty }\). Then, the Markovian structure on the state sequence dictates that for all \(t > 1\):

$$\begin{aligned} \begin{aligned}&z_{t} \sim \pi _{z_{t-1}}, \\&p(z_{1} = k) = \pi _{0}, \\&p(z_{t} = k^{'} | z_{t-1} = k) = \pi _{kk^{'}}. \end{aligned} \end{aligned}$$
(4.24)

Since \(z_t \sim \pi _{z_{t-1}}\), we see that \(z_{t-1}\) indexes all observations \(y_t\) are assigned with \(z_{t-1}\). We can draw the observation \(y_{t}\) given specific hidden state \(z_{t} = k\) from its corresponding observation likelihood functions. In our case, we consider the first-order auto-regressive Gaussian likelihood. As such, observations \(y_{t}\) are considered to be a noisy linear combination of the previous observation plus additive white noise \(\varepsilon \) and can be expressed as:

$$\begin{aligned} y_{t} = A_k y_{t-1} + \varepsilon _{t}(z_t = k), \quad \varepsilon _{t}(k) \sim \mathscr {N}(0, \varSigma _k). \end{aligned}$$
(4.25)

where, each state k has two dynamic parameters \(\{A_k, \varSigma _k\}\): \(A_k\), a \(d \times d\) matrix of regression coefficients that defines the expected value of each successive observation as \(\mathbb {E}[y_{t} | y_{t-1}] = A_ky_{t-1}\), and \(\varSigma _k\), a \(d \times d\) symmetric positive definite matrix that defines the covariance matrix of state k.

Therefore, the Gaussian regression observation model explains a observed data pairs of input y and output z. Each input is a vector of length \(y \in \mathbb {R}^d\), while each output is a scalar. In the paper, we assume that the input data y are fixed dependence on previous observation and focus on a generative model for learning the output data z which depends on y. Various generative models, such as full-covariance Gaussian, diagonal-covariance Gaussian, are possible for the observed input data. These can be directly define the multivariate Gaussian log-likelihood function given specific state \(z_k\),

$$\begin{aligned} \begin{aligned} log\ p(y_{t} | \theta _k, y_{t-1})&= log\ p(y_{t}|A_k,\varSigma _k, y_{t-1}) = log\ \mathscr {N}(y_{t}|A_ky_{t-1},\varSigma _k) \\&= - \frac{d}{2}log(2\pi )-\frac{1}{2}log|\varSigma _k|-\frac{1}{2}(y_{t}-A_ky_{t-1})^T\varSigma _k^{-1}(y_{t}-A_ky_{t-1}). \end{aligned} \end{aligned}$$
(4.26)

However, it is often the case that we don’t know the parameters of this distribution, but instead to estimate them. We need to learn the approximate values \(\theta _k = \{A_k, \varSigma _k\}\) for each state by defining a conjugate prior distribution on it. Particularly, the Matrix Normal Inverse Wishart (MNIW) distribution is conjugate for both the \(A_k\) and \(\varSigma _k\) are uncertain. This distribution assume that when only the covariance \(\varSigma _k\) is uncertain, the conjugate prior is defined as d-dimensional Inverse Wishart (IW) distribution with covariance parameter \(\varDelta \), a symmetric, positive definite scale matrix, and \(\nu \), the degrees of freedom, i.e.

$$\begin{aligned} \begin{aligned} \varSigma _k&\sim \mathscr {IW}(\nu , \varDelta ), \end{aligned} \end{aligned}$$
(4.27)

where the first moment is given by the mean

$$\begin{aligned} \mathbb {E}[\varSigma _k]=\frac{\nu \varDelta }{\nu -d-1}. \end{aligned}$$
(4.28)

After that, we place a conditionally Matrix-Normal (MN) prior on \(A_k\) given \(\varSigma _k\), which is given by

$$\begin{aligned} \begin{aligned} A_k&\sim \mathscr {MN}(M,\varSigma _k,V), \\ \end{aligned} \end{aligned}$$
(4.29)

where M is the mean matrix of \(A_k\) and V is a symmetric, positive definite matrix that determine the covariance of \(A_k\) relative to \(\varSigma _k\). Therefore, we derive the useful expectations

$$\begin{aligned} \begin{aligned} \mathbb {E}(A_k)&= M, \\ Cov[A_k]&= \frac{1}{\nu -d+1}(V^{-1} \times \varDelta ). \\ \end{aligned} \end{aligned}$$
(4.30)

The resulting log-likelihood function of joint distribution of \({A_k, \varLambda _k}\) is defined as \(log\ p(A_k, \varSigma _k | \nu , \varDelta , M, V^{-1})\)

$$\begin{aligned} log p(A_j, \varLambda _j | \nu , \varDelta , M, V^{-1}) = F_{\mathscr {IW}}(\nu , \varDelta ) + F_{\mathscr {MN}}(M, V) + F, \end{aligned}$$
(4.31)

where,

$$\begin{aligned} \begin{aligned} F_{WI}(\nu , \varDelta )&= -\frac{d(d-1)}{4}log\pi - \frac{d}{2}(log2)\nu -\sum ^{d}_{i=1}log \varGamma \left( \frac{\nu +1-d}{2}\right) +\frac{\nu }{2}log|\varDelta |, \\ F_{MN}(M, V)&= - \frac{d^2}{2}log2\pi +\frac{d}{2}log|V|, \\ F&= \frac{\nu -1}{2}log|\varLambda _k| - \frac{1}{2}tr(A_k^T\varLambda _kA_kV) + tr(\varLambda _kA_kVM^T) - \frac{1}{2}tr(\varLambda _k(\varDelta +MVM^T)). \end{aligned} \end{aligned}$$
(4.32)

where \(\varGamma () \) is represented the gamma function. Therefore, all the optional hyperparameters of HDP-VAR-HMM are specified by \(\nu , \varDelta , M, V\). Apparently this straight-forwardly method is computationally feasible when those four parameters are known. We will specify the detailed options in experimental section through all our experiments.

With reference to the Eq. (4.24), for the probabilistic representation of transition distribution \(\pi _j\) under the HDP-VAR-HMM prior, the number of states is unbounded and every observation assigned a unique state. The HDP prior encourages sharing states over time through a latent root probability vector \(\beta \) over the infinite set of states (see Fig. 4.9). Consider the probability \(\sum _{j=1}^\infty \beta _j = 1\) on a countably infinite set, where defines the sticking-breaking construction of the prior on \(\beta \) and first draws independent variable \(\mu _j \sim Beta(1, \gamma )\) for each state j, and then sets \(\beta _j = \mu _j \prod _{\ell =1}^{j-1}(1-\mu _\ell )\), where the concentration parameter \(\gamma \) controls the relative proportion of the weights \(\beta _j\). Here, we interpret \(\mu _j\) as the conditional probability of choosing state j amongs states \(\{j, j+1, j+2, \ldots \}\)

In expectation, we always describe the first K orders in stick-breaking as the most common states and represent their probabilities with the vector \([\beta _1, \beta _2, \ldots , \beta _K, \beta _{>K}]\), where \(\beta _{>K}=\sum _{j=K+1}^\infty \). Given this (K\(+\)1) dimensional probability vector \(\beta \), we simplify the HDP-VAR-HMM transition distributions \(\pi _j\) for each state j from a Dirichlet distribution with mean equal to \(\beta \) and variance governed by concentration parameter \(\alpha > 0\)

$$\begin{aligned}{}[\beta _1, \beta _2, \ldots , \beta _K, \beta _{>K}] \sim Dir(\alpha \beta _1, \alpha \beta _2,\ldots ,\alpha \beta _{>K}), \end{aligned}$$
(4.33)

Generally, we define the starting probability vector \(\pi _0\) from the same prior, but with much smaller variance, \(\pi _0 \sim Dir(\alpha _0\beta )\) with \(\alpha _0 \gg \alpha \), which denotes that few starting states are observed.

To capture the sufficient dynamics of given time series n with \(T_n\) observations. Our goal is to interpret these observation with the hidden state \(z_j\). However, instead of assigning variable states in a short time period, we should treat each observation in the context of its predecessors for improving the temporal consistency. A “sticky” parameter \(\kappa \) of favors self-transition by assigning extra prior mass on the transition probability \(\pi _{jj}\) was introduced. Thus, the transition representation will in a temporally consistent way, that is

$$\begin{aligned}{}[\beta _1, \beta _2, \ldots , \beta _K, \beta _{>K}] \sim Dir(\alpha \beta _1,\ldots , \alpha \beta _j + \kappa ,\ldots ,\alpha \beta _{>K}). \end{aligned}$$
(4.34)

where \(\kappa > 0\) governs the degree of self-transition bias. Informally, what we are doing is increasing the expected probability of self-transition for keeping the dynamical phenomenon consistency for many time-steps.

After constructing the sHDP-VAR-HMM model, we utilize the state-of-the-art variational inference algorithm for learning the parameters, which named as Split-Merge Monte Carlo method by Michael C. hughes et al. More information can be found in [28].

4.6.2 Anomaly Monitoring Based on Hidden State

The anomaly detector identifies whether a skill execution contains: no anomalies, one anomaly, or multiple anomalies. In this section, we introduce the mathematical modelling that makes up our detector.

  • (A) Hidden State Representation

Hidden state space modeling of multivariate time-series is one of the most important tasks in representation learning by dimensional reduction. In this work, we propose the anomaly detector is a thresholding marginal log-likelihood with Bayesian non-parametric hidden Markov model, which leads to tackle the anomaly monitoring problem in a more natural way to meet the need of real-world applications.

As we discussed above, the sHDP-VAR-HMM interpreted each observation \(y_{t}\) by assigning to a single hidden state \(z_{t}\), where the hidden state value is derived from a countably infinite set of consecutive integers \(z_{t} = k \in \{1, 2, 3, \ldots , K\}\). Figure 4.10 shows the low-dimensional hidden state space of the multi-modal time series generated by the robot during manipulation, which contains 5 hidden states.

Fig. 4.10
figure 10

An illustration for the representation of multimodal time series in low dimensional hidden state space

We denote \(\varTheta _s\) to represent all the parameters of the trained sHDP-VAR-HMM from nominal demonstrations, including the hyperparameters for learning the posterior and the parameters for the observation model definition. Our anomaly detector performs anomaly monitoring by comparing the marginal log-likelihood \(log\ p(y_{t}|\varTheta _s)\) based on current executing skill of the observation \(y_{t}\) with a learned threshold \(\rho (z)\), that derives from the most likely estimated hidden state given current observations.

(4.35)

Here, the z would be included variable values, that is, we should synchronously update the threshold in the same skill period. Let’s say the definition of an anomaly data point at any time during the manipulation is one that \(log p(y_{t}|\varTheta _s) < \rho (z)\), otherwise it deems the manipulation to be nominal.

The proposed method generates the mapping pairs of marginal log-likelihood and the most likely hidden state at any time. As specific hidden state z, we calculated the fixed log-likelihood threshold from testing the nominal demonstrations by subtracting half of range between maximum and minimum from the minimum. Unlike the anomalous log-likelihood was deviate by a certain standard deviation from the mean in [11, 12, 21], that is \(\rho (z) = \mu - c\sigma \), where the symbol c is a constant value for adjusting the sensitivity of the detector, \(\mu \) and \(\sigma \) are the estimated mean and standard deviation of the log-likelihood. Our implementation don’t need to take the constant c into consideration for each skill or hidden state.

  • (B) Joint Probability of the Observed Sequence

The threshold \(\rho (z)\) is associated with the joint probability of the observed sequence. Firstly, in order to derive the forward-backward procedure of first-order auto-regressive HMM for the joint state sequence \(z_{1:T}\) given observation \(y_{1:T}\), plus the initial observation \(y_0,\ (T \ge 1)\). We have this problem when dealing with conventional HMM, here the auto-regressive structure of ARHMM makes this situation more complicated, a modified forward-backward method could solve the problem efficiently. Let’s first write the joint distribution with the Markov property \(p(y_t|y_{t-1},\ldots ,y_1) = p(y_t|y_{t-1})\), that is

$$\begin{aligned} \begin{aligned} p(y_{0:T})&= \sum _{t=1}^T p(y_t | y_{1:t-1}), \\ p(y_t | y_{1:t-1})&= \sum _{k=1}^K p(z_t=k|y_{1:t-1})p(y_t|z_t). \end{aligned} \end{aligned}$$
(4.36)

Where \(\theta _{z_t}\) represents the observation parameters for the trained HMM and the \(z_{1:T}\) is a state path over the hidden state space. As such, if the state path is estimated, the maximum probability of the observation sequence can be obtained. However, since the true state path is hidden from the observations. For the computational convenience, the general approach would be to use the maximum likelihood state at any given moment, but this would neglect uncertainty. Instead, we use a marginal probabilistic representation over hidden states at each time step (see details in hidden state estimation of auto-regressive model section). Thus, the log-probability at time t is derived by computing the logarithm of the sum of exponentials over all hidden states

$$\begin{aligned} \mathscr {L}_t =log\ \sum _{k = 1}^K \exp \left( \frac{\alpha _t(k)\cdotp \beta _t(k)}{p(y_{0:T})}\right) , \end{aligned}$$
(4.37)

where \(\alpha _t(k)=p(z_t=k|y_{0:t})\), \(\beta _t(k)=p(y_{t+1:T}|z_t)\), are presented the forward message passing and backward message in the standard forward-backward algorithm, respectively.

$$\begin{aligned} \begin{aligned} \alpha _t(k)&= \frac{1}{p(y_t | y_{0:t-1})} p(y_t|z_t=k)p(z_t = k|y_{1:t-1}), \\ p(z_t = k |y_{0:t-1})&= \sum _jp(z_t |z_{t-1}=j)p(z_{t-1}|y_{0:t-1}). \end{aligned} \end{aligned}$$
(4.38)

The denominator represents the probability of a sequence observations, which can be calculated by Eq. (4.36). Assume that we recorded N sequential data for each skill and the length of sequence \(n \in N\) is \(T_n\). Thus, we can get a set of log-probabilities vector of each skill

$$\begin{aligned} \mathscr {L}^{1:N} = \{\mathscr {L}_{T_1}^1,\mathscr {L}_{T_2}^2, \ldots , \mathscr {L}_{T_N}^N\}. \end{aligned}$$
(4.39)

The above implies that the forward—backward procedure can be used to find the most likely state for any moment t. It cannot, however, be used to find the most likely sequence of states.

  • (C) Hidden State Estimation of Auto-regressive Model

For offline constructing the state and log-likelihood pairs, we are interested in the most likely state sequence upon an observation sequence \(y_{0:T}\). One might be derived from copulating the MAP sequence by maximizing the marginal probability at each observation i

$$\begin{aligned} \hat{z}_i = \underset{z_i}{arg\ max}\ p(z_i|y_{0:T}). \end{aligned}$$
(4.40)

We note this combined single observation estimated sequence might yield unlikely (even 0-probability) sequences. We therefore consider

$$\begin{aligned} \hat{z}_{1:T} = \underset{z_{1:T}}{arg\ max}\ p(z_{1:T}, y_{0:T}). \end{aligned}$$
(4.41)

where, \(p(z_{1:T}, y_{0:T})\) was computed in Eq. (4.36). Actually, we note the most likely state sequence that minimum the cost path to \(z_t\) is equivalent to the minimum cost path to node \(z_{t-1}\) plus the cost of a transition from \(z_{t-1}\) to \(z_t\), and the cost incurred by observation \(y_t\).

For convenience, we abbreviate the joint log-likelihood function \(\mathscr {L}(z_{1:T}) \triangleq log\ p(z_{1:T}, y_{0:T})\), and the transition probability \(\pi _{z_{t-1}} \triangleq z_{t-1}\), then

$$\begin{aligned} \mathscr {L}_T(z_{1:T}) = s_1(z_1)+\sum _{t=2}^{T}s_t(z_t, \pi _{z_{t-1}}), \end{aligned}$$
(4.42)

where

$$\begin{aligned} \begin{aligned} s_1(z_1)&=log\ (\pi _{z_1} p(y_1|\theta _{z_1},y_0)),\quad t=1 \\ s_t(z_t, z_{t-1})&=log\ (p(z_t|z_{t-1})p(y_t|\theta _{z_t},y_{t-1})),\ t \ge 2 \end{aligned} \end{aligned}$$
(4.43)

Obviously, \(\mathscr {L}_t(z_{1:t}) = \mathscr {L}_{t-1}(z_{1:t-1}) + s_t(z_t, z_{t-1})\), and set

$$\begin{aligned} \begin{aligned} \upsilon _t(z_t)&= \underset{z_{1:t-1}}{max}\ \mathscr {L}_t(z_{1:t}), \qquad z_t \in \{1,2,3,\ldots ,K\} \\ \upsilon _t(z_{t-1})&= \underset{z_{t}}{max}\ \{\upsilon _{t-1}(z_{t})+s_t(z_t, z_{t-1}\}, \end{aligned} \end{aligned}$$
(4.44)

From the above, The maximizing state sequence is iteratively obtained, for \(n\in \{T-1,T-2,\ldots ,1\}\)

$$\begin{aligned} \hat{z}_T = \underset{z_{t-1}}{arg\ max}\ \upsilon _n(z_{t-1}), \end{aligned}$$
(4.45)

Assume that we recorded N sequential data for each skill and the length of sequence \(n \in N\) is \(T_n\). Thus, we can get a set of the most probable states vector of each skill

$$\begin{aligned} \mathscr {Z}^{1:N} = \{\mathscr {Z}_{T_1}^1,\mathscr {Z}_{T_2}^2, \ldots , \mathscr {Z}_{T_N}^N\}. \end{aligned}$$
(4.46)
  • (D) Threshold Calculation

Having computed the joint probability of the observations and the most probable states for each skill s. we can concatenate them to compute the threshold of specific state \(z=k\), that is \(\rho (z_k)\). Each state consists of its maximum \(\mathscr {L}_{z_k}^{max}\) and minimum \(\mathscr {L}_{z_k}^{min}\). We cluster the log-likelihoods with the same state, which represents the clustering data by HMM belong to the same statistical distribution.

$$\begin{aligned} \begin{aligned} \mathscr {L}_{z_k}^{max}&= \underset{i\in \{1,\ldots ,N\}}{max} \mathscr {L}_{i}^{n} \mathbf {1}(z_k == k), \\ \mathscr {L}_{z_k}^{min}&= \underset{i\in \{1,\ldots ,N\}}{min} \mathscr {L}_{i}^{n} \mathbf {1}(z_k == k), \\ \mathscr {L}_{z_k}^{range}&= \mathscr {L}_{z_k}^{max} - \mathscr {L}_{z_k}^{min}, \end{aligned} \end{aligned}$$
(4.47)

where the notation \(\mathbf {1}(\eta )\) to represent an indicator random variable that is 1 if the event \(\eta \) occurs and 0, otherwise. Using Eq. (4.47), we can set the anomaly monitoring threshold of each active state \(z_k\) at any moment according to:

$$\begin{aligned} \rho (z_k) = \mathscr {L}_{z_k}^{min} - \frac{\mathscr {L}_{z_k}^{range}}{2}. \end{aligned}$$
(4.48)

In testing implementation, we preprocessed the incoming multimodal data in the same scheme as the training process and then get the most possible state \(z_{test}\) by recursively compute the filtered marginals and the corresponding log-likelihood \(\mathscr {L}_{test}\) using Eqs. (4.38) and (4.37) respectively.

$$\begin{aligned} z_{test} = \underset{1,\ldots ,K}{arg\ max}\ \{p(z_t=1|y_{0:t}),\ldots ,p(z_t=K|y_{0:t})\}. \end{aligned}$$
(4.49)

Here, this approach detects an anomaly when the log-likelihood is lower than the state dependent threshold \(\rho (z_{test})\) This approach also extends to online detection. Given the trained sHDP-VAR-HMM and streaming sensor data, the detector can recursively compute the \(z_t\) and \(\mathscr {L}_t\) at time t, and then perform the comparison, if \(\mathscr {L}_t < \rho (z_{test})\), then anomaly, else nominal.

4.6.3 Experiments and Results

  • Finite State Machine Based Kitting Experiment

In this section, we describe the kitting experimental setup, external disturbances, data collection, and anomaly monitoring and diagnoses model parameters.

A kitting experiment is setup, where a robot and a human co-worker closely collaborate to place a set of goods in a packaging box. Specifically, a human co-worker is tasked to place a set of 6 objects on the robot’s “collection bin” (located in front of the robot) in a one-at-a-time fashion. The objects may accumulate in a queue in front of the robot. As soon as the first object is placed on the table, the robot identifies the object and places it in a packaging box located to the right of the robot. Thus, the robot picks an object and transports it towards the box, after which, the robot appropriately places each of the six objects in different parts of the box. The whole implementation is shown in Fig. 4.11.

Fig. 4.11
figure 11

A kitting experiment consists of 6 skills (in various color) that were modeled by DMP, respectively. Objects that need to be packaged are placed by a human collaborator before the robot in a collection bin. The shared workspace affords possibilities for accidental contact and unexpected alteration of the environment. The robot is tasked to pick each one of the objects and place them in the packaging box to its right. The visualization module uses the ALVAR tags to provide a consistent global pose with respect to the base of the robot. Ten nominal executions and one snapshot of each skill are shown

The kitting experiment consists of 6 skills (Skill 3) : Home \(\rightarrow \) Pre-pick; (Skill 4) : Pre-pick \(\rightarrow \) Pick; (Skill 5): Pick \(\rightarrow \) Pre-pick; (Skill 7) : Pre-pick \(\rightarrow \) Pre-place; (Skill 8) : Pre-place \(\rightarrow \) Place; (Skill 9) : Place \(\rightarrow \) Pre-place that were effectively modeled by the dynamic movement primitives and the ROS-SMACH.Footnote 1 One-shot kinesthetic demonstrations were used to train DMP models for each of the skills of the kitting experiment. Note that a skill’s starting and ending pose can be adapted if necessary thus providing a flexible and generalizable skill encoding procedure. The trajectory adaptations however increase the difficulty of assessing the sensory data for both anomaly monitoring and diagnoses—due to the variability in the data collection even from nominal executions.

  • The Robot, Ojects, and Visual System

A Baxter humanoid robot’s right arm is used to pick commonplace objects set before him. The robot is equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard electric pinching fingers, as well as Baxter’s left hand camera. Each finger is further equipped with a multimodal tactile sensor composed of: (i) a 4 \(\times \) 7 taxel matrix that yields absolute pressure values, (ii) a dynamic sensor which provides a single capacitive reading in millivolts (mV) useful to detect tactile events, and (iii) an IMU and gyroscope [34]. Baxter’s left hand camera is placed flexibly in a region that can capture objects in the collection bin with a resolution of \(1280\times 800\) at 1 fps (we are optimizing pose accuracy and lower computational complexity in the system). The use of the left hand camera facilitated calibration and object tracking accuracy.

A set of 6 common household objects consisting of box-like shapes and bottles were used in our work. The objects ranged in weight from 0.0308 to 1.055 kg and in volume from \(3.2 \times 10^{-04}\,\mathrm{m}^3\) to \(1\times 10^{-03}\,\mathrm{m}^3\). The object’s surfaces also varied slightly: some heavier objects had sleeker surfaces to incite object slips—we believe not an unreasonable determination as warehouses contain a wide variety of objects—whilst other objects had rougher surfaces. Across executions, object locations and order was varied to promote generalization.

Fig. 4.12
figure 12

Seven representative unexpected anomalies in the kitting experiment (left-to-right): Human Collision (HC), Tool Collision (TC), Object Slip (OS), Human Collision with Object (HCO), Wall Collision (WC), No Object (NO), Other (OTHER)

Alvar tags,Footnote 2 with 0.06 m sides, were placed around the circumference of the objects for robust visual recognition (ALVAR can handle change in lighting conditions, optical flow-based tracking, and good performance for multitag scenarios) regardless of orientation.

  • External Disturbances

External disturbances may be introduced into a system for a variety of reasons. We list possible scenarios in which disturbances may occur in real-systems, as shown in Fig. 4.12:

  1. (1)

    For collaborative jobs (such as kitting in warehouses, see the description in experiment setup section), humans may experience monotony leading to boredom and loss-of-attention. In such cases, it is possible for a human co-worker to accidentally collide with the robot manipulator or alter the environment in unexpected ways.

  2. (2)

    The user may also accidentally collide or unintentional move packaging objects in ways a robot may not anticipate. Such variations in the world may lead to tool-collisions.

  3. (3)

    Picked objects may slip from a robot’s gripper if the grasp is not optimal; or if upon motion, inertial forces acting on the object cause slippage that breaks the grasp.

  4. (4)

    Object slips may also be caused due to human collisions. Such phenomena depicts a chain of anomaly reactions where one anomaly leads to another.

  5. (5)

    The robot may also collide with packaging box during kitting.

  6. (6)

    Missed-grasps (where the robot pinches air) are possible when the target object moved without the robot’s notice.

  7. (7)

    Additionally, it is possible to suffer from false-positives from the anomaly detector. False-positives may occur for numerous reasons: system errors, unreachable objects, unfeasible inverse kinematic solutions, unidentifiable objects from the visual system, to name a few.

The seven types of anomalies we just described will be declared more succinctly in the rest of the paper as: Human-Collisions (HC), Tool-Collisions (TC), Object Slips (OS), Human-Collision-with-Object (HCO), Wall-Collision (WC), No-Object (NO), and Other (OTHER) respectively. In Sect. 4.6.1, the introspection methodology used to model robot skills, continued by a presentation of anomaly monitoring in Sect. 4.6.2 and anomaly diagnoses in Chap. 5.

  • Dataset Collection and Sensory Preprocessing

It is undeniable that bias exists on the side of robot researchers as to which anomalies may exist in the task. In this work, we first attempt to discover likely anomalies in this task by emulating a collaborative kitting task in which the human collaborator experiences tedium and monotony resulting in undesired events that lead to anomalous events.

We tasked 5 participants—whose ages range from 23 to 28—to act as a collaborator in the kitting task under the monotonous conditions mentioned in experiment setup section. One expert user and four novice users participated in the experiments. Novice users learned from the expert to induce anomalies by observing a round of executions. Each participant performed 1 nominal and 6 anomalous executions by placing the set of six household objects in a one-by-one fashion. For training, when an anomaly occurs, the robot execution is halted and the data stream is labelled with an anomaly class (namely the labels with HC, TC, OS, NO, etc.) for supervised classification during testing. Subsequently, anomalous time-stamps are also extracted. In total, we collected a data set with 18 nominal executions and 180 anomalous executions (each anomalous trial has at least one anomaly across the entire task), and 38 failure trails are ignored.

As for our multimodal observation vector, it is comprised of 6 DoF force-torque signals, 6 DoF Cartesian velocity signals (from Baxter’s right end-effector), and 56 DoF tactile signals from both the left and right tactile sensors. Empirical feature extraction is performed on this observation vector to improve diagnoses performance. We now describe how features for each modality at time t were engineered.

Wrench modality: Assume the force and torque sequence is a time-series vector and that each element represents the magnitude of each dimension \((f_x, f_y, f_z, t_x, t_y, t_z)\). We wish to extract structural information from the wrench instead of simply using raw values. In this way, we can find signal patterns that may occur across the different DoFs. To this end, we computed the norm of both the force \(n_f\) and the torque \(n_t\) as features respectively:

$$\begin{aligned} n_{f} = \root \of {f_x^2 + f_y^2 + f_z^2}, \quad n_{t} = \root \of {t_x^2 + t_y^2 + t_z^2}. \end{aligned}$$
(4.50)

Velocity modality: Similarly, we take the Cartesian linear \((l_x, l_y, l_z)\) and angular \((a_x, a_y, a_z)\) velocities and compute their norm \(n_{l}\) and \(n_{a}\) respectively:

$$\begin{aligned} n_{l} = \root \of {l_x^2 + l_y^2 + l_z^2}, \quad n_{a} = \root \of {a_x^2 + a_y^2 + a_z^2}. \end{aligned}$$
(4.51)

Tactile modality: Due to the computational cost of processing the tactile sensor’s high dimensionality, we empirically tested a number of features for each tactile sensor, they include: the maximum taxel value, the largest five taxel values, the mean of all taxel values, and the standard deviation for all taxel values. It was the standard deviation, which proved to be the most useful feature for anomaly monitoring and diagnoses. The standard deviation for each tactile sensor \(s_l\) and \(s_r\) are defined as:

$$\begin{aligned} s_{l} = \root \of {\frac{1}{28}\sum _{i=1}^{28}(l_i-\mu _l)}, \quad s_{r} = \root \of {\frac{1}{28}\sum _{i=1}^{28}(l_i-\mu _r)}. \end{aligned}$$
(4.52)

where \(\mu _l=\frac{1}{28}\sum _{i=1}^{28}l_i\) and \(\mu _r=\frac{1}{28}\sum _{i=1}^{28}r_i\) are the mean of each tactile sensor respectively. Equations (4.50)–(4.52) facilitate the identification of diverging signals during anomalous situations. We empirically concatenate all the valuable features for representing the robot executions both in nominal and anomalous cases. If raw signals are used directly, they result in high false-positive detection rates (FPR) in our experimentation. For instance, when a robot carries objects of varying weights, the F/T signals for such operations vary significantly making it difficult to use raw values for detection. Hence, a standardization method is used to scale the extracted features through a normalization constant. We ultimately end with an feature vector of length 18 as shown below:

$$\begin{aligned} \begin{aligned} y_t = \bigg [&\frac{f_x}{10}, \frac{f_y}{10}, \frac{f_z}{10}, \frac{t_x}{10}, \frac{t_y}{10}, \frac{t_z}{10}, \frac{n_f}{10},\frac{n_t}{0.4},l_x,l_y,l_z,a_x,a_y,a_z,\frac{n_l}{0.4},n_a,\frac{s_l}{250},\frac{s_r}{250}\bigg ]. \end{aligned} \end{aligned}$$
(4.53)

Note, that all training and testing data are pre-processed in the same manner. We will also use the same pre-processing approach when comparing with other baseline approaches.

  • Basic Parameter Settings of Each Model

For anomaly monitoring and diagnoses using the sHDP-VAR-HMM with a first-order autoregressive Gaussian likelihood, each state k has two parameters to be defined: the regression coefficient matrix \(A_k\) and the precision matrix \(\varLambda _k\). The conjugate prior is the MNIW for which four parameters \(\nu , \varDelta , V, M\) are assigned values in a fashion similar to the Motion-Capture-Dataset in [28]. In order to guarantee the prior has a valid mean, the degrees-of-freedom variable is set with \(\nu = d + 2\) and we set \(\varDelta \) by constraining the mean or expectation of the covariance matrix \(\mathbb {E}[\varLambda _k^{-1}]\) under the Wishart prior in Eq. (4.30).

$$\begin{aligned} \mathbb {E}[\varLambda _k^{-1}] = s_F \sum _{n=1}^N\sum _{t=1}^{T_n}(y_t - \overline{y})(y_t - \overline{y})^T. \end{aligned}$$
(4.54)

Assume that we record N sequential data for each skill and the length of sequence \(n \in N\) is \(T_n\). Thus, we can easily define the parameter \(\varDelta \) accordingly as

$$\begin{aligned} \varDelta = (\nu - d -1) \mathbb {E}[\varLambda _k^{-1}]. \end{aligned}$$
(4.55)

Specifically, we placed a nominal prior on the mean parameter with mean equal to the empirical mean and expected covariance equal to a scalar \(s_F\) times the empirical covariance, here \(s_F = 1.0\). This setting is motivated by the fact that the covariance is computed from polling all of the data and it tends to overestimate mode-specific covariances. A value slightly less than or equal to 1 of the constant in the scale matrix mitigates the overestimation. Also, setting the prior from the data can move the distribution mass to reasonable parameter space values. The mean matrix M and V are set such that the mass of the Matrix-Normal distribution is centered around stable dynamic matrices while allowing variability in the matrix values (see Sect. 4.6.1 for details).

$$\begin{aligned} \begin{aligned} M&= \mathbf {0}_d, \\ V&= 1.0*\mathbf {I}_d. \end{aligned} \end{aligned}$$
(4.56)

where \(\mathbf {0}_d\) and \(\mathbf {I}_d\) are the matrix of all zeros and the identity matrix, respectively, each of size \(d \times d\). For the concentration parameters, a Gamma(ab) prior is set on HDP concentration parameters \(\alpha \). A Beta(cd) prior is set on the self-transition parameter \(\mu \) and the degree of self-transition bias \(\kappa \) is set to 50. We choose a weekly informative setting and choose: \(a = 0.5, b = 5, c =1, d =5\). Specifically, the initial transition proportion parameter is set \(\mu \sim Beta(1, 10)\). The Split-Merge Monte Carlo method sampling parameter maximum iterations is set to 1000. The truncation states number is set to \(K=5\) for anomaly monitoring, and \(K=10\) for anomaly diagnoses. Then, the HDPHMM model of each anomaly class or nominal skill is trained. Note, that as with pre-processing, we also use the same model parameters in training and testing as well as in other baseline approaches.

  • Results and Analysis of Anomaly Monitoring in Kitting Experiment

According to Sect. 4.6.2, the anomaly detector implementation consists of representing multimodal observations through latent states, concatenating the derived log-likelihood values of a latent state, and calculating the anomaly monitoring threshold.

We first generate the latent state partitions for nominal demonstrations with varying speeds and goals. Notice how the non-parametric method yields varying dynamics across executions reflecting statistical variations in the robot motion trajectory due to changes in the desired trajectory completion time and varying goal distances.

Then, latent state log-likelihood values are concatenated and the corresponding thresholds computed. To this end, we visualized the performance of the proposed method as well as a baseline by assessing the anomaly time reaction of anomaly flags; that is, which algorithm can make a correct detection at a time closer to the ground truth. For training, ten nominal executions were used along with 3-fold cross validation to obtain nominal sHDP-VAR-HMM models for each skill. The other 20 nominal executions were used for testing.

Fig. 4.13
figure 13

All the log-likelihood values of individual hidden state of skill 3 when the truncation states number is set to \(K=5\). Each subplot shows the mean value (black dotted line) and the corresponding log-likelihood threshold value (red solid line) in Eq. (4.48)

The execution varying threshold for anomaly monitoring was set according to Eq. (4.48). The results for skill 3 are visualized in Fig. 4.13. As seen in Fig. 4.13, latent state \(zhat=2\) is the most prominent latent state (the majority of the observations in the skill are ascribed to this state). We now analyze the time sensitivity of our system in detecting anomalies. Anomaly identification sensitivity is intimately connected to threshold setting. We compare the results of our execution varying threshold with a fixed threshold presented in [35]. The comparison is seen in Fig. 4.14. Under nominal situations both methods have the same robustness in avoiding false-positives. In anomalous situations, there are multiple occasions in which our varying threshold yielded better time sensitivity and detection accuracy. From Fig. 4.14, we see that for executions 2 and 4, our varying threshold detector identified anomalies that the fixed threshold method could not. Notice that the distance between the \(anomaly\_t\_by\_human\) and the \(anomaly\_t\_by\_detector\) is larger for the fixed threshold (recall from the dataset collection section that ground truth time stamps were obtained during training). This improved sensitivity reduces false-negatives and helps us flag anomalies at times closer to the ground truth.

Fig. 4.14
figure 14

The comparison of two anomaly detectors for five robot anomalous executions. Each row represents the same testing execution. Left column: results of our hidden state-based anomaly detector. Right column: results of the gradient-based anomaly detector [21]. The ground truth (\(anomaly\_t\_by\_human\)) are represented by green dashed lines and the detected anomalies are represented by yellow solid lines

Finally, Table 4.5 shows the anomaly monitoring performance for our hidden-state detector (HSD) and the baseline gradient-based detector (GD) for the six skills used in the kitting task for 368 nominal executions and 136 anomalous executions. Our results indicate that while the GD system had the HSD resulted in higher 16.6% better precision, our system had 76.0% better recall. The difference is significant in reducing false-negatives and significantly raising the sensitivity of our system. The sensitivity improvement is almost four times bigger than the difference in precision. For this reason, our system reflects an F1-score that is 25.0% larger and an accuracy that is F1-score, and an accuracy that is 5.7% higher standing at 91.0%.

Table 4.5 The comparison of anomaly monitoring performance measures between our proposed method and a baseline method

4.6.4 Discussion

Comprehensive experimental results showed robust, sensitive, and better timed anomaly monitoring for a wide range of anomalous situations in an unstructured co-bot scenario where a human and a robot collaborated to complete kitting tasks. The improved anomaly detector exploits the mapping relationship between latent states and the marginal log-likelihood values from the sHDP-VAR-HMM for monitoring anomalies. Our evaluations showed that we could detect anomalies reliably (overall accuracy of 91.0%).

With regards to anomaly monitoring, Table 4.5 indicated that our proposed detector achieved unexpectedly superior \(\textit{recall}\) measurements compared to the baseline (the gradient-based method), an improvement of 61.7%. By adjusting the anomaly monitoring threshold during the task’s progress execution task as a function of the maximum and minimum of the log-likelihood of observations for latent states, we gained enhanced sensitivity. Human-centric safety approaches dictate that increased sensitivity is highly desirable in human-robot collaboration tasks so as to provide higher safety guarantees for a human. Robotic workcells on the other hand may work better with approaches like the gradient-based detector that was less sensitive but had higher precision. Here the reduction in the false-positive rate may support longer-term autonomy in a workcell. The anomaly detector exhibited not only a better \(\textit{accuracy}\) but also more accurately timed anomaly flags. That is, our anomaly flags occurred closer to the ground truth time. This would be useful. Timeliness may prove a critical role in cases where anomalies are triggered as a chain reaction. That is anomalies that cause more anomalies. By acquiring more timely detection, we may be able to also provide more timely recovery techniques or even prevention so as to minimize the possibility of a cascade of anomalies.

4.7 Summary

In this chapter, there are three kinds of anomaly monitoring methods in various threshold are presented based on the Bayesian nonparametric HMMs, namely: log-likelihood-based threshold, threshold based on the gradient of log-likelihood, and computing the threshold by mapping latent state to log-likelihood. All proposed methods and implementation processes were validated on the experimental platforms of “HIRO-NX robot performs electronic component assembly tasks” and “Baxter robot performs object packing tasks” using the ROC curve or accuracy-precision-recall rate-F1 score as a performance evaluation indicator. The results are as follows:

  1. (1)

    The anomaly monitoring method using log-likelihood-based threshold can effectively realize the online monitoring, but there are a large number of false positive (false triggering) and has an uncertain constant coefficient c. For the task of HIRO-NX robot performing the electronic components assembly, sHDP-VAR(2)-HMM obtains the optimal performance where the constant value of the threshold is \(c=3\) and the observation model is the second-order linear Gaussian regression model, resulting in a false positive rate of about 5.0% and a true positive rate of about 95.0%.

  2. (2)

    The method of threshold based on the gradient of log-likelihood can effectively solve the problem of false triggering, but resulting in low sensitivity and monitoring success rate with a fixed lower threshold. For the task of Baxter robot to performing object kitting experiment, we obtain the average monitoring accuracy of a total of 67 anomaly events is 93.8%, far better than the 59.2% of the log-likelihood-based threshold.

  3. (3)

    The anomaly monitoring method based on computing the threshold by mapping latent state to log-likelihood uses the dynamic threshold to solve the problem of false triggering, low sensitivity and success rate, but the dynamic threshold depends on the inference of the belief state, there will be a rapid transition phenomenon, resulting in a higher risk of false positive rate. For the task of Baxter robot to performing object kitting experiment, resulting in the average monitoring accuracy of a total of 136 abnormal events in the is 91.0%, which is much better than the 86.0% of the threshold based on the gradient of log-likelihood.