1 Introduction

Robots are progressively spreading to logistics, social and assistive domains. However, in order to become handy co-workers and helpful assistants, they must be endowed with abilities quite different from those of their industrial ancestors (Torras 2016). Moving robots from simple, structured problems to unstructured environments requires a very specific set of skills and knowledge (Billard et al. 2022).

When enabling complex robotics applications, it is often much easier for a human to demonstrate the desired behavior than to engineer it. This is the main principle behind robot learning from demonstration (LfD): end-users can teach robots new tasks without the need for expert programming.

1.1 Learning from demonstration

Learning from demonstration (LfD) is the paradigm in which robots implicitly learn task constraints and requirements from demonstrations given by a human teacher (Ravichandar et al. 2020). This allows a more intuitive skill transfer and opens policy development to non-robotics experts, a growing need as robots extend to assistive domains. Flexible models that learn the task by extracting relevant motion patterns from the demonstrations, and subsequently apply these patterns to perform the task in different situations, are essential for transferring human skills to robots. Over the last decade, learning from demonstration has been an intensive field of study, and research interest has steadily increased. Note that, although we use the term learning from demonstration to encompass the field as a whole, other terms such as imitation learning, programming by demonstration, and behavioral cloning are also popular in the literature.

Different learning approaches, namely supervised, reinforcement, and unsupervised, have been used to address a plethora of problems in robot learning. The choice between the different methods is not trivial and depends on the problem of interest (Chen et al. 2020). From a general perspective, to allow robots to learn skills from human demonstrations, we need to develop a system that records demonstrations by experts, learns the ideal behavior from the available demonstrations, and reproduces it.

Several survey papers on robot learning from demonstration provide distinct overviews of the field, each approaching it from a different perspective (Ravichandar et al. 2020; Osa et al. 2018a).

1.2 Trajectory-based robot learning methods

Algorithms that encode skills using trajectory-based representations are the dominant family in learning from demonstration research (Colomé and Torras 2020). These methods rely on low-level controllers to execute the trajectories required to perform the taught skill. Skills are encoded by extracting trajectory patterns from demonstrations (Fig. 1), using a variety of techniques to retrieve a generalized shape of the trajectory (Calinon and Lee 2019). The main reason behind the popularity of these algorithms is that, assuming the system is fully actuated (which is the case for most robot manipulators), no knowledge of the robot dynamics is required.

Fig. 1

The Gaussian-Process-based LfD approach allows teaching robots tasks such as opening doors

For addressing the learning from demonstration problem, we can assume that there exists a direct and learnable function (i.e., the policy) that generates the desired behavior. This policy can be defined as a function that maps available information onto an appropriate action space

$$\begin{aligned} \pi :\mathcal {X}\longrightarrow \mathcal {Y} \end{aligned}$$
(1)

where \(\mathcal {X}\) represents the inputs required to execute the policy and \(\mathcal {Y}\) the action space. The objective is to learn this policy \(\pi ()\), which allows the reproduction of the skill taught by the expert. For this, the robot is presented with a demonstration (i.e. training) dataset which consists of sample input-action pairs

$$\begin{aligned} \mathcal {D}=\left\{ \left( {\varvec{x}}_i,{\varvec{y}}_i\right) \right\} _{i=1}^N=\left( X,Y\right) \end{aligned}$$
(2)

where \({\varvec{x}}_i\in \mathcal {X}\), \({\varvec{y}}_i\in \mathcal {Y}\), N stands for the number of samples, and \(X\in \mathbb {R}^{\dim \left( \mathcal {X}\right) \times N}\) and \(Y\in \mathbb {R}^{\dim \left( \mathcal {Y}\right) \times N}\) represent the matrices where all the column input and output vectors are aggregated, respectively. From the formulation of the problem, we can see that the first key aspect in LfD involves identifying the appropriate inputs and outputs of the policy. Trajectories are the most popular choice, since they govern the robot actions in a myriad of robotic systems.
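For concreteness, the following minimal sketch shows how the dataset \(\mathcal {D}=(X,Y)\) of Eq. (2) could be assembled from a set of recorded demonstrations. The function name and the assumed column layout \([t, x, y, z, \theta u_x, \theta u_y, \theta u_z]\) are illustrative assumptions, not part of the original implementation.

```python
# Illustrative sketch: aggregating M demonstrations into D = (X, Y) of Eq. (2).
import numpy as np

def build_dataset(demos):
    """demos: list of arrays of shape (T_m, 7), columns [t, x, y, z, a_x, a_y, a_z]."""
    X = np.concatenate([d[:, :1] for d in demos], axis=0).T  # inputs,  dim(X) x N
    Y = np.concatenate([d[:, 1:] for d in demos], axis=0).T  # outputs, dim(Y) x N
    return X, Y
```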

Another fundamental feature of LfD methods is the possibility of retrieving a probabilistic representation of the policy. This allows a complete description of the task, encoding the uncertainty along with the motion, which is crucial for reflecting the importance of certain points of the task and leads to better generalization capabilities.

Also, in LfD it is desirable to adapt the learned motion to unseen scenarios, without re-training the model, while maintaining the general trajectory shape of the demonstrations. Commonly, these requirements are expressed as via-point constraints or as the blending of multiple movement policies.

In this work, we present a general Gaussian-Process-based learning from demonstration approach. By exploiting the potential that Gaussian Process models offer, we aim to unify, in a single and entirely GP-based framework, the main features required of a state-of-the-art LfD approach.

2 State-of-the-art

Over the past two decades, trajectory-based robot learning from demonstration has been an intensive field of study. Among the most relevant contributions, the following methods can be highlighted: Dynamic Movement Primitives (DMP) (Ijspeert et al. 2001; Pastor et al. 2009; Saveriano et al. 2021), Probabilistic Movement Primitives (ProMP) (Paraschos et al. 2018; Ewerton et al. 2019; Frank et al. 2021), Gaussian Mixture Model-Gaussian Mixture Regression (GMM-GMR) (Calinon 2016; Pignat and Calinon 2019; Pignat et al. 2022), Kernelized Movement Primitives (KMP) (Huang et al. 2019b, c), and Gaussian Processes (GP) (Nguyen-Tuong and Peters 2008; Forte et al. 2010; Schneider and Ertel 2010). These representations have proved successful at learning and generalizing trajectories. However, each model presents its own strengths and shortcomings.

The main advantage of probabilistic-based methods (GMM-GMR, ProMP, KMP and GP) is that they not only retrieve an estimate of the underlying trajectory across multiple demonstrations, but also encode its variability by means of a covariance matrix. This information, which can be inferred from the dispersion of the collected data, can be exploited for the execution of the task, i.e., specifying the robot tracking precision or switching the controller (Silvério et al. 2018).

Unlike the probabilistic methods, DMP requires only a single demonstration, at the cost of not encoding the variability of the task. Generalization is achieved by assuming trajectories to be solutions of a deterministic dynamical system, which has proven remarkably successful in generating smooth trajectories from an arbitrary initial state. To capture higher-order statistics, a unified framework fusing dynamic and probabilistic movement primitives (ProDMP) (Li et al. 2022), which recovers a linear basis-function representation of the trajectories by solving the dynamical system, has recently been proposed. However, a drawback of DMP, and also of ProMP, is that they rely on the manual specification of basis functions, which requires expert knowledge and makes the learning problem with high-dimensional inputs almost intractable. GMM-GMR, in contrast, has proven successful in handling such demonstrations. KMP and GP, thanks to their kernel treatment, can be applied to manipulation tasks where high-dimensional inputs and outputs are required (Huang et al. 2021d).

It is also of interest in LfD to transfer the learned motion to unseen scenarios while maintaining the general trajectory shape of the demonstrations. By exploiting the properties of probability distributions, ProMP, KMP and GP allow for trajectory adaptations with via-points. On the other hand, although GMM-GMR is formulated in terms of Gaussian distributions, re-optimization of the learned policy requires re-estimating the model parameters, which lie in a high-dimensional space. This makes the adaptation process very expensive and prevents its use in unstructured environments, where policy adjustment is key.

Besides the generation of adaptive trajectories, another desired property in LfD is extrapolation. In this regard, there is an interesting duality between GMM-GMR and GP representations. The covariance matrices of the former model the variability of the trajectories, whereas the latter provide a measure of the prediction uncertainty, with the variance increasing in the absence of training data. This information is relevant when trying to generalize the learned motion outside the demonstrated action space. The simultaneous exploitation of both measures is considered in KMP (Silvério et al. 2019). Moreover, in a recent work, Jaquier et al. (2019) propose a GMM-based GP for encoding the trajectory (GMR-GP), a method with clear similarities to KMP, since both are kernel-based. GMR-GP takes advantage of the ability of GP to encode prior beliefs through the mean and kernel functions, and of the capability of GMR to make predictions far from the training data. Nevertheless, in both KMP and GMR-GP, the improvement with respect to GMR comes at the cost of increased complexity with respect to GP representations. Furthermore, by defining a prior for the process, the GP and GMR-GP frameworks allow the representation of more complex behaviors than KMP.

In recent years, there has been a growing interest in Gaussian Processes (Schulz et al. 2018). The main advantage of GP over the previously discussed methods is their ability to encode prior beliefs through the mean and kernel functions. This allows the representation of more complex behaviors in the regions of the action space where demonstration data is sparse. However, the evaluation of GP models demands considerable computational resources in terms of both computation and memory (Nelles 2020). A few works have studied the use of an entirely GP-based representation in the LfD context (Nguyen-Tuong and Peters 2008; Forte et al. 2010). Among the most representative is the one presented by Schneider and Ertel (2010), who propose a representation of a pick-and-place task that effectively encodes the task variability using a heteroscedastic GP. Similarly, Umlauft et al. (2017) estimate the prediction uncertainty separately, using Wishart Processes, and retrieve the learned trajectory by combining GP and DMP. Neither of these works considers the adaptation of the learned policy. Other works formulate the learning and motion planning problem within a single GP-based framework (Osa et al. 2018b; Rana et al. 2017), where the entire trajectory is retrieved from an optimization perspective. However, this becomes inefficient as the length of the trajectory and the dimensionality of the learning problem increase. Finally, in a recent work, Wilcox and Yip (2020) apply GP regression to online non-parametric Bayesian model learning for real-time robot control. However, they do not focus on the trajectory learning problem, but on robot teleoperation.

A drawback of GP is that they are usually defined only in Euclidean space, even though a formulation with a non-Euclidean input space is possible in principle (Lang et al. 2018). Thus, when modeling task-space trajectories, the representation of orientation imposes great challenges, since it is accompanied by additional constraints. This aspect, which is critical in LfD, is disregarded in the aforementioned GP-based methods. Some works have successfully addressed this question with DMP (Koutras and Doulgeri 2019; Abu-Dakka and Kyrki 2020), GMM-GMR (Zeestraten et al. 2017; Jaquier et al. 2021) and KMP (Huang et al. 2019a; Abu-Dakka et al. 2021). More recently, Lang et al. (2015), Lang and Hirche (2017) and Jaquier et al. (2022) have proposed efficient GP representations for 6-DoF rigid motions. Due to its greater simplicity, we have adopted in our framework the approach developed by Lang and Hirche (2017).

3 Structure and contributions of this paper

In this work, we present a general Gaussian-Process-based learning from demonstration approach. For the purpose of clear comparison, the main contributions of the state-of-the-art and our approach are summarized in Table 1.

Table 1 Comparison between the LfD state-of-the-art and our approach

We show how to achieve an effective representation of the manipulation skill, inferred from the demonstrated trajectories. We unify both the task variability and the prediction uncertainty in a single concept, referred to as task uncertainty in the remainder of the paper. Furthermore, in order to achieve an effective generalization across demonstrations, we propose the novel Task Completion Index for the temporal alignment of task trajectories. Finally, we address the adaptation of the policy through via-points, and the modulation of the robot behavior depending on the task uncertainty through variable admittance control.

The paper is structured as follows: in Sect. 4 we discuss the theoretical aspects of the considered GP models; in Sect. 5 we present the proposed learning from demonstration framework; in Sect. 6 we illustrate the main aspects of the paper through a real-world application with the TIAGo robot; finally, in Sect. 7, we summarize the final conclusions.

4 Gaussian process models

In this section we discuss the theoretical background of the proposed LfD approach. First, we present the fundamentals of GP. Then, we address the challenges of modeling rigid-body motions with them. Finally, we show how heteroscedastic GP allow an accurate representation of the uncertainty of the taught manipulation task.

4.1 Gaussian process fundamentals

Intuitively, one can think of a Gaussian process as defining a distribution over functions, with inference taking place directly in the space of functions. Formally, a GP is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams 2006). It can be completely specified by its mean m(t) and covariance \(k(t,t')\) functions:

$$\begin{aligned} m(t)= & {} \mathbb {E}\left[ f(t)\right] \end{aligned}$$
(3)
$$\begin{aligned} k(t,t')= & {} \mathbb {E}\left[ \left( f(t)-m(t)\right) \left( f(t')-m(t')\right) \right] \end{aligned}$$
(4)

where f(t) is the underlying process, m(t) encodes the prior knowledge of its mean, and \(k(t,t')\) is a symmetric, positive semi-definite function (usually referred to as the kernel) that must be specified. We are interested in incorporating the knowledge that the training data \(\mathcal {D}=\left\{ \left( t_i,y_i\right) \right\} ^N_{i=1}\) provides about f(t). We assume that direct observations of f(t) are not available, only noisy versions y.

Let \(\textbf{m}(t)\) be the vector of the mean function evaluated at all training points t and \(K(t,t^*)\) be the matrix of the covariances evaluated at all pairs of training and prediction points \(t^*\). Assuming additive independent identically distributed Gaussian noise with variance \(\sigma _n^2\), we can write the joint distribution of the observed target values \(\textbf{y}\) and the function values at the test locations \(\textbf{f}^*\) under the prior as (Nelles 2020):

$$\begin{aligned} \left[ \begin{array}{c} \textbf{y} \\ \textbf{f}^* \end{array}\right] \sim \mathcal {N}\left( \left[ \begin{array}{c} \textbf{m}(t) \\ \textbf{m}(t^*) \end{array}\right] ,\left[ \begin{array}{cc} K(t,t)+\sigma _n^2I &{} K(t,t^*) \\ K(t^*,t) &{} K(t^*,t^*) \end{array}\right] \right) \end{aligned}$$
(5)

The posterior distribution over functions can be computed by conditioning the joint Gaussian prior distribution on the observations \(p\left( \textbf{f}^*|t,\textbf{y},t^*\right) \sim \mathcal {N}\left( \varvec{\mu }^*,\varvec{\Sigma }^*\right)\) where (Nelles 2020):

$$\begin{aligned} \varvec{\mu }^*= & {} \textbf{m}\left( t^*\right) +K\left( t^*,t\right) \left[ K(t,t)+\sigma _n^2I\right] ^{-1}\left[ \textbf{y}-\textbf{m}(t)\right] \end{aligned}$$
(6)
$$\begin{aligned} \varvec{\Sigma }^*= & {} K\left( t^*,t^*\right) -K\left( t^*,t\right) \left[ K(t,t)+\sigma _n^2I\right] ^{-1}K\left( t,t^*\right) \end{aligned}$$
(7)

When only one output variable is predicted, \(k(t,t')\) is a scalar function. The previous concepts can be extended to multiple-output GP (MOGP) by taking a matrix covariance function \(\textbf{k}(t,t')\). Most approaches to MOGP modelling are formulated around the Linear Model of Coregionalization (LMC) (Alvarez et al. 2012). For a d-dimensional output the kernel is expressed in the following form:

$$\begin{aligned} \textbf{B}\otimes \textbf{k}\left( t,t'\right) =\left[ \begin{array}{ccc} B_{11}k_{11}\left( t_1,t_1'\right) &{} \ldots &{} B_{1d}k_{1d}\left( t_1,t_d'\right) \\ \vdots &{} \ddots &{} \vdots \\ B_{d1}k_{d1}\left( t_d,t_1'\right) &{} \ldots &{} B_{dd}k_{dd}\left( t_d,t_d'\right) \end{array}\right] \end{aligned}$$
(8)

where \(\textbf{B}\in \mathbb {R}^{d\times d}\) is regarded as the coregionalization matrix and \(t_i\) represents the input corresponding to the i-th output. Diagonal elements correspond to the single-output case, while the off-diagonal elements represent the prior assumption on the covariance of two different output dimensions (Liu et al. 2018).

If no a-priori assumption is made, \(B_{ij}=0\) for \(i\ne j\) and the MOGP is equivalent to d independent GP. Regarding the form of \(k(t,t')\), kernel families typically have free hyperparameters \(\Theta\), which can be determined by maximizing the log marginal likelihood (Rasmussen and Williams 2006):

$$\begin{aligned} \log {p\left( \textbf{y}|t,\Theta \right) }=-\frac{1}{2}\textbf{y}^TK_y^{-1}\textbf{y}-\frac{1}{2}\log \left| K_y\right| -\frac{N}{2}\log 2\pi \end{aligned}$$
(9)

where \(K_y=K(t,t)+\sigma _n^2I\). This problem might suffer from local optima.
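For concreteness, the following minimal NumPy sketch implements the predictive Eqs. (6)-(7) and the log marginal likelihood of Eq. (9) for a single-output GP with the RBF kernel introduced later in Eq. (15). The zero prior mean and the helper names are illustrative assumptions; in practice the hyperparameters are obtained by maximizing the returned log marginal likelihood.

```python
# Minimal sketch of single-output GP regression (Eqs. 5-7 and 9), zero prior mean.
import numpy as np

def rbf(t, tp, sigma_f, l):
    d = t[:, None] - tp[None, :]
    return sigma_f**2 * np.exp(-0.5 * d**2 / l**2)

def gp_posterior(t, y, t_star, sigma_f, l, sigma_n):
    K   = rbf(t, t, sigma_f, l) + sigma_n**2 * np.eye(len(t))   # K(t,t) + sigma_n^2 I
    Ks  = rbf(t_star, t, sigma_f, l)                             # K(t*, t)
    Kss = rbf(t_star, t_star, sigma_f, l)                        # K(t*, t*)
    L     = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # K_y^{-1} y
    mu    = Ks @ alpha                                           # Eq. (6) with m(t) = 0
    v     = np.linalg.solve(L, Ks.T)
    cov   = Kss - v.T @ v                                        # Eq. (7)
    logml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
             - 0.5 * len(t) * np.log(2 * np.pi))                 # Eq. (9)
    return mu, cov, logml
```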

4.2 Rigid-body motion representation

In the LfD context, representation of trajectories in task space is usually required. However, the modelling of rotations is not straightforward with GP, since the standard formulation is defined for an underlying Euclidean space. A common approach is to use the Euler angles and exploit that, locally, the rotation group \(SO(3) \simeq \mathbb {R}^3\), allowing distances to be computed as Euclidean. However, when this approximation is no longer valid (e.g. at low sampling frequency or if the collected data is sparse), it might lead to inaccurate predictions. To overcome this issue, as proposed in Lang and Hirche (2017), by Euler's fixed point theorem (Palais and Palais 2007) rotations can instead be parametrized by a unit-length rotation axis \(\textbf{u}\) together with a rotation angle \(\theta\):

$$\begin{aligned} SO(3)\subset \left\{ \theta \textbf{u}\in \mathbb {R}^3/\Vert \textbf{u}\Vert =1\wedge \theta \in \left[ 0,\pi \right] \right\} \end{aligned}$$
(10)

This set defines the solid ball \(B_\pi (0)\) in \(\mathbb {R}^3\) with radius \(0\le r\le \pi\) which is closed, dense and compact. Ambiguity in the representation occurs for \(\theta =\pi\). To obtain an isomorphism between the rotation group SO(3) and the axis-angle representation, we fix the axis representation for \(\theta =\pi\):

$$\begin{aligned} \tilde{B}_{\pi }(0)=B_{\pi }(0)\setminus \{\pi \textbf{u}/u_z<0\;\vee \left( u_z=0\wedge u_y<0 \right) \vee \left( u_z=u_y=0\wedge u_x<0 \right) \} \end{aligned}$$
(11)

where \(\textbf{u}=\left( u_x,u_y,u_z\right)\). This parametrization is minimal and unique, with \(SO(3)\simeq \tilde{B}_{\pi }(0)\).

A rigid-body motion is given by a mapping from time to translation and rotation, \(h:\mathbb {R}\longrightarrow SE(3)\). Let the translational components be defined by the Euclidean vector \(\textbf{v}\in \mathbb {R}^3\). Then SE(3) is represented isomorphically by \(SE(3)\simeq \mathbb {R}^3\times \tilde{B}_{\pi }(0)\). Thus, rigid body motion can be represented in MOGP with the 6-dimensional output vector structure \(\left( \textbf{v},\theta \textbf{u}\right) =\left( x,y,z,\theta u_x,\theta u_y, \theta u_z\right)\).

Another, more accurate representation can be achieved with dual quaternions (Lang et al. 2015). However, as shown in Lang and Hirche (2017), the proposed parametrization attains good performance with more efficient computations.
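As an illustrative sketch, assuming the demonstrated orientations are recorded as unit quaternions with the \((w, x, y, z)\) convention, the mapping onto the \(\theta \textbf{u}\) output vector of Eqs. (10)-(11) could be computed as follows; the function name is hypothetical.

```python
# Sketch: map a unit quaternion to the theta*u vector in the ball B_pi(0).
import numpy as np

def quat_to_axis_angle(q, eps=1e-12):
    """q = (w, x, y, z), unit norm. Returns theta*u with theta in [0, pi]."""
    w, v = q[0], np.asarray(q[1:], dtype=float)
    if w < 0:                                   # enforce theta in [0, pi]
        w, v = -w, -v
    theta = 2.0 * np.arccos(np.clip(w, -1.0, 1.0))
    s = np.linalg.norm(v)
    u = v / s if s > eps else np.array([1.0, 0.0, 0.0])
    if np.isclose(theta, np.pi):                # resolve the ambiguity at theta = pi (Eq. 11)
        if u[2] < 0 or (u[2] == 0 and u[1] < 0) or (u[2] == 0 and u[1] == 0 and u[0] < 0):
            u = -u
    return theta * u
```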

4.3 Heteroscedastic Gaussian process

The standard Gaussian Process model assumes a constant noise level. This can be an important limitation when encoding a manipulation task. Consider the example shown in Fig. 2: it is evident that while the initial and final positions are highly constrained, that is not the case for the path followed between them. In graphs (a) and (b) we can see that the standard approach accurately represents the mean of the demonstrations, but not their variability.

Fig. 2

Standard GP do not accurately model the variability of the demonstrated task as can be seen in (a, b), where it is underestimated and overestimated, respectively. On the other hand, the heteroscedastic GP approach, in (c), encodes the variability in the different phases of the task, considering the local noise in (d)

Consider an independent, normally distributed noise \(\lambda \sim \mathcal {N}\left( 0,r(t)\right)\), where the variance is input-dependent and modeled by r(t). The mean and covariance of the predictive distribution are then modified to (Goldberg et al. 1998):

$$\begin{aligned} \varvec{\mu }^*= & {} \textbf{m}(t^*)+K(t^*,t)\left[ K(t,t)+R(t)\right] ^{-1}\left[ \textbf{y}-\textbf{m}(t)\right] \end{aligned}$$
(12)
$$\begin{aligned} \varvec{\Sigma }^*= & {} K(t^*,t^*)+R(t^*)-K(t^*,t)\left[ K(t,t)+R(t)\right] ^{-1}K(t,t^*) \end{aligned}$$
(13)

where R(t) is a diagonal matrix, with elements r(t).

Taking into account the input-dependent noise shown in Fig. 2d, the variability in the different phases of the manipulation task is effectively encoded in Fig. 2c. This approach is commonly referred to as the heteroscedastic Gaussian Process. Its main limitation is the trade-off between the accuracy of the latent noise function estimate, for which more demonstrations are preferred, and the computational complexity of the learning algorithm.
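A minimal sketch of the heteroscedastic predictive Eqs. (12)-(13) is given below. The kernel and the latent noise function r(t) are assumed to be given, and a zero prior mean is used for brevity; both are illustrative simplifications.

```python
# Sketch of Eqs. (12)-(13): constant noise is replaced by diagonal R(t), R(t*).
import numpy as np

def hetero_gp_posterior(t, y, t_star, r, kernel):
    """r: callable returning the input-dependent noise variances r(t)."""
    K   = kernel(t, t)           + np.diag(r(t))        # K(t,t)   + R(t)
    Ks  = kernel(t_star, t)                              # K(t*,t)
    Kss = kernel(t_star, t_star) + np.diag(r(t_star))    # K(t*,t*) + R(t*)
    Kinv_y = np.linalg.solve(K, y)
    mu  = Ks @ Kinv_y                                     # Eq. (12), zero prior mean
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)             # Eq. (13)
    return mu, cov
```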

5 Learning from demonstration framework

In this section, we present the proposed GP-based LfD framework. First, we formalize the problem of learning manipulation skills from demonstrated trajectories. Then, we propose an approach for encoding the learned policy with GP. Next, we discuss the temporal alignment of demonstrations. We also present a method that allows adapting the learned policy through via-points. Finally, we study how the uncertainty model of GP can be exploited to stably modulate the robot behavior by varying the end-effector virtual dynamics.

5.1 Problem statement

In LfD we assume that a dataset of demonstrations is available. In the trajectory-learning case, the dataset consists of a set of trajectories \(\textbf{s}\) together with a timestamp \(t\in \mathbb {R}\), \(\mathcal {D}=\left\{ \left( t_i,\textbf{s}_i\right) \right\} ^N_{i=1}\).

Without loss of generality, we will consider \(\textbf{s}_i\in SE(3)\). The aim is to learn a policy \(\pi\) that infers, for a given time, the desired end-effector pose \(\textbf{s}^d_i\) to perform the taught manipulation task: \(\textbf{s}^d_i=\pi (t_i)\). The policy must generate continuous and smooth paths, and generalize over multiple demonstrations.

5.2 Manipulation task representation with GP

Representing a manipulation task using heteroscedastic GP models requires the specification of m(t), \(k(t,t')\) and r(t). As we have discussed in Sect. 4.2, a suitable mapping for representing a trajectory is given by the following MOGP:

$$\begin{aligned} \pi (t)\sim \mathcal{G}\mathcal{P}\left( \varvec{\mu }^*,\varvec{\Sigma }^*\right) : t\longrightarrow \left( x,y,z,\theta u_x,\theta u_y, \theta u_z\right) \end{aligned}$$
(14)

The prior mean function is commonly defined as \(m(t)=0\); although not necessary in general, this is a convenient simplification when no prior knowledge is available. The GP covariance function controls the shape of the policy function, and the chosen kernel must generate continuous and smooth paths. Note also that the time parametrization of trajectories is invariant to translations in the time domain; thus, the covariance function must be stationary, i.e. a function of \(\tau =t-t'\). The Radial Basis Function (RBF) kernel fulfils all these requirements (Nelles 2020):

$$\begin{aligned} k(t,t')=\sigma _f^2\exp \left( -\frac{\left[ t-t'\right] ^2}{2l^2}\right) \end{aligned}$$
(15)

with hyperparameters l and \(\sigma _f\).

Moreover, for multidimensional outputs, we have to consider the prior interaction. In the general case, we usually do not have any previous knowledge about how the different components of the demonstrated trajectories relate to each other. Thus, we can assume that the six components are independent a-priori. The matrix covariance function can then be written as (Nelles 2020):

$$\begin{aligned} \textbf{k}(t,t')=\text {diag}\left( \sigma _{f1}^2e^{-\left[ t-t'\right] ^2/\left( 2l_1^2\right) },\dots ,\sigma _{f6}^2e^{-\left[ t-t'\right] ^2/\left( 2l_6^2\right) }\right) \end{aligned}$$
(16)

where \(\text {diag}()\) denotes a diagonal matrix, and \(l_i\) and \(\sigma _{fi}\) are the hyperparameters of output dimension i.
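Since the outputs are assumed independent a-priori, the MOGP of Eq. (16) reduces to six single-output GPs sharing the same time input. A minimal sketch, reusing the gp_posterior helper from the sketch in Sect. 4.1 and assuming the per-dimension hyperparameters have already been optimized, could look as follows.

```python
# Sketch: the 6-D pose policy as six independent GPs over time (Eq. 16).
import numpy as np

def pose_policy_posterior(t, Y, t_star, hypers):
    """Y: (6, N) pose outputs; hypers: list of 6 (sigma_f, l, sigma_n) tuples."""
    means, variances = [], []
    for dim in range(6):
        sigma_f, l, sigma_n = hypers[dim]
        mu, cov, _ = gp_posterior(t, Y[dim], t_star, sigma_f, l, sigma_n)
        means.append(mu)
        variances.append(np.diag(cov))
    return np.stack(means), np.stack(variances)   # each of shape (6, len(t_star))
```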

In Sect. 4.3 we discussed the convenience of specifying an input-dependent noise function r(t) for encoding the manipulation skill with GP. Usually, it is not known a-priori and must be inferred from the demonstrations. As proposed in Kersting et al. (2007), a standard GP is first fit to the data, and its predictions are used to estimate the input-dependent noise empirically. Then, a second independent GP is used to model \(z(t)=\log \left[ r(t)\right]\). Let \(\mathcal {Z}\) denote the set of noise data \(\textbf{z}=\left\{ z_i\right\} _{i=1}^n\) together with its predictions \(\textbf{z}^*\). The posterior predictive distribution can then be approximated by:

$$\begin{aligned} p\left( \textbf{f}^*|\mathcal {D},t^*\right) =\iint p\left( \textbf{f}^*|\mathcal {D},\mathcal {Z},t^*\right) p\left( \mathcal {Z}|\mathcal {D},t^*\right) \,d\textbf{z}\,d\textbf{z}^*\simeq p\left( \textbf{f}^*|\mathcal {D},\mathcal {Z},t^*\right) \end{aligned}$$
(17)

where

$$\begin{aligned} \mathcal {Z}=\mathop {\mathrm {arg\,max}}\limits _{\textbf{z},\textbf{z}^*}p\left( \textbf{z},\textbf{z}^*|\mathcal {D},t^*\right) \end{aligned}$$
(18)

At this point we have specified all the required functions of the model.
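The following rough sketch illustrates this two-GP procedure in the spirit of Kersting et al. (2007), reusing the rbf, gp_posterior and hetero_gp_posterior helpers from the previous sketches. The fixed number of iterations, the use of squared residuals as empirical noise targets, and the interpolation of the latent noise are simplifying assumptions, not the exact algorithm.

```python
# Rough sketch: iterative estimation of the latent noise function r(t).
import numpy as np

def estimate_latent_noise(t, y, t_grid, hypers, noise_hypers, n_iter=3):
    sigma_f, l, sigma_n = hypers
    r = lambda tt: np.full(len(tt), sigma_n**2)               # start homoscedastic
    for _ in range(n_iter):
        mu, _ = hetero_gp_posterior(t, y, t, r,
                                    lambda a, b: rbf(a, b, sigma_f, l))
        z = np.log((y - mu)**2 + 1e-12)                       # empirical log-noise targets
        zf, zl, zn = noise_hypers
        z_mu, _, _ = gp_posterior(t, z, t_grid, zf, zl, zn)   # second GP on z(t) = log r(t)
        r_grid = np.exp(z_mu)
        r = lambda tt, tg=t_grid, rg=r_grid: np.interp(tt, tg, rg)
    return r                                                  # callable r(t)
```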

5.3 Temporal alignment of demonstrations

To infer a time-dependent policy, the correlation between the temporal and spatial coordinates of two demonstrations of the same task must remain constant. In general, it is very difficult for a human to repeat demonstrations at the same velocity; thus, a time distortion appears (Fig. 3a) that should be adequately corrected. Dynamic Time Warping (DTW) (Senin 2008) is a well-known algorithm for finding the optimal match between two temporal sequences, which may vary in speed.

Fig. 3

In (a) we observe that due to distortion in time, task constraints are not encoded correctly. In (b) the trajectories are aligned with DTW using the Euclidean distance as similarity measure. In (c) we show the resulting alignment using the proposed TCI (d), as similarity measure

The algorithm finds a non-linear mapping between the demonstrated trajectories and a reference, based on a similarity measure. A common measure in the LfD context is the Euclidean distance, which relies on the assumption that the manipulation task is always performed following the same path. For instance, consider a pick-and-place task where objects have to be placed on shelves at different levels (Fig. 3).

Using the Euclidean distance as similarity measure will lead to an erroneous temporal alignment (Fig. 3b), since intermediate points for placing the object at a higher level can be mapped to ending points of a lower level. We propose instead to use, as similarity measure, an index that captures the portion of the trajectory covered towards task completion. We refer to it as the Task Completion Index (TCI), defined in discrete form as:

$$\begin{aligned} \zeta (t_k)=\frac{\sum _{j=1}^{k}d(\textbf{s}_{j},\textbf{s}_{j-1})}{\sum _{j=1}^{M}d(\textbf{s}_{j},\textbf{s}_{j-1})}\quad \quad \forall \ k=1,\dots M\end{aligned}$$
(19)

where \(\textbf{s}_{j}\in SE(3)\) is the trajectory point at time instant \(t_j\), d(, ) is a scalar distance function, and M is the total number of discrete points. Note that \(0=\zeta (t_0)\le \zeta (t_k)\le \zeta (t_M)=1\). As a distance function on SE(3), using the representation discussed in Sect. 4.2, we define:

$$\begin{aligned} d(\textbf{s}_{i},\textbf{s}_{j})=\sqrt{\omega _1\left[ d_{arc}(\theta _{i}\textbf{u}_{i},\theta _{j}\textbf{u}_{j})\right] ^2+\omega _2\Vert \textbf{v}_{i}-\textbf{v}_{j}\Vert ^2} \end{aligned}$$
(20)

where the weights \(\omega _k\) form a convex combination for application-dependent scaling and \(d_{arc}(,)\) is the length of the geodesic between rotations (Lang and Hirche 2017):

$$\begin{aligned} d_{arc}(\theta _{i}\textbf{u}_{i},\theta _{j}\textbf{u}_{j})=2\arccos \left| \cos \frac{\theta _{i}}{2}\cos \frac{\theta _{j}}{2}+\sin \frac{\theta _{i}}{2}\sin \frac{\theta _{j}}{2}\textbf{u}_i^T\textbf{u}_j\right| \end{aligned}$$
(21)

In Fig. 3c we show that, with the proposed TCI (Fig. 3d), the trajectories are warped correctly, thus allowing an effective encoding of the manipulation task.
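A minimal sketch of the TCI of Eq. (19), together with a basic DTW alignment that uses it as similarity measure, is given below. The quadratic-time DTW implementation and the omission of path backtracking are simplifications for illustration; the SE(3) distance of Eq. (20) is assumed to be provided as a callable.

```python
# Sketch: Task Completion Index (Eq. 19) and a basic DTW on TCI profiles.
import numpy as np

def tci(traj, dist):
    """traj: sequence of poses s_j; dist: SE(3) distance of Eq. (20)."""
    steps = [dist(traj[j], traj[j - 1]) for j in range(1, len(traj))]
    cum = np.concatenate(([0.0], np.cumsum(steps)))
    return cum / cum[-1]                                  # zeta(t_k) in [0, 1]

def dtw_cost(zeta_demo, zeta_ref):
    """Accumulated-cost matrix of DTW using |zeta_i - zeta_j| as local cost."""
    n, m = len(zeta_demo), len(zeta_ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(zeta_demo[i - 1] - zeta_ref[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D          # the warping path is obtained by backtracking from D[n, m]
```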

5.4 Policy adaptation through via-points

The modulation of the learned policy through via-points is an important property to adapt to new situations. Let \(\mathcal {V}=\left\{ \left( t_i,\textbf{s}_i^v\right) \right\}\) be the set of via-points \(\textbf{s}_i^v\) that the policy should reach at time instants \(t_i\). In the proposed probabilistic framework, generalization can be implemented by conditioning the policy on both \(\mathcal {D}\) and \(\mathcal {V}\). Assuming that the predictive distribution of each set can be computed independently, the conditioned policy is (Deisenroth and Ng 2015):

$$\begin{aligned} p\left( \textbf{f}^*|\mathcal {D},\mathcal {V},t^*\right) =p\left( \textbf{f}^*|\mathcal {D},t^*\right) p\left( \textbf{f}^*|\mathcal {V},t^*\right) \end{aligned}$$
(22)

If \(p\left( \textbf{f}^*|\mathcal {D},t^*\right) \sim \mathcal {N}\left( \mu ^d,\Sigma ^d\right)\) and \(p\left( \textbf{f}^*|\mathcal {V},t^*\right) \sim \mathcal {N}\left( \mu ^v,\Sigma ^v\right)\), then, it holds that \(p\left( \textbf{f}^*|\mathcal {D},\mathcal {V},t^*\right) \sim \mathcal {N}\left( \mu ^{**},\Sigma ^{**}\right)\), where:

$$\begin{aligned} \mu ^{**}=\Sigma ^v\left( \Sigma ^d+\Sigma ^v\right) ^{-1}\mu ^d+\Sigma ^d\left( \Sigma ^d+\Sigma ^v\right) ^{-1}\mu ^v\end{aligned}$$
(23)
$$\begin{aligned} \Sigma ^{**}=\Sigma ^d\left( \Sigma ^d+\Sigma ^v\right) ^{-1}\Sigma ^v\end{aligned}$$
(24)

The resulting distribution is computed as a product of Gaussians, and is a compromise between the via-point constraints and the demonstrated trajectories, weighted inversely by their variances.
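A direct sketch of the fusion of Eqs. (23)-(24), assuming the two predictive distributions have already been evaluated at the same test times for a given output dimension, is the following; the function name is illustrative.

```python
# Sketch: product-of-Gaussians conditioning on via-points (Eqs. 23-24).
import numpy as np

def condition_on_viapoints(mu_d, Sig_d, mu_v, Sig_v):
    """(mu_d, Sig_d): demonstration-based prediction; (mu_v, Sig_v): via-point-based."""
    S   = Sig_d + Sig_v
    mu  = Sig_v @ np.linalg.solve(S, mu_d) + Sig_d @ np.linalg.solve(S, mu_v)  # Eq. (23)
    Sig = Sig_d @ np.linalg.solve(S, Sig_v)                                     # Eq. (24)
    return mu, Sig
```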

Considering a heteroscedastic GP model for \(\mathcal {V}\) (Eqs. 12 and 13), the strength of the via-point constraints can be easily specified by means of the latent noise function. For instance, via-points with low noise have a higher relative weight and significantly modify the learned policy, whereas via-points with a high noise level produce a more subtle effect. In Fig. 4 we illustrate how the distribution adapts to strongly and weakly defined via-points.

Fig. 4

On the left, a GP model based on the demonstrated trajectories. On the right, the policy adapted through via-points

It should be remarked that the posterior predictive distribution of \(\mathcal {D}\) only needs to be computed once. Thus, adaptation of the policy just involves a computational cost of \(\mathcal {O}\left( m^3\right)\), where m is the number of predicted outputs. Since m can be specified, the proposed approach is suitable for on-line applications [for further insight on GP complexity see Bilj (2018)].

5.5 Modulation of the robot behavior

In LfD it is often convenient to adapt the behavior of the robot as a function of the uncertainty in the different phases of the task (Suomalainen et al. 2022). Let the robot end-effector be controlled through spring-mass-damper model dynamics (Abu-Dakka and Saveriano 2020):

$$\begin{aligned} {\textbf {M}}\left( t\right) {\varvec{\ddot{{\textbf {e}}}}}\left( t\right) +{\textbf {D}}\left( t\right) {\varvec{\dot{{\textbf {e}}}}}\left( t\right) +{\textbf {K}}_p\left( t\right) {\textbf {e}}\left( t\right) ={\textbf {F}}_{\textbf {ext}}\left( t\right) \end{aligned}$$
(25)

where \({\textbf {M}}\left( t\right) ,{\textbf {D}}\left( t\right) ,\textbf{K}_p\left( t\right) \in \mathbb {R}^{6\times 6}\) refer to inertia, damping and stiffness, respectively, and \(\textbf{e}\left( t\right) \in \mathbb {R}^{6\times 1}\) is the tracking error, when subjected to an external force \(\mathbf {F_{ext}}\left( t\right) \in \mathbb {R}^{6\times 1}\). It can be proved [see Kronander and Billard (2016)] that for a constant, symmetric, positive definite \(\textbf{M}\), and \(\textbf{D}\left( t\right)\), \(\textbf{K}_p\left( t\right)\) continuously differentiable, the system is globally asymptotically stable if there exists a \(\gamma >0\) such that:

1. \(\gamma \,\textbf{M}-\textbf{D}\left( t\right)\) is negative semidefinite

2. \({\varvec{\dot{{\textbf {K}}}}}_p\left( t\right) +\gamma \,{\varvec{\dot{{\textbf {D}}}}}\left( t\right) -2\gamma \,\textbf{K}_p\left( t\right)\) is negative definite

Without loss of generality, we can assume that \(\textbf{M}\), \(\textbf{D}\left( t\right)\), and \(\textbf{K}_p\left( t\right)\) are diagonal matrices, since they can always be expressed in a suitable reference frame. Therefore, the system can be uncoupled into six independent scalar systems. Now consider a constant damping ratio \(\delta\). Substituting \(d\left( t\right) =2\delta \sqrt{m\,k_p\left( t\right) }\), where m, \(d\left( t\right)\) and \(k_p(t)\) are arbitrary diagonal elements of \(\textbf{M}\), \(\textbf{D}\left( t\right)\) and \(\textbf{K}_p\left( t\right)\), respectively, into the second stability condition yields the following upper bound for the stiffness derivative:

$$\begin{aligned} {\varvec{\dot{{\textbf {k}}}}}_p\left( t\right) <\frac{2\gamma \sqrt{k_p\left( t\right) ^3}}{\sqrt{k_p\left( t\right) }+2\delta \,\gamma \sqrt{m}}\end{aligned}$$
(26)

In order to modulate the robot behavior, we propose the following variable stiffness profile:

$$\begin{aligned} k_p(t)=k_p^{max}-\frac{k_p^{max}-k_p^{min}}{1+e^{-\alpha \left( \sigma (t)-\beta \right) }}\end{aligned}$$
(27)

which varies the stiffness inversely with the uncertainty \(\sigma (t)\), saturating at \(k_p^{min}\) and \(k_p^{max}\) for high and low values, respectively. Note that higher values of the design parameter \(\alpha\) give a faster transition between stiff and compliant robot behavior, while \(\beta\) determines the threshold value of \(\sigma (t)\) at which the transition starts. Differentiating, we have:

$$\begin{aligned} {\varvec{\dot{{\textbf {k}}}}}_p(t)=\alpha k_p(t)\left( 1-\frac{k_p(t)}{k_p^{max}-k_p^{min}}\right) \frac{d\sigma (t)}{dt}\end{aligned}$$
(28)

For a constant \(d\sigma (t)/dt\), the maximum value of the stiffness derivative \({\varvec{\dot{{\textbf {k}}}}}_p(t)\) is obtained for \(k_p(t)=\left( k_p^{max}-k_p^{min}\right) /2\). Thus, substituting in (28) yields the following upper bound:

$$\begin{aligned} {\varvec{\dot{{\textbf {k}}}}}_p(t)\le \frac{\alpha }{4}\left( k_p^{max}-k_p^{min}\right) \frac{d\sigma (t)}{dt} \end{aligned}$$
(29)

Then, from inspection of the first stability condition, we can see that \(\gamma\) defines a lower bound for the minimum allowed damping d(t).

Given the variable stiffness profile in Eq. 27, and assuming a constant damping ratio, the most restrictive value is \(\gamma =2\delta \sqrt{k_p^{min}/m}\). Substituting into (26), we can lower-bound its right-hand side, obtaining the conservative condition:

$$\begin{aligned} {\varvec{\dot{{\textbf {k}}}}}_p(t)<\frac{4\delta \sqrt{\left( k_p^{min}\right) ^3}}{\left( 1+4\delta ^2\right) \sqrt{m}} \le \frac{2\gamma \sqrt{k_p\left( t\right) ^3}}{\sqrt{k_p\left( t\right) }+2\delta \,\gamma \sqrt{m}} \end{aligned}$$
(30)

Then, from equations (29) and (30) the following sufficient stability condition can be derived:

$$\begin{aligned} \frac{d\sigma (t)}{dt}<\frac{16\delta }{\alpha }\frac{\sqrt{\left( k_p^{min}\right) ^3}}{\left( k_p^{max}-k_p^{min}\right) \left( 1+4\delta ^2\right) \sqrt{m}}\end{aligned}$$
(31)

The control parameters can then be tuned to ensure the satisfaction of this inequality. Note that sharper uncertainty profiles \(\sigma (t)\) are more restrictive with respect to variations of the stiffness. For instance, stability is favored by a smaller range \(\left( k_p^{max}-k_p^{min}\right)\) or lower values of \(\alpha\), i.e. slower transition between stiff and compliant behaviors. For the limit cases \(k_p^{max}\longrightarrow k_p^{min}\) and \(\alpha \longrightarrow 0\), that is, constant stiffness, stability can be achieved regardless of \(\sigma (t)\). It can also be observed, since the right-hand side of the inequality is always positive, that with the proposed variable stiffness profile, stability is ensured if the uncertainty decreases.
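For illustration, the stiffness profile of Eq. (27) and the bound of Eq. (31) can be evaluated as follows; the default parameter values are those used later in Sect. 6.2 and are otherwise arbitrary.

```python
# Sketch: variable stiffness profile (Eq. 27) and sufficient stability bound (Eq. 31).
import numpy as np

def stiffness(sigma, k_min=100.0, k_max=500.0, alpha=600.0, beta=0.01):
    """Stiffness as a decreasing sigmoid of the task uncertainty sigma(t)."""
    return k_max - (k_max - k_min) / (1.0 + np.exp(-alpha * (sigma - beta)))

def sigma_rate_bound(k_min=100.0, k_max=500.0, alpha=600.0, delta=1.0, m=1.0):
    """Upper bound on d(sigma)/dt that guarantees the condition of Eq. (31)."""
    return (16.0 * delta / alpha) * np.sqrt(k_min**3) / (
        (k_max - k_min) * (1.0 + 4.0 * delta**2) * np.sqrt(m))

# Usage: stability holds if np.gradient(sigma, t).max() < sigma_rate_bound().
```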

6 An example application: door opening task

In order to test the proposed GP-based LfD approach, we applied it to the real-world task of opening doors using a TIAGo robot. This is a relevant skill for robots operating in domestic environments (Kim et al. 2004), since they need to open doors when navigating, when picking up objects in fetch-and-carry applications, or when assisting people in their mobility.

6.1 Policy inference from human demonstrations

Fig. 5

Demonstrations were recorded using an Xsens MVN motion capture system. The teacher opens three doors with different radii

We recorded human demonstrations using an Xsens MVN motion capture system. Right-hand trajectories of the human teacher, relative to the initial closed-door position, were recorded for three different doors (Fig. 5).

Fig. 6

Dataset of demonstrated right-hand trajectories

Fig. 7

Inference of the door opening policy from human demonstrations. The outputs are the position, defined by (x, z), and the orientation, defined by \(\theta u_z\) in the axis-angle representation. The left column shows the demonstrations of the door opening motion; the middle column, the temporally aligned trajectories; and the right column, the inferred Gaussian Process policy, where the dark and light shaded areas correspond to the 63% and 95% confidence intervals

Coordinate axes were chosen such that the pulling direction is parallel to the x axis and the y axis is perpendicular to the floor. The demonstration dataset consisted of a total of 6 trajectories, two per door (Fig. 6).

Fig. 8

Door opening policy projected on the \(x-z\) plane

The main steps of the learning process of the door opening policy are illustrated in Fig. 7. The rotation component is encoded using the axis-angle representation. The demonstrated trajectories are aligned with the Dynamic Time Warping algorithm using the Task Completion Index. We can see that the trajectories are warped effectively, since they are clearly clustered in three different groups, one for each type of door. Once the trajectories are aligned, we infer the task policy by training a heteroscedastic Gaussian Process model on the demonstration data.

We can observe that the model effectively captures the door opening skill. This is clearer in Fig. 8, where the task uncertainty has been projected onto the x-z plane. In this case, the variability in the task comes from the uncertainty in the radius of the door, which is reflected in the resulting policy.

6.2 Policy adaptation and modulation of the robot behavior

During the execution of the task, we can exploit observations of the motion of the door currently being opened to adapt the learned policy. Specifically, we can gather these data by solving the forward kinematics of the robot and use them to define a set of via-point constraints. By updating this set at each time step, we can adapt the motion online to the current task requirements. To quantitatively evaluate the performance of the adaptive policy against the one based solely on the demonstrations, we use the mean squared prediction error (MSPE). Assuming that there exists a ground-truth policy \(\widetilde{\pi }()\), which is the case when opening a door, the MSPE summarizes the predictive ability of the model. Ideally, this value should be close to zero:

$$\begin{aligned} MSPE=E\left[ \varepsilon ^2\right] =\left( E\left[ {\pi }(t)\right] -\widetilde{\pi }(t)\right) ^2+V\left[ \pi (t)\right] \end{aligned}$$
(32)

where E[] and V[] denote the expectation and the variance, respectively. The evolution of the adaptive policy and the MSPE during the execution of the door opening motion is shown in Fig. 9.

Fig. 9

a Evolution of the posterior predictive distribution considering as via-points the observations of the door motion in the light-blue shaded area. b The first row shows the comparison between the predictive distributions of the adaptive policy and of the policy based only on human demonstrations. The second row shows the mean squared prediction error (32) of each policy

We can see that, by conditioning on the current observations of the door, we are able to reduce the task uncertainty in the near future, with the mean also converging to the ground truth. This translates into better performance in terms of the MSPE, as shown in Fig. 9b, where it is reduced by almost two orders of magnitude in the final stages of the task. With the proposed approach we are able to successfully open the door (Fig. 10).

Fig. 10

TIAGo robot opening the door

Fig. 11

On top, variable stiffness profile for x and z. Below, the evolution of the uncertainty derivative of the adaptive policy

The resulting variable stiffness profile is shown in Fig. 11. We tuned the parameters empirically, the values used being \(k_p^{max}=500\), \(k_p^{min}=100\), \(m=1\), \(\delta =1\), \(\alpha =600\) and \(\beta =0.01\). For simplicity, we considered the same law for all 6 degrees of freedom. We can observe that the robot is modulated towards a more compliant behavior in the final phases, where the policy is more uncertain. We can also see that the stability bound is never crossed, which is consistent with the conducted experiments, where no instabilities occurred.

7 Conclusions

We propose a heteroscedastic multi-output GP policy representation, inferred from demonstrations.

This model considers a suitable parametrization of task space rotations for GP and ensures that only continuous and smooth paths are generated. The introduction of an input-dependent latent noise function allows an effective simultaneous encoding of the prediction uncertainty and the variability of demonstrated trajectories.

In order to establish a correlation between temporal and spatial coordinates, demonstrations must be aligned. We introduce the novel Task Completion Index, a similarity measure that achieves an effective warping when the learned task involves different paths.

Adaptation of the policy can be performed by conditioning it on a set of specified via-points; we introduce a computationally efficient method in which the relative importance of the constraints can also be defined. Additionally, we propose an innovative variable stiffness profile that takes advantage of the uncertainty measure provided by the GP model to stably modulate the robot end-effector dynamics.

We applied the proposed learning from demonstration framework to the door opening task and evaluated the performance of the learned policy through real-world experiments with the TIAGo robot. Results show that the manipulation skill is effectively encoded and a successful reproduction can be achieved by taking advantage of the policy adaptation and robot behavior modulation approaches.

In future works we intend to improve the scalability of the learning algorithm by exploiting the structure of replications, and the adaptability of the model by incorporating task variables. This would allow us to apply our method for learning complex robot skills, such as cloth manipulation.