1 Introduction

Robotic cloth manipulation has a wide range of applications, from the textile industry to assistive robotics [5, 8, 14, 19, 23, 29]. However, the complexity of cloth behavior results in high uncertainty in the state transition produced by a given action; this uncertainty is what makes manipulating cloth much more challenging than handling rigid objects. Intuitively, learning the cloth's dynamics is a natural way to reduce such uncertainty. The literature offers several cloth models that simulate the internal cloth state [3, 25, 30]: they represent cloth as a mesh of material points and simulate its behavior subject to physical constraints. However, fitting those models to real data can be a complex task. Moreover, such models need not only to behave sufficiently like the real garment, but also to have a tractable dimensionality, for computational reasons. As an example, an \(8\times 8\) mesh representing a square towel results in a 192-dimensional state representation. Such dimensionality is unmanageable, not only in terms of computational cost, but also for building a tractable state-action space policy. Such is the case of [4], where simulated results are obtained only after hours of computation.

Hence, Dimensionality Reduction (DR) methods can be very beneficial. In [11], linear DR techniques were used for learning cloth manipulation by biasing the latent space projection with each execution's performance. Nonlinear methods, such as Gaussian Process Latent Variable Models (GPLVMs) [20], have also been applied for this purpose. In [18], a GPLVM was employed to project task-specific motor skills of the robot onto a much smaller state representation, whereas in [13] a GPLVM was used to represent a robot manipulation policy in a latent space, taking contextual features into account. However, these approaches focus the dimensionality reduction on the characterization of the robot action, rather than on the dynamics of the manipulated object. Instead, in [17] a GPLVM learns a latent representation of the cloth state from point clouds. However, that approach did not consider the dynamics of the cloth handling task, limiting its application to quasi-static manipulations.

In this paper, we assume to have recorded data from several cloth motions, in the form of a time-varying mesh of points. To fit such data into a tractable dynamical model, we consider Gaussian Process Dynamical Models (GPDMs), first introduced in [32], which are an extension of the GPLVM structure explicitly oriented to the analysis of high-dimensional time series. GPDMs have been applied in several fields, from human motion tracking [31, 33] to dynamic texture modeling [35]. In the context of cloth manipulation, GPDMs were adopted in [16] to learn a latent model of the dynamics of a cloth handling task. However, this framework, as it stands, lacks a component that is fundamental for correctly describing the dynamics of a controlled system, namely control actions, which limits its generalization capacity.

Fig. 1
figure 1

Latent trajectories predicted by a trained CGPDM in response to two different sequences of unseen actions. Each latent state is associated with a particular configuration of the cloth model (some are shown as examples)

Therefore, we propose here an extension of the GPDM structure that takes into account the influence of external control actions on the modeled dynamics. We call it the Controlled Gaussian Process Dynamical Model (CGPDM). In this new version, control actions directly affect the dynamics in the latent space. Thus, a CGPDM trained on a sufficiently diverse set of interactions is able to predict the effects of control actions never experienced before inside a space of reduced dimension, and then reconstruct high-dimensional motions by projecting the latent state trajectories back into the observation space. The CGPDM proved capable of fitting different types of cloth movements, in both a simulated and a real cloth manipulation scenario, and of predicting the results of control actions never seen during training (an example is reported in Fig. 1). Finally, we compared two possible CGPDM parameterizations. The first is a straightforward extension of the standard GPDM, whereas in the second we propose to employ squared exponential (SE) kernels with automatic relevance determination (ARD) [24] and inhomogeneous linear kernels, together with tunable scaling factors in the dynamical map, obtaining better accuracy and generalization, especially in the low-data regime.

To summarize, the main contributions of this article are:

  • The proposal of the CGPDM structure, an extension of the GPDM capable of taking into account the presence of exogenous inputs.

  • The definition of a richer parameterization able to achieve better accuracy and generalization w.r.t. the standard structure previously employed in the GPDM context.

  • The successful application of the proposed CGPDM to (both simulated and real) dynamic robotic cloth manipulation problems.

The remainder of the paper is structured as follows. Section 2 provides the details of the proposed CGPDM approach. Results obtained by CGPDM in cloth dynamics modeling are described in Sect. 3, both in simulation and in a real case scenario. Finally, the obtained results are discussed in Sect. 4 and conclusions are drawn in Sect. 5.

2 Methods

This section thoroughly describes the proposed method. We start by providing background notions about the models we build upon: GP, GPLVM, and GPDM (Sect. 2.1). Then, we present the CGPDM (Sect. 2.2), detailing the structure of its latent and dynamics maps. In particular, we present two alternative CGPDM structures: naive and advanced. The first is a straightforward inclusion of exogenous inputs into the standard GPDM, while the second is the proposed CGPDM, characterized by a richer parameterization. Finally, we conclude by describing the model training and prediction procedures (Sect. 2.3).

2.1 Background: From GP to GPDM

GPs [27] are the infinite-dimensional generalization of multivariate Gaussian distributions. They are stochastic processes such that, for any finite set of input locations \({\textbf{x}}_1, ..., {\textbf{x}}_n\), the random variables \(f({\textbf{x}}_1), ..., f({\textbf{x}}_n)\) have a joint Gaussian distribution. A GP is defined by its mean function \(m({\textbf{x}})\) and its kernel \(k({\textbf{x}}, {\textbf{x}}')\), which must be a symmetric and positive semi-definite function. GPs are usually denoted as \(f({\textbf{x}}) \sim \mathcal{G}\mathcal{P}(m({\textbf{x}}), k({\textbf{x}}, {\textbf{x}}'))\).

GPs can be used for regression models of the form \(y = f({\textbf{x}}) + \varepsilon \), with \(\varepsilon \) i.i.d. Gaussian noise, as they provide closed-form expressions to predict a new target \(y^*\), given a new input \({\textbf{x}}^*\). GP regression has been widely applied as a data-driven tool for dynamical system identification [15], usually describing each state component by its own GP. Nevertheless, such an approach struggles to scale to high-dimensional systems. Thus, DR strategies must be considered.
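As a concrete illustration, the closed-form GP predictive equations amount to a few lines of linear algebra. The following sketch implements a generic zero-mean GP regressor in Python; the function and the kernel callable are illustrative, not part of any released implementation.

```python
import numpy as np

def gp_posterior(X, y, x_star, kernel, noise_var):
    """Posterior mean and variance of y* = f(x*) + eps under a zero-mean GP.

    X: (N, d) training inputs; y: (N,) training targets; x_star: (d,) query;
    kernel(A, B) must return the (len(A), len(B)) Gram matrix.
    """
    K = kernel(X, X) + noise_var * np.eye(len(X))     # noisy train covariance
    k_s = kernel(X, x_star[None, :]).ravel()          # cross-covariance k(X, x*)
    mean = k_s @ np.linalg.solve(K, y)                # E[y* | data]
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] + noise_var \
        - k_s @ np.linalg.solve(K, k_s)               # Var[y* | data]
    return mean, var
```

For a D-dimensional state, system identification along these lines would require one such regressor per state component, which is exactly what becomes impractical as D grows.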

GPLVMs [20, 22] emerged as feature extraction methods that can be used as multiple-output GP regression models. Under a DR perspective, these models associate and learn low-dimensional representations of higher-dimensional observed data, assuming that the observed variables are determined by the latent ones. As a result of an optimization, GPLVMs provide a mapping from the latent space to the observation space, together with a set of latent variables representing the observed values. However, GPLVMs are not explicitly designed to deal with time series, where dynamics relate the values observed at consecutive time steps.

Thus, [32] first introduced Gaussian Process Dynamical Models (GPDMs), an extension of the GPLVM structure explicitly oriented to the analysis of high-dimensional time series. A GPDM essentially entails two components: (i) a latent map that projects high-dimensional observations to a low-dimensional latent space; (ii) a discrete-time Markovian dynamics that captures the evolution of the time series inside the reduced latent space. GPs are used to model both maps.

2.2 Controlled GPDM

Let us consider a system governed by unknown dynamics. At each time step t, \({\varvec{u}}_t \in {\mathbb {R}}^E\) represents the applied control action and \({\varvec{y}}_t \in {\mathbb {R}}^D\) the observation. For high-dimensional observation spaces, it can be infeasible to directly model the evolution of a sequence of observations in response to a series of inputs. For instance, in the case of a robot moving a piece of cloth, we can consider as control actions \({\varvec{u}}_t\) the instantaneous movements of the end-effector, while the observations \({\varvec{y}}_t\) can be the coordinates of a mesh of material points representing the cloth configuration. In this context, it is convenient to capture the dynamics of the system in a low-dimensional latent space \({\mathbb {R}}^d\), with \(d \ll D\). Let \({\varvec{x}}_t \in {\mathbb {R}}^d\) be the latent state associated with \({\varvec{y}}_t\). We propose a variation of the GPDM that takes into account the influence of control actions, while maintaining the dimensionality reduction properties of the original model. We call it the Controlled Gaussian Process Dynamical Model (CGPDM).

A CGPDM consists of a latent map (1) projecting observations \({\varvec{y}}_t\) into latent states \({\varvec{x}}_t\), and a dynamics map (2) that describes the evolution of \({\varvec{x}}_t\), subject to \({\varvec{u}}_t\). We denote the two maps as,

$$\begin{aligned}{} & {} {\varvec{y}}_t = g({\varvec{x}}_t) + {\varvec{n}}_{y,t}\text {,} \end{aligned}$$
(1)
$$\begin{aligned}{} & {} {\varvec{x}}_{t+1} - {\varvec{x}}_t = h({\varvec{x}}_t, {\varvec{u}}_t) + {\varvec{n}}_{x,t}\text {.} \end{aligned}$$
(2)

where \({\varvec{n}}_{y,t}\) and \({\varvec{n}}_{x,t}\) are two zero-mean isotropic Gaussian noise processes, while g and h are two unknown functions. Differently from the original GPDM, here the latent transition function (2) is also influenced by the exogenous control inputs \({\textbf{u}}_t\). Note that we consider \({\varvec{x}}_{t+1} - {\varvec{x}}_t\) to be the output of the CGPDM dynamics map; [33] suggested that this choice can improve the smoothness of the latent trajectories. In the following, we report how we modeled (1) and (2) by means of GPs, while Fig. 2 illustrates the relations assumed by the CGPDM between the latent, input, and output spaces along N time steps.

Fig. 2
figure 2

Symbolic representation of a CGPDM rollout along N time steps. Note how the output \({\varvec{y}}\) depends exclusively on the latent state \({\varvec{x}}\), while the control action \({\varvec{u}}\) influences only the latent dynamics

2.2.1 Latent variable mapping

Each component of the observation vector \({\varvec{y}}_t = [y_t^{(1)}, \dots , y_t^{(D)}]^T\) can be modeled a priori as a zero-mean GP that takes \({\varvec{x}}_t\) as input, for \(t=1,\dots ,N\). Let \({\textbf{Y}} = [ {\varvec{y}}_1,\dots , {\varvec{y}}_N]^T \in {\mathbb {R}}^{N \times D}\) be the matrix that collects the set of N observations, and \({\textbf{X}} = [ {\varvec{x}}_1,\dots , {\varvec{x}}_N]^T \in {\mathbb {R}}^{N \times d}\) the matrix of the associated latent states. We denote by \({\textbf{Y}}_{:,j}\) the vector containing the j-th components of all the N observations. Then, if we assume that the D observation components are independent variables, the probability over the whole set of observations can be expressed as the product of the D GPs. In addition, if we choose the same kernel function \(k_y(\cdot ,\cdot )\) for each GP, differentiated only through a variable scaling factor \(w_{y,j}^{-2}\), with \(j=1,\dots ,D\), the joint likelihood over the whole set of observations is given by

$$\begin{aligned} p({\textbf{Y}}\vert {\textbf{X}}) = \frac{\vert {\textbf{W}}_y\vert ^N}{\sqrt{(2\pi )^{ND} \vert {\textbf{K}}_y({\textbf{X}})\vert ^D}} \cdot \nonumber \\ \text {exp}\left( -\frac{1}{2} \text {tr}\left( \left( {\textbf{K}}_y({\textbf{X}})\right) ^{-1} {\textbf{Y}} {\textbf{W}}_y^2 {\textbf{Y}}^T\right) \right) \text {,} \end{aligned}$$
(3)

where \({\textbf{W}}_y=\text {diag}(w_{y,1},\dots , w_{y,D})\) and \({\textbf{K}}_y({\textbf{X}})\) is the covariance matrix defined element-wise by \(k_y(\cdot ,\cdot )\). The independence assumption may be relaxed by applying coregionalization models [1], at the cost of greater computational demands. In previous GPDM works [31,32,33], the GPs of the latent map were equipped with an isotropic SE kernel,

$$\begin{aligned} k_y'({\varvec{x}}_r, {\varvec{x}}_s) = \text {exp}\left( -\frac{\beta _1}{2}\vert \vert {\varvec{x}}_r-{\varvec{x}}_s\vert \vert ^2\right) + \beta _2^{-1} \delta ({\varvec{x}}_r,{\varvec{x}}_s)\text {,} \end{aligned}$$
(4)

with parameters \(\beta _1\) and \(\beta _2\) (\(\delta ({\varvec{x}}_r,{\varvec{x}}_s)\) denotes the Kronecker delta). Here, instead, we adopt a richer ARD structure for the SE kernel, characterized by a different length-scale for each latent state component:

$$\begin{aligned} k_y({\varvec{x}}_r, {\varvec{x}}_s) = \text {exp}\left( -\vert \vert {\varvec{x}}_r-{\varvec{x}}_s\vert \vert _{\varvec{\Lambda }_y^{-1}}\right) + \sigma _y^2 \delta ({\varvec{x}}_r,{\varvec{x}}_s)\text {.} \end{aligned}$$
(5)

\(\varvec{\Lambda }_y^{-1} = \text {diag}(\lambda _{y,1}^{-2},\dots ,\lambda _{y,d}^{-2})\) is a positive definite diagonal matrix, which weights the norm used in the SE function, and \(\sigma _y^2\) is the variance of the isotropic noise in (1). The trainable hyper-parameters of the latent map model are then \(\varvec{\theta }_y = \left[ w_{y,1},\dots , w_{y,D}, \lambda _{y,1},\dots ,\lambda _{y,d}, \sigma _y\right] ^T\).
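For reference, kernel (5) can be sketched in a few lines, reading the weighted norm \(\vert \vert \cdot \vert \vert _{\varvec{\Lambda }_y^{-1}}\) as the squared ARD distance; the code and its names are purely illustrative.

```python
import numpy as np

def k_y(Xr, Xs, lam_y, sigma_y):
    """ARD SE kernel of Eq. (5), plus noise on the diagonal of K_y(X).

    Xr: (n, d), Xs: (m, d) latent states; lam_y: (d,) length-scales;
    sigma_y: noise standard deviation of the latent map.
    """
    # scale each latent dimension by its own length-scale (ARD)
    diff = Xr[:, None, :] / lam_y - Xs[None, :, :] / lam_y
    K = np.exp(-np.sum(diff**2, axis=-1))
    if Xr is Xs:                        # Kronecker delta term: diagonal only
        K = K + sigma_y**2 * np.eye(len(Xr))
    return K
```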

2.2.2 Dynamics mapping

Similarly to Sect. 2.2.1, we can model a priori each component of the latent state difference \({\varvec{x}}_{t+1}-{\varvec{x}}_t = [x_{t+1}^{(1)}-x_t^{(1)}, \dots , x_{t+1}^{(d)}-x_t^{(d)} ]^T\) as a zero-mean GP that takes as input the pair \(({\varvec{x}}_t,{\varvec{u}}_t)\), for \(t=1,\dots ,N-1\).

Let \({\textbf{X}} = [ {\varvec{x}}_1,\dots , {\varvec{x}}_N]^T \in {\mathbb {R}}^{N\times d}\) be the matrix collecting the set of N latent states; we denote by \({\textbf{X}}_{r:s,i}\) the vector of the i-th components from time step r to time step s, with \(r,s=1,\dots ,N\). We indicate the vector of differences between consecutive latent states along their i-th component with \({\varvec{{\Delta }}}_{:,i} = ({\textbf{X}}_{2:N,i} - {\textbf{X}}_{1:N-1,i})\in {\mathbb {R}}^{N-1}\), and \({\varvec{{\Delta }}} = [{\varvec{{\Delta }}}_{:,1},\dots ,{\varvec{{\Delta }}}_{:,d} ]\in {\mathbb {R}}^{(N-1)\times d}\) is the matrix that collects the differences along all the components.

Finally, we compactly represent the GP input of the dynamics model as \(\tilde{{\varvec{x}}}_t = [{\varvec{x}}_t^T, {\varvec{u}}_t^T]^T \in {\mathbb {R}}^{d+E}\), and refer to the matrix collecting \(\tilde{{\varvec{x}}}_t\) for \(t=1,\dots ,N-1\) as \(\tilde{{\textbf{X}}} = \left[ \tilde{{\varvec{x}}}_1,\dots , \tilde{{\varvec{x}}}_{N-1}\right] ^T \in {\mathbb {R}}^{(N-1) \times (d+E)}\). Under assumptions similar to those made for the latent map, and denoting by \(k_x(\cdot ,\cdot )\) the kernel function common to all the GPs and by \(w_{x,i}\), \(i=1,\dots ,d\), the different scaling factors, the joint likelihood is given by

$$\begin{aligned}{} & {} p({\varvec{{\Delta }}}\vert \tilde{{\textbf{X}}}) = \frac{\vert {\textbf{W}}_x\vert ^{N-1}}{\sqrt{(2\pi )^{(N-1)d}\vert {\textbf{K}}_x(\tilde{{\textbf{X}}})\vert ^d}} \cdot \nonumber \\{} & {} \text {exp}\left( -\frac{1}{2} \text {tr}\left( \left( {\textbf{K}}_x(\tilde{{\textbf{X}}})\right) ^{-1} {\varvec{\Delta }} {\textbf{W}}_x^2 {\varvec{\Delta }}^T\right) \right) \text {,} \end{aligned}$$
(6)

where \({\textbf{W}}_x=\text {diag}(w_{x,1},\dots ,w_{x,d})\) and \({\textbf{K}}_x(\tilde{{\textbf{X}}})\) is the covariance matrix defined by \(k_x(\cdot ,\cdot )\). In the standard GPDM [32], the dynamics mapping GPs were proposed with constant scaling factors \(w_{x,i}=1\), for \(i=1,\dots ,d\), and equipped with a naive kernel resulting from the sum of an isotropic SE and a homogeneous linear function, with only four trainable parameters:

$$\begin{aligned} k_x'(\tilde{{\varvec{x}}}_r, \tilde{{\varvec{x}}}_s)= & {} \alpha _1\text {exp}\left( -\frac{\alpha _2}{2}\vert \vert \tilde{{\varvec{x}}}_r-\tilde{{\varvec{x}}}_s\vert \vert ^2\right) \nonumber \\{} & {} +\alpha _3 \tilde{{\varvec{x}}}_r^T\tilde{{\varvec{x}}}_s + \alpha _4^{-1} \delta (\tilde{{\varvec{x}}}_r,\tilde{{\varvec{x}}}_s)\text {.} \end{aligned}$$
(7)

Analogously to the latent mapping case, we decided to adopt the following kernel function,

$$\begin{aligned} k_x(\tilde{{\varvec{x}}}_r, \tilde{{\varvec{x}}}_s)= & {} \text {exp}\left( -\vert \vert \tilde{{\varvec{x}}}_r-\tilde{{\varvec{x}}}_s\vert \vert _{\varvec{\Lambda }_{x}^{-1}}\right) \nonumber \\{} & {} +{[}\tilde{{\varvec{x}}}_r^T, 1]\varvec{\Phi } {[}\tilde{{\varvec{x}}}_s^T, 1]^T + \sigma _x^2\delta (\tilde{{\varvec{x}}}_r,\tilde{{\varvec{x}}}_s) \text {.} \end{aligned}$$
(8)

\(\varvec{\Lambda }_{x}^{-1} = \text {diag}(\lambda _{x,1}^{-2},\dots ,\lambda _{x,d+E}^{-2})\) is a positive definite diagonal matrix, which weights the norm used in the SE component of the kernel. \(\varvec{\Phi } = \text {diag}(\phi _{1}^2,\dots ,\phi _{d+E+1}^2) \) is also a positive definite diagonal matrix, describing the linear component, and \(\sigma _x^2\) is the variance of the isotropic noise in (2). In comparison with (7), the adopted kernel weights the various components of the input differently, in both the SE and the linear parts, where the GP input is extended to \(\left[ \tilde{{\varvec{x}}}_s^T, 1\right] ^T\) to make the linear kernel inhomogeneous. The trainable hyper-parameters of the dynamics map model are then \(\varvec{\theta }_x = \left[ w_{x,1},\dots , w_{x,d}, \lambda _{x,1},\dots ,\lambda _{x,d+E}, \phi _{1},\dots ,\phi _{d+E+1}, \sigma _x\right] ^T\).
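Analogously, a minimal sketch of the advanced dynamics kernel (8), again with illustrative names and the same squared-ARD reading of the weighted norm:

```python
import numpy as np

def k_x(Xr, Xs, lam_x, phi, sigma_x):
    """Advanced dynamics kernel of Eq. (8): ARD SE + inhomogeneous linear.

    Xr: (n, d+E), Xs: (m, d+E) inputs [x_t, u_t]; lam_x: (d+E,) length-scales;
    phi: (d+E+1,) linear weights, the last one acting on the appended 1.
    """
    diff = Xr[:, None, :] / lam_x - Xs[None, :, :] / lam_x
    K_se = np.exp(-np.sum(diff**2, axis=-1))
    Ar = np.hstack([Xr, np.ones((len(Xr), 1))])   # extend inputs with a 1
    As = np.hstack([Xs, np.ones((len(Xs), 1))])
    K_lin = (Ar * phi**2) @ As.T                  # [x_r, 1] Phi [x_s, 1]^T
    K = K_se + K_lin
    if Xr is Xs:                                  # Kronecker delta term
        K = K + sigma_x**2 * np.eye(len(Xr))
    return K
```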

In the following, we will refer to the model that straightforwardly extends the standard GPDM structure from [32], using its same kernels (4), (7) and constant scaling factors, as the naive CGPDM; we denote by advanced CGPDM the proposed model, characterized by kernels (5), (8) and trainable scaling factors in the dynamics map. Although ARD kernels are commonly adopted in GP regression [27], they had not been tested before in GPDMs. Trainable scaling factors also constitute a novelty for this kind of model.

2.2.3 Working with multiple sequences

The CGPDM formulation can be easily extended to P multiple sequences of observations, \({\textbf{Y}}^{(1)}, \dots , {\textbf{Y}}^{(P)}\), and control inputs, \({\textbf{U}}^{(1)}, \dots , {\textbf{U}}^{(P)}\). Let the length of each sequence p, for \(p=1,\dots ,P\), be \(N_p\), with \(\sum _{p=1}^PN_p = N\), and define the latent states associated with each sequence as \({\textbf{X}}^{(1)}, \dots , {\textbf{X}}^{(P)}\). Following the notation of Sect. 2.2.2, define \(\tilde{{\textbf{X}}}^{(1)}, \dots , \tilde{{\textbf{X}}}^{(P)}\) as the aggregated matrices of latent states and control inputs, and \({\varvec{\Delta }}^{(1)}, \dots , {\varvec{\Delta }}^{(P)}\) as the difference matrices. Hence, the model joint likelihoods can be calculated by using the following concatenated matrices inside (3) and (6): \({\textbf{Y}} = [{\textbf{Y}}^{(1)T}\vert \dots \vert {\textbf{Y}}^{(P)T}]^T\), \({\textbf{X}} = [{\textbf{X}}^{(1)T}\vert \dots \vert {\textbf{X}}^{(P)T}]^T\), \({\varvec{\Delta }} = [{\varvec{\Delta }}^{(1)T}\vert \dots \vert {\varvec{\Delta }}^{(P)T}]^T\) and \(\tilde{{\textbf{X}}} = [\tilde{{\textbf{X}}}^{(1)T}\vert \dots \vert \tilde{{\textbf{X}}}^{(P)T}]^T\). Note that, when dealing with multiple sequences, the number of data points in the dynamics mapping becomes \(N-P\), and expression (6) must change accordingly.
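As an illustration, the aggregation amounts to simple concatenations, with the per-sequence differences ensuring that no transition crosses a sequence boundary (hypothetical helper, assuming the latent states of each sequence are available):

```python
import numpy as np

def aggregate_sequences(Y_seqs, U_seqs, X_seqs):
    """Build the concatenated matrices used in (3) and (6) from P sequences.

    Y_seqs / U_seqs / X_seqs: lists of (N_p, D), (N_p, E), (N_p, d) arrays.
    """
    Y = np.vstack(Y_seqs)                                    # (N, D)
    X = np.vstack(X_seqs)                                    # (N, d)
    # differences are taken inside each sequence only
    Delta = np.vstack([Xp[1:] - Xp[:-1] for Xp in X_seqs])   # (N - P, d)
    X_tilde = np.vstack([np.hstack([Xp[:-1], Up[:len(Xp) - 1]])
                         for Xp, Up in zip(X_seqs, U_seqs)])  # (N - P, d + E)
    return Y, X, Delta, X_tilde
```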

Fig. 3
figure 3

Flowchart summarizing the CGPDM training process. Given a set of training data Y and U (1) and a desired latent dimension d (2), the associated latent states X are initialized via PCA (3) and then optimized together with the other CGPDM hyper-parameters (4). After training, we obtain the probabilistic predictive model (5) and the set of optimized latent trajectories capturing the high-dimensional dynamics (6)

2.3 CGPDM training and prediction

Training a CGPDM entails using numerical optimization techniques to estimate the unknowns of the model, i.e., the latent states \({\textbf{X}}\) and the hyper-parameters \(\varvec{\theta }_x,\varvec{\theta }_y\). The latent coordinates \({\textbf{X}}\) are initialized by means of PCA [6], selecting the first d principal components of \({\textbf{Y}}\). A natural approach for training CGPDMs is then to maximize the joint log-likelihood \(\text {ln}\;p({\textbf{Y}}\vert {\textbf{X}}) +\text {ln}\;p({\varvec{\Delta }}\vert \tilde{{\textbf{X}}})\) w.r.t. \(\{{\textbf{X}}, \varvec{\theta }_x,\varvec{\theta }_y\}\). To do so, in this work we adopted the L-BFGS algorithm [9].

The overall loss to be optimized can be written as \({\mathcal {L}} = {\mathcal {L}}_y+ {\mathcal {L}}_x\), with \({\mathcal {L}}_y\) and \({\mathcal {L}}_x\) defined as

$$\begin{aligned}{} & {} {\mathcal {L}}_y = \frac{D}{2}\text {ln}\vert {\textbf{K}}_y({\textbf{X}})\vert + \frac{1}{2}\text {tr}({\textbf{K}}_y({\textbf{X}})^{-1}{\textbf{Y}} {\textbf{W}}_y^2 {\textbf{Y}}^T)-N \text {ln} \vert {\textbf{W}}_y\vert \text {,} \end{aligned}$$
(9)
$$\begin{aligned}{} & {} {\mathcal {L}}_x = \frac{d}{2}\text {ln}\vert {\textbf{K}}_x(\tilde{{\textbf{X}}})\vert \!+\! \frac{1}{2}\text {tr}({\textbf{K}}_x(\tilde{{\textbf{X}}})^{-1} {\varvec{\Delta }} {\textbf{W}}_x^2 {\varvec{\Delta }}^T) \!-\!(N-1) \text {ln} \vert {\textbf{W}}_x\vert \text {.}\nonumber \\ \end{aligned}$$
(10)

If the CGPDM is trained on multiple sequences of inputs and observations, the aggregated matrices defined in Sect. 2.2.3 must be employed when computing the loss functions (9)-(10), and the factor \(N-P\) must be used instead of \(N-1\) inside the \({\mathcal {L}}_x\) expression. The overall training procedure is represented schematically in Fig. 3.
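For illustration purposes only, the following sketch outlines the training procedure in PyTorch for a single sequence. `build_Ky` and `build_Kx` are hypothetical callables returning the Gram matrices of kernels (5) and (8) as differentiable tensors; their hyper-parameters, omitted from the optimizer here for brevity, would be optimized jointly in the full procedure.

```python
import torch

def train_cgpdm(Y, U, build_Ky, build_Kx, d, n_iter=100):
    """Single-sequence CGPDM training sketch: PCA init + L-BFGS on (9)+(10).

    Y: (N, D) zero-mean observations; U: (N-1, E) control actions.
    """
    N, D = Y.shape
    _, _, V = torch.pca_lowrank(Y, q=d)            # PCA init of latent states
    X = (Y @ V).detach().requires_grad_(True)      # (N, d)
    log_wy = torch.zeros(D, requires_grad=True)    # log scaling factors w_y
    log_wx = torch.zeros(d, requires_grad=True)    # log scaling factors w_x
    opt = torch.optim.LBFGS([X, log_wy, log_wx], max_iter=n_iter)

    def closure():
        opt.zero_grad()
        Delta = X[1:] - X[:-1]                     # latent differences
        X_tilde = torch.cat([X[:-1], U], dim=1)    # dynamics GP inputs
        Ky, Kx = build_Ky(X), build_Kx(X_tilde)
        # Eq. (9): tr(Ky^{-1} Y Wy^2 Y^T) computed column by column
        Ly = 0.5 * D * torch.logdet(Ky) - N * log_wy.sum() \
            + 0.5 * (torch.linalg.solve(Ky, Y) * Y * torch.exp(2 * log_wy)).sum()
        # Eq. (10): same structure on the latent differences
        Lx = 0.5 * d * torch.logdet(Kx) - (N - 1) * log_wx.sum() \
            + 0.5 * (torch.linalg.solve(Kx, Delta) * Delta
                     * torch.exp(2 * log_wx)).sum()
        loss = Ly + Lx
        loss.backward()
        return loss

    opt.step(closure)
    return X.detach(), log_wy.detach(), log_wx.detach()
```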

A trained CGPDM can be used to fulfill two different purposes: (i) map a given new latent state \({\varvec{x}}_t^*\) to the corresponding \({\varvec{y}}_t^*\) in observation space, and (ii) predict the evolution of the latent state at the next time step \({\varvec{x}}_{t+1}^*\), given \({\varvec{x}}_{t}^*\) and a certain control \({\varvec{u}}_{t}^*\). Together, the two processes can predict the observations produced by a given series of control actions.

2.3.1 Latent prediction

Given \({\varvec{x}}_t^*\), its corresponding \({\varvec{y}}_t^*\) is distributed as \(p({\varvec{y}}_t^*\vert {\varvec{x}}_t^*, {\textbf{X}}, \varvec{\theta }_y) = {\mathcal {N}}(\varvec{\mu }_y({\varvec{x}}_t^*),v_y({\varvec{x}}_t^*){\textbf{W}}_y^{-2})\), with

$$\begin{aligned}{} & {} \varvec{\mu }_y({\varvec{x}}_t^*) = {\textbf{Y}}^T{\textbf{K}}_y({\textbf{X}})^{-1} {\varvec{k}}_y({\varvec{x}}_t^*,{\textbf{X}})\end{aligned}$$
(11)
$$\begin{aligned}{} & {} v_y({\varvec{x}}_t^*) = k_y({\varvec{x}}_t^*,{\varvec{x}}_t^*) -{\varvec{k}}_y({\varvec{x}}_t^*,{\textbf{X}})^T {\textbf{K}}_y({\textbf{X}})^{-1} {\varvec{k}}_y({\varvec{x}}_t^*,{\textbf{X}}) \text {,}\nonumber \\ \end{aligned}$$
(12)

where \({\varvec{k}}_y({\varvec{x}}_t^*,{\textbf{X}}) = \left[ k_y({\varvec{x}}_t^*,{\varvec{x}}_1),\dots , k_y({\varvec{x}}_t^*,{\varvec{x}}_N)\right] ^T\).
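In code, Eqs. (11)-(12) translate directly into a few lines of linear algebra. The sketch below uses illustrative names and assumes a `k_y` callable returning Gram matrices of kernel (5):

```python
import numpy as np

def latent_map_predict(x_star, X, Y, k_y, w_y):
    """Eqs. (11)-(12): mean and variance of y* for a new latent state x*.

    X: (N, d) latent states; Y: (N, D) observations; w_y: (D,) scaling factors.
    """
    Ky = k_y(X, X)                                    # (N, N)
    ks = k_y(X, x_star[None, :]).ravel()              # (N,) k_y(x*, X)
    mean = Y.T @ np.linalg.solve(Ky, ks)              # Eq. (11)
    v = k_y(x_star[None, :], x_star[None, :])[0, 0] \
        - ks @ np.linalg.solve(Ky, ks)                # Eq. (12)
    return mean, v / w_y**2                           # covariance v * W_y^{-2}
```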

2.3.2 Dynamics prediction

Given \({\varvec{x}}_t^*\) and \({\varvec{u}}_t^*\), let us define \(\tilde{{\varvec{x}}}_t^*=[{\varvec{x}}_t^{*T},{\varvec{u}}_t^{*T}]^T\). The probability density of the latent state at the next time step \({\varvec{x}}_{t+1}^*\) is \(p({\varvec{x}}_{t+1}^*\vert \tilde{{\varvec{x}}}_t^*, {\textbf{X}}, \varvec{\theta }_x) = {\mathcal {N}}(\varvec{\mu }_x({\varvec{x}}_t^*),v_x({\varvec{x}}_t^*){\textbf{W}}_x^{-2})\), with

$$\begin{aligned}{} & {} \varvec{\mu }_x({\varvec{x}}_t^*) = {\varvec{x}}_t^* + {\varvec{\Delta }}^T {\textbf{K}}_x(\tilde{{\textbf{X}}})^{-1} {\varvec{k}}_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\textbf{X}}})\text {,}\end{aligned}$$
(13)
$$\begin{aligned}{} & {} v_x({\varvec{x}}_t^*) \!=\! k_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\varvec{x}}}_t^*) -{\varvec{k}}_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\textbf{X}}})^T{\textbf{K}}_x(\tilde{{\textbf{X}}})^{-1} {\varvec{k}}_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\textbf{X}}})\text {,}\nonumber \\ \end{aligned}$$
(14)

with \({\varvec{k}}_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\textbf{X}}})=\left[ k_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\varvec{x}}}_1),\dots , k_x(\tilde{{\varvec{x}}}_t^*,\tilde{{\varvec{x}}}_{N-1})\right] ^T\).

2.3.3 Trajectory prediction

Starting from an initial latent state \({\varvec{x}}_1^*\), one can predict the system evolution over a desired horizon of length \(N_d\), when subject to a given sequence of control actions \({\varvec{u}}_1^*,\dots ,{\varvec{u}}_{N_d-1}^*\). At each time step \(t=1,\dots ,N_d-1\), \({\varvec{x}}_{t+1}^*\) can be sampled from the normal distribution \(p({\varvec{x}}_{t+1}^*\vert \tilde{{\varvec{x}}}_t^*, {\textbf{X}}, \varvec{\theta }_x)\) defined in Sect. 2.3.2. Hence, the generated latent trajectory \({\varvec{x}}_1^*,\dots ,{\varvec{x}}_{N_d}^*\) can be mapped into the associated sequence of observations \({\varvec{y}}_1^*,\dots ,{\varvec{y}}_{N_d}^*\) through the previously defined probability distribution \(p({\varvec{y}}_t^*\vert {\varvec{x}}_t^*, {\textbf{X}}, \varvec{\theta }_y)\).
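A possible sketch of this rollout is reported below; `dyn_predict` and `lat_predict` are hypothetical wrappers implementing Eqs. (13)-(14) and (11)-(12), respectively.

```python
import numpy as np

def rollout(x1, U_star, dyn_predict, lat_predict, sample=True, rng=None):
    """Trajectory prediction of Sect. 2.3.3: latent rollout, then decoding.

    dyn_predict(x, u) -> (mean, var) of x_{t+1}; lat_predict(x) -> (mean, var)
    of y_t. Variances are per-dimension, as implied by the W^{-2} scaling.
    """
    rng = rng if rng is not None else np.random.default_rng()
    X_star = [np.asarray(x1)]
    for u in U_star:                                   # N_d - 1 actions
        mu, var = dyn_predict(X_star[-1], u)
        x_next = rng.normal(mu, np.sqrt(var)) if sample else mu
        X_star.append(x_next)
    Y_star = [lat_predict(x)[0] for x in X_star]       # decode mean observations
    return np.array(X_star), np.array(Y_star)
```

Propagating the predictive mean (`sample=False`) yields a deterministic prediction, while sampling accounts for model uncertainty along the rollout.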

3 Results

We employed the proposed CGPDM to model the high-dimensional dynamics that characterize the motion of a piece of cloth held by a robotic system. This section reports the results obtained in two sets of experiments: a simulated session (Sect. 3.1) and one conducted on a real setup (Sect. 3.2). We exploited simulation to assess the performance of the CGPDM over a wide set of scenarios (different amounts of training data, motion ranges, and model structures), while the real-world experiment served as validation on non-synthetic data. The objective of the experiments was to learn the high-dimensional cloth dynamics using the CGPDM, in order to make predictions about cloth movements in response to sequences of actions not seen during training. In particular, we aimed to evaluate how model prediction accuracy is affected by:

  • the amount of data used for training,

  • the oscillation range of the cloth movements,

  • the use of advanced or naive CGPDM structures (as defined in Sect. 2.2).

Such a high-dimensional task would be infeasible to model with standard GP regression without DR. CGPDMs were implemented in Python, employing PyTorch [26].

3.1 Simulated cloth experiment

In the simulated scenario, we considered a bimanual robot moving a square piece of cloth by holding its two upper corners, as shown in Fig. 4. The cloth was modeled as an 8\(\times \)8 mesh of material points. We assumed that the two upper corner points are attached to the robot's end-effectors, while the other points move freely, following the dynamical model proposed in [12].

Fig. 4
figure 4

Simulated setup for cloth manipulation with bimanual robot. The cloth is positioned in its starting configuration

Fig. 5
figure 5

Oscillation ranges \(R\in \{30{^\circ },60{^\circ },90{^\circ },120{^\circ }\}\) defining the sampling intervals for \(\gamma \) during data collection

Fig. 6
figure 6

True (top) and predicted (bottom) simulated cloth oscillation frames for one of the considered test trajectories

In this context, the observation vector is given by the Cartesian coordinates of all the points in the mesh (measured in meters); hence \({\varvec{y}}_t\in {\mathbb {R}}^D\) with \(D=192\). We assumed exact control of the two robot arms in the operational space, keeping the same orientation and relative distance between the two end-effectors and producing oscillations in the Y-Z plane. Thus, we considered as control actions the differences between consecutive commanded end-effector positions along the Y and Z directions, resulting in \({\varvec{u}}_t\in {\mathbb {R}}^E\) with \(E=2\).

3.1.1 Data collection

Training and test data were obtained by recording the mesh trajectories associated with several types of cloth oscillation, obtained by applying different sequences of control actions. All the considered trajectories start from the same cloth configuration and last 5 seconds. Observations were recorded at 20 Hz; hence, each sequence consists of \(N=100\) steps.

The robot end-effectors move in a coordinated fashion, drawing oscillations on the Y-Z plane. Let \({\varvec{u}}_t = \left[ \Delta ee^Y_t, \Delta ee^Z_t\right] ^T\), where \(\Delta ee^Y_t\) and \(\Delta ee^Z_t\) indicate the differences between consecutive commanded end-effector positions along the Y and Z axes. Specifically, their values were given by the following periodic expressions:

$$\begin{aligned} \Delta ee^{Y}_t= & {} -A\cdot \text {cos}(2\pi f_{Y} t) \,\text {cos}(\gamma )\text {,}\nonumber \\ \Delta ee^{Z}_t= & {} A\cdot \text {cos}(2\pi f_{Z} t) \,\text {sin}(\gamma )\text {.} \end{aligned}$$
(15)

Such controls make the end-effectors oscillate on the Y-Z plane of the operational space. The maximum displacement is regulated by A, which we set to 0.01 meters. The parameter \(\gamma \) can be interpreted as the inclination of \({\varvec{u}}_1\) w.r.t. the horizontal, and it loosely defines the direction of the oscillation. \(f_Y\) and \(f_Z\) define the frequencies of the oscillations along the Y and Z axes: if they are similar, the end-effectors move mostly along the direction defined by \(\gamma \); if not, they sweep a broader space.

In order to obtain a heterogeneous set of trajectories for the composition of training and test sets, we collected several movements obtained by randomly choosing the control parameters \(\gamma \), \(f_Y\) and \(f_Z\). Angles \(\gamma \) were uniformly sampled inside a variable range \([-\frac{R}{2},\frac{R}{2}]\) (deg); in the following, we refer to this range by its angular width R (deg). Frequencies \(f_Y\) and \(f_Z\) were instead uniformly sampled inside the fixed interval [0.3, 0.6] (Hz). We considered four movement ranges of increasing width, namely \(R\in \{30{^\circ },60{^\circ },90{^\circ },120{^\circ }\}\) (Fig. 5), and collected a specific dataset \({\mathcal {D}}_R\) associated with each range. Every set contains 50 cloth trajectories obtained by applying control actions of the form (15) with 50 different random choices of the parameters \(\gamma \), \(f_Y\) and \(f_Z\). From each \({\mathcal {D}}_R\), 10 trajectories were extracted and used as the test set \({\mathcal {D}}_R^{test}\) for the corresponding movement range, while several training sets \({\mathcal {D}}_R^{train}\) were built by randomly picking from the remaining sequences.
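For clarity, the control generation described above can be sketched as follows; whether t in (15) denotes seconds or the step index is an implementation detail, and here we assume \(t = k\cdot dt\) with \(dt = 1/20\) s.

```python
import numpy as np

def sample_control_sequence(R_deg, N=100, A=0.01, f_range=(0.3, 0.6),
                            dt=0.05, rng=None):
    """Generate one random control sequence of the form (15).

    gamma ~ U[-R/2, R/2] (deg) and f_Y, f_Z ~ U[f_range] (Hz), as in the
    data collection of Sect. 3.1.1.
    """
    rng = rng if rng is not None else np.random.default_rng()
    gamma = np.deg2rad(rng.uniform(-R_deg / 2, R_deg / 2))
    f_Y, f_Z = rng.uniform(*f_range, size=2)
    t = np.arange(N - 1) * dt                          # 20 Hz time grid
    dee_Y = -A * np.cos(2 * np.pi * f_Y * t) * np.cos(gamma)
    dee_Z = A * np.cos(2 * np.pi * f_Z * t) * np.sin(gamma)
    return np.stack([dee_Y, dee_Z], axis=1)            # (N-1, 2) actions u_t
```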

3.1.2 Model training

In all the models, we adopted a latent space of dimension \(d=3\), resulting in a dimensionality reduction factor of \(D/d=64\). This value of d was chosen empirically after preliminary tests, and it allows us to easily visualize the behavior of the latent variables in a three-dimensional space, see for instance Fig. 1. Other choices are possible, but such a sensitivity analysis is outside the scope of this experimental analysis.

The objective of the experiment was to evaluate the CGPDM prediction accuracy at different movement ranges and for different amounts of training data. Moreover, we wanted to observe whether the use of the proposed advanced CGPDM structure yields a substantial difference in accuracy when compared to the naive model. Consequently, for each considered movement range R, we trained two different sets of CGPDMs, adopting the naive structure in one and the advanced one in the other. Each model in the two sets was trained employing an increasing number of sequences randomly picked from \({\mathcal {D}}_R^{train}\). Specifically, we used five different random combinations of 5, 10, 15 and 20 sequences for each oscillation range (varying the random seed each time). In this way, we reduced the dependency on the specific training trajectories considered, and averaged prediction accuracy over different possible sets of training data.

3.1.3 Model prediction

We used each learned CGPDM to predict the cloth movements when subject to the control actions observed for each test sequence inside \({\mathcal {D}}_R^{test}\), with \(R\in \{30{^\circ },60{^\circ },90{^\circ },120{^\circ }\}\). Let \({\varvec{y}}_t^{(R,k)}\) and \({\varvec{u}}_t^{(R,k)}\) denote, respectively, the observation and control action at time step t of the k-th test trajectory in \({\mathcal {D}}_R^{test}\) (with \(k=1,\dots ,10\)).

For each considered range R, one can follow the procedure of Sect. 2.3.3 and employ the trained CGPDMs to predict the trajectories resulting from the application of \(\{{\varvec{u}}_t^{(R,k)}\}_{t=1}^{N-1}\), for \(k=1,\dots ,10\). Let \({\varvec{x}}_t^{*(R,k)}\) be the predicted latent state at time t, and \({\varvec{y}}_t^{*(R,k)}\) the corresponding predicted observation. As an example, Fig. 6 shows a sequence of true and predicted cloth configurations for one of the considered test trajectories. Please refer to the accompanying video for a clearer visualization of the obtained results.

For every predicted trajectory, we measured the average distance between the real and the predicted mesh points. Figure 7 represents the observed errors by means of boxplots, also indicating the statistical significance of the naive-advanced difference in each experiment configuration (T-tests performed using the open-source library Statannotations). Moreover, Table 1 reports the average distances between true and predicted mesh points obtained on the test sets by the different CGPDM configurations in all the movement ranges. Results are expressed in terms of means and 95% confidence intervals obtained by averaging over the different training sets adopted (all the experiments were repeated five times, each using a randomly composed \({\mathcal {D}}_R^{train}\)).
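The error metric is a simple average of point-wise Euclidean distances. A sketch is given below, where the (point, coordinate) layout of the flattened observations is an assumption of the example:

```python
import numpy as np

def mean_mesh_error(Y_true, Y_pred, n_points=64):
    """Average Euclidean distance between true and predicted mesh points.

    Y_true, Y_pred: (N, 192) observation trajectories; each row is assumed
    to flatten a (64, 3) mesh of 3D points.
    """
    P_true = Y_true.reshape(-1, n_points, 3)
    P_pred = Y_pred.reshape(-1, n_points, 3)
    return np.linalg.norm(P_true - P_pred, axis=-1).mean()
```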

Fig. 7
figure 7

Boxplots representing the test prediction errors obtained by the advanced and the naive CGPDM structures at different oscillation ranges in the simulation experiment. Each configuration was tested with five different randomly composed \({\mathcal {D}}_R^{train}\). Mean values are indicated with red triangles, and the statistical significance of the T-tests comparing advanced and naive results is represented with the following notation: \(\texttt {ns}: 5.0\textrm{e}{-2}< p \le 1.0\), \(*: 1.0\textrm{e}{-2} < p \le 5.0\textrm{e}{-2}\), \(**: 1.0\textrm{e}{-3} < p \le 1.0\textrm{e}{-2}\), \(***: 1.0\textrm{e}{-4} < p \le 1.0\textrm{e}{-3}\), \(****: p \le 1.0\textrm{e}{-4}\)

Table 1 Mean prediction errors (with 95% C.I.) between true and predicted cloth trajectories obtained by the advanced and the naive CGPDM structures at different oscillation ranges in both the simulated and real-world experiments (the number of sequences used for training is indicated inside the square brackets)

3.2 Real cloth experiment

In this second set of experiments, we tested the CGPDM on data collected in a real cloth manipulation scenario. For this purpose, we used a Barrett WAM Arm, whose end-effector consists of a coat rack that can firmly grip a piece of cloth from its corners. The overall setup is depicted in Fig. 8. We controlled the robot's end-effector in position, recording the resulting movement of the cloth through a motion capture system based on information extracted from an RGBD camera. We combined object detection, image processing, and point cloud processing for segmenting cloth-like objects, following [7, 28] and [34].

3.2.1 Data collection

As in the simulated scenario, we captured the cloth as an 8\(\times \)8 mesh of points, whose spatial coordinates constitute the observation vector \({\varvec{y}}_t\in {\mathbb {R}}^D\) with \(D=192\).

Control actions were again defined following expressions (15) and commanded to the robot at 100 Hz. Parameters \(f_Y\) and \(f_Z\) were uniformly sampled within [0.2, 0.5] (Hz), and A was set to 0.004 meters. In this experiment, we considered only the \(R=30^\circ \), \(R=60^\circ \), and \(R=90^\circ \) oscillation ranges (\(R=120^\circ \) was excluded because of robot workspace limitations).

The motion capture system could only work at rates lower than 100 Hz, with no guaranteed sampling interval length. Thus, it was necessary to post-process the data to make them suitable for modeling. Firstly, the motion capture data were smoothed by a moving average filter. Then, we interpolated the positions of both the end-effector and the cloth mesh, to obtain two synchronized sequences of observations and control actions sampled at 20 Hz. For each of the three ranges, we collected 10 trajectories, each 3 seconds long.
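A minimal sketch of this post-processing, assuming an increasing vector of raw timestamps and an arbitrarily chosen filter window:

```python
import numpy as np

def smooth_and_resample(t_raw, z_raw, t_new, win=5):
    """Moving-average smoothing followed by linear resampling to t_new.

    t_raw: (M,) increasing raw timestamps; z_raw: (M, D) raw signal;
    t_new: (K,) target 20 Hz grid; win: filter window (an assumed value).
    """
    kernel = np.ones(win) / win
    z_smooth = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0, z_raw)
    # interpolate each signal dimension onto the regular grid
    return np.stack([np.interp(t_new, t_raw, z_smooth[:, j])
                     for j in range(z_raw.shape[1])], axis=1)
```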

Fig. 8
figure 8

Real experimental setup with the Barrett WAM Arm holding a piece of cloth whose motion can be tracked by a RGBD camera

3.2.2 Model training & prediction

For every considered oscillation range, we trained two sets of CGPDMs, one using the naive and one the advanced model structure. Each set of trajectories is composed of 10 sequences; hence, we followed a cross-validation procedure for training and testing the models. At every range, we trained the models using all the sequences but one, left out for testing, and repeated the procedure ten times, varying the test sequence each time.

Fig. 9
figure 9

True and predicted mesh points for one of the recorded real cloth movements

Fig. 10
figure 10

Boxplots representing the test prediction errors obtained by the advanced and the naive CGPDM structures at different oscillation ranges in the real-world experiment. Each configuration was tested following a cross-validation method on a set of 10 cloth trajectories (using 9 for training and one for testing). Mean values are indicated with red triangles, and the statistical significance of the T-tests comparing advanced and naive results is represented following the same notation as Fig. 7

The models were used to predict the cloth movements obtained in response to the control actions of each test trajectory, measuring the average distance between the real and the predicted mesh points. In Fig. 9, we provide a visual representation of the cloth movements, showing the true and predicted trajectories of a subset of mesh points in one of the test cases. Please refer to the accompanying video for a better visualization of the obtained results. Similarly to the simulated experiment, Fig. 10 represents the observed errors by means of boxplots, and the last row of Table 1 reports the mean distances between true and predicted mesh points obtained in all the considered movement ranges.

4 Discussion

The experimental results obtained in simulation confirm the capacity of the CGPDM to capture the cloth dynamics of oscillations along the Y and Z axes. When trained with a sufficient amount of data, CGPDMs obtained satisfying results over a variety of movement ranges. Training with only five sequences seems insufficient to properly capture the considered dynamics. When moving from 10 to 15 training sequences, the observed errors diminish significantly, while working with 20 training trajectories generates minor signs of over-fitting.

For smaller movement ranges (\(R=30^\circ \) or \(R=60^\circ \)), the reconstructed trajectories of the mesh of points appear similar to the true ones. Conversely, for wider ranges (\(R=90^\circ \) or \(R=120^\circ \)), discrepancies between true and predicted points begin to be more evident, but the CGPDMs are still able to capture the overall movement of the cloth.

Moreover, the proposed advanced CGPDM structure significantly improves the accuracy and consistency of the results in the majority of cases, when compared to the naive model. This effect is clearest in the low-data regime and when dealing with wide oscillation ranges.

Finally, the results obtained in the real-world experiments confirm the trends observed in the simulated scenario. The advanced CGPDM structure drastically outperforms the naive model, which seems unable to cope with the high noise that afflicts the real experimental setup.

5 Conclusion

We presented the CGPDM, a modeling framework for high-dimensional dynamics governed by control actions. Essentially, this model projects observations into a latent space of low dimension, where dynamical relations are easier to infer. CGPDMs were applied to a robotic cloth manipulation task, where the observations are the coordinates of the points of a cloth mesh. We tested CGPDMs in both simulated and real experiments. The observed results empirically demonstrate that the proposed advanced CGPDM structure can capture the complex high-dimensional cloth dynamics from a small number of training trajectories, leveraging the data efficiency that characterizes GP-based methods.

In future work, we aim to apply the CGPDM within Model-Based Reinforcement Learning algorithms (such as [2, 10]) to automatically learn control policies for high-dimensional systems. Moreover, the CGPDM formulation could be extended through the introduction of back constraints [21], to preserve local distances and obtain an explicit mapping from the observation space to the latent space. Finally, the integration of context variables within the CGPDM formulation could allow generalization over different types of cloth fabric.