Analysis of Methods for Incremental Policy Refinement by Kinesthetic Guidance

Traditional robot programming is often not feasible in small-batch production, as it is time-consuming, inefficient, and expensive. To shorten the time necessary to deploy robot tasks, we need appropriate tools to enable efficient reuse of existing robot control policies. Incremental Learning from Demonstration (iLfD) and reversible Dynamic Movement Primitives (DMP) provide a framework for efficient policy demonstration and adaptation. In this paper, we extend our previously proposed framework with improvements that provide better performance and lower the algorithm’s computational burden. Further, we analyse the learning stability and evaluate the proposed framework with a comprehensive user study. The proposed methods have been evaluated on two popular collaborative robots, Franka Emika Panda and Universal Robot UR10.


Introduction
Despite recent changes in the manufacturing paradigm, robots are still predominantly used in large-scale and large-batch production plants such as the automotive and electronics industries [1]. The global competitive arena forces companies to react quickly to market demands while maintaining economic feasibility. Consequently, robotized production plants should adapt to handle a wide variety of products by frequent reconfiguration of both hardware sub-components and robot programs [2]. Another issue is to shorten the robot programming time and reduce the required skill level of the operators [3]. Traditional robot programming requires profound knowledge of computer science and an understanding of issues arising from robot kinematics and dynamics [4]. Learning from Demonstration (LfD), often referred to as Imitation Learning, is a promising and powerful technology that can, among others, address the issues mentioned above [5,6]. Within this framework, robot tasks are demonstrated naturally rather than coded in a robot-oriented programming language or programmed using expensive and complex robot simulation systems. It is important to emphasize that the LfD concept goes beyond simply showing and copying a policy performed by a demonstrator. It includes many aspects such as generalization of the robot policy to different contexts, the ability to adapt to changes in the environment, and refinement of the captured policy [5,7]. Within LfD, techniques like Reinforcement Learning (RL) [8] or Iterative Learning Control (ILC) [9] are often used for autonomous adaptation of the policy. ILC generally adapts the policy in a few iteration cycles, but it requires at least partial knowledge of the environment and robot dynamics. RL, on the other hand, is generally model-free but needs lengthy adaptation and is currently not yet considered a mature technology for real-world applications.
If the robot is not required to act autonomously, it is more efficient to adapt the existing policy through human-robot interaction, e.g., by utilizing kinesthetic guidance. During this skill transfer from human to robot, often referred to as coaching, it is essential that we do not demonstrate the policy over and over again from the beginning but only gradually refine the existing policy. This issue is efficiently addressed by incremental learning.

Related Work
Incremental learning has been studied intensively in the past decade. During incremental task demonstration, the demonstrator must have the ability to change only the parts of the trajectory that require correction, leaving the rest unchanged. Calinon and Billard [10] incrementally updated the policy encoded with Gaussian Mixture Models while learning gestures for a humanoid robot. Kulic et al. [11] dealt with the problem of incremental learning and simultaneous segmentation of human body motions using Hidden Markov Models. Gams et al. [12] studied how to incrementally refine periodic movements using either visual or force feedback. A framework for incremental learning with kinesthetic guidance was proposed by Lee and Ott [13]. Although kinesthetic guiding is intuitive, it is often impractical as the user can only move a limited number of joints simultaneously. In [13], they proposed incremental policy adaptation along a refinement tube, which allows the motion of the joints to be limited in a controlled way during teaching. Incremental adaptation of context-dependent robot skills was studied in [14], where probabilistic movement primitives and the Iterative Learning Control framework were applied. Incremental learning can also be implemented using vocal commands, as demonstrated in [15]. Incremental learning by combining several demonstrations and using virtual tool dynamics has been proposed in [16]. To assist kinesthetic modifications of the learned behavior in task variations, the concepts of virtual fixtures [17] and penetrable virtual fixtures [18] were recently proposed.
A typical policy consists of both spatial and temporal parts, and it is often beneficial to treat them separately. Since complex movements can be demonstrated more accurately at low speed [19], it is also necessary to allow for a separate refinement of the spatial and temporal parts of the task. This important aspect was not considered in the aforementioned approaches. In [20], we proposed to first refine the spatial part, i.e., the shape of the trajectory, at an arbitrary speed, and introduced the concept of variable stiffness along Frenet-Serret (FS) frames. The desired speed profile was learned in the last stage of learning. Another essential feature of the proposed approach is the ability to incrementally refine the movement by moving back and forth along the trajectory during the demonstration, i.e., to update the path in multiple passes. In [21], batch regression was used to incrementally update the parameters of the existing policy. An alternative way, proposed in [22], applies recursive regression, which updates the policy parameters directly and avoids saving the entire set of coordinates captured during the refinement.

Main Contributions
This paper is an extension of work originally presented at the 19th International Conference on Advanced Robotics [22], which presented a framework for efficient incremental learning from demonstration aiming to directly modify the policy parameters by recursive regression technique.
In this paper, we further improve our previously presented incremental learning algorithms [20,22] with several modifications that enhance the kinesthetic feeling during policy corrections and improve the learning accuracy.
• We introduce Constant Speed Cartesian DMPs, a modified formulation of DMPs that follows the idea of Arc Length DMPs (AL-DMPs) [23]. Using this formulation, we can completely decouple the spatial and temporal components of the task, which contributes to more efficient learning.
• Next, we refine the original DMP policy by learning offsets instead of modifying the DMP parameters, which diminishes the computational burden and improves the long-term stability of the policy [24].
• Using sigmoid activation functions, we provide a soft start and stop when pushing along the trajectory, enhancing the kinesthetic feeling.
• The encoding accuracy during incremental learning has been improved by replacing entire sections of the trajectory and implementing a new phase detection algorithm.
Additionally, we compare different incremental learning schemes for admittance- or impedance-controlled robots, with the policy updated by either recursive or batch regression, and provide a stability analysis of these incremental learning schemes. Finally, we conduct a user study to statistically evaluate their performance and discuss the benefits and weaknesses of the proposed algorithms.
The paper is organized into five sections. In Section 2, we revisit Cartesian DMPs, introduce DMPs with a constant speed profile, and provide extensions that enable the reversibility of the policies. Procedures for incremental refinement of an existing policy using kinesthetic guidance are given in Section 3. Section 4 analyses the stability of these procedures, describes the experimental verification of our framework, and presents the statistical evaluation of the user study. Finally, in Section 5 we highlight the contributions and benefits of the proposed methods with respect to learning efficiency and ease of implementation.

Policy Encoding using Dynamic Movement Primitives
The initial step of every policy refinement procedure is to encode the original trajectory in a suitable parametric form, often referred to as a Movement Primitive (MP). Our approach relies on Dynamic Movement Primitives (DMP) [25], but other representations could also be used.
We apply a modified version of DMPs, which handles non-uniform velocity scaling [26] and encodes Cartesian space policies using unit quaternions [27]. In this work, we propose DMPs with a constant speed profile to fully decouple the spatial and temporal parts of the task. Next, we revisit and refine further extensions of the DMP framework that handle reversible policies, i.e., policies that allow us to follow the same path both forward and backward [20,22].

Speed Scaled Cartesian DMPs
The original policy π(t), where t is the time, can be obtained in various ways, e.g. by a motion tracker or kinesthetic guidance. In all cases, we acquire the Cartesian space tool center point (TCP) positions and orientations, velocities, angular velocities and accelerations at times t_k, k = 0, ..., T, in the form

G_d = { p_k, ṗ_k, p̈_k, q_k, ω_k, ω̇_k, t_k },  k = 0, ..., T.    (1)

Positions and orientations are represented as a frame {p, q}, where p ∈ R^3 and q = v + u, v ∈ R, u ∈ R^3, is a unit quaternion. First, we briefly review the foundations of Cartesian dynamic movement primitives (CDMP) for discrete tasks. Within this framework, a frame {p, q} is generated by integrating the nonlinear dynamic system described by the following equations

τ ż = ν(s) ( α_z ( β_z (g_p − p) − z ) + f_p(s) ),    (2)
τ ṗ = ν(s) z,    (3)
τ η̇ = ν(s) ( α_z ( β_z 2 log(g_o ∗ q̄) − η ) + f_o(s) ),    (4)
τ q̇ = ν(s) (1/2) η ∗ q,    (5)
τ ṡ = −ν(s) α_s s,    (6)

where s ∈ R is the phase variable, z, η ∈ R^3 are auxiliary variables, τ ∈ R is the duration of the policy, and ν(s) ∈ R is a nonlinear temporal scaling factor. The scalar α_z > 0 sets the convergence rate of the underlying second order linear dynamic system; setting β_z = α_z/4 makes it critically damped. α_s sets the decay rate of the phase variable s. The operator ∗ denotes quaternion multiplication and q̄ the conjugate of quaternion q. An alternative formulation for (5) has been proposed recently to better account for nonlinearities in orientation space [28]. The quaternion logarithm log : S^3 → R^3 is defined as

log(q) = log(v + u) = arccos(v) u/‖u‖ for u ≠ 0, and [0, 0, 0]^T otherwise,    (7)

where S^3 denotes the unit sphere in R^4. The quaternion logarithm maps the quaternion q to the angular velocity ω that rotates the identity orientation to the orientation q within unit time. Its inverse transformation is the quaternion exponential exp : R^3 → S^3,

exp(ω) = cos(‖ω‖) + sin(‖ω‖) ω/‖ω‖ for ω ≠ 0, and 1 otherwise.    (8)

The nonlinear forcing terms f_p(s) and f_o(s) shape the response of the second-order differential equation system (2)-(6) for any smooth point-to-point trajectory from the initial position p_0 and orientation q_0 to the final position g_p ∈ R^3 and orientation g_o ∈ R^4. They are represented by a sum of N radial basis functions Ψ (RBFs)

f_p(s) = W_p^T x(s),    (9)
f_o(s) = W_o^T x(s),    (10)
x(s) = [Ψ_1(s), ..., Ψ_N(s)]^T s / Σ_{j=1}^N Ψ_j(s),    (11)
Ψ_j(s) = exp( −h_j (s − c_j)^2 ).    (12)

The matrices W_p, W_o ∈ R^{N×3}, W = [W_p, W_o], contain free parameters with column vectors w_{p,i}, w_{o,i} ∈ R^N that determine the shape of the positional and orientational part of the trajectory, respectively. Index i denotes the i-th coordinate axis of the Cartesian position. c_j are the centers of the RBFs and h_j their widths. They are usually selected so that the RBFs are evenly distributed along the trajectory. The temporal scaling function ν(s) determines variations from the demonstrated speed profile and allows specifying non-uniform speed changes along the demonstrated trajectory. Similarly to the forcing terms (9) and (10), it is encoded as a linear combination of RBFs

ν(s) = 1 + Σ_{j=1}^N w_{ν,j} Ψ_j(s) / Σ_{j=1}^N Ψ_j(s),    (13)

where w_ν ∈ R^N is a column vector containing the weights. We initially set w_ν = 0 so that ν(s) = 1 ∀s, meaning that the speed profile remains as demonstrated.
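As an illustration of Eqs. (2)-(6), the following minimal sketch integrates a CDMP with a simple Euler scheme. The function and variable names are ours (not from the paper), and the helpers implement the quaternion operations and normalized RBFs defined above; it is a didactic sketch rather than the implementation used in the experiments.

```python
import numpy as np

def quat_conj(q):
    # Conjugate of a quaternion q = (v, u).
    return np.concatenate(([q[0]], -q[1:]))

def quat_mul(q1, q2):
    # Quaternion product.
    v1, u1 = q1[0], q1[1:]
    v2, u2 = q2[0], q2[1:]
    return np.concatenate(([v1 * v2 - u1 @ u2],
                           v1 * u2 + v2 * u1 + np.cross(u1, u2)))

def quat_log(q):
    # Quaternion logarithm, Eq. (7).
    v, u = np.clip(q[0], -1.0, 1.0), q[1:]
    n = np.linalg.norm(u)
    return np.zeros(3) if n < 1e-12 else np.arccos(v) * u / n

def rbf_vector(s, c, h):
    # Normalized radial basis functions multiplied by the phase, Eqs. (11)-(12).
    psi = np.exp(-h * (s - c) ** 2)
    return psi * s / np.sum(psi)

def cdmp_step(p, z, q, eta, s, W_p, W_o, c, h, g_p, g_o,
              tau, alpha_z, beta_z, alpha_s, nu, dt):
    """One Euler integration step of the CDMP, Eqs. (2)-(6)."""
    x = rbf_vector(s, c, h)
    f_p = W_p.T @ x                                    # Eq. (9)
    f_o = W_o.T @ x                                    # Eq. (10)
    eq = 2.0 * quat_log(quat_mul(g_o, quat_conj(q)))   # orientation error term
    dz = nu * (alpha_z * (beta_z * (g_p - p) - z) + f_p) / tau
    dp = nu * z / tau
    deta = nu * (alpha_z * (beta_z * eq - eta) + f_o) / tau
    dq = 0.5 * nu * quat_mul(np.concatenate(([0.0], eta)), q) / tau
    ds = -nu * alpha_s * s / tau
    q_new = q + dq * dt
    q_new /= np.linalg.norm(q_new)                     # keep the quaternion unit-length
    return p + dp * dt, z + dz * dt, q_new, eta + deta * dt, s + ds * dt
```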
While the goal pose g_p, g_o, the initial pose p_0, q_0 needed for the integration of (3), and the trajectory duration τ can be directly retrieved from the demonstrated trajectory G_d, the weights W_p and W_o need to be calculated by applying standard regression techniques [27]. By replacing the system of two first order equations for positions (2), (3) and orientations (4), (5) with one equation of the second order, respectively, and assuming ν(s) = 1, we obtain

f_p(s(t_k)) = τ^2 p̈_k + α_z τ ṗ_k − α_z β_z (g_p − p_k),    (14)
f_o(s(t_k)) = τ^2 ω̇_k + α_z τ ω_k − α_z β_z 2 log(g_o ∗ q̄_k),    (15)

where p_k, q_k, ṗ_k, ω_k, p̈_k, ω̇_k are the measurements gathered from the demonstrated trajectory (1). Writing

X = [ x(s(t_0)), ..., x(s(t_T)) ]^T,  F_p = [ f_p(s(t_0)), ..., f_p(s(t_T)) ]^T,  F_o = [ f_o(s(t_0)), ..., f_o(s(t_T)) ]^T,    (16)

we obtain the following system of linear equations

X W_p = F_p,  X W_o = F_o.    (17)

In (16), s(t_j) are computed by integrating (6) with the initial condition s(0) = 1 and t_j specified in (1). The sets of weights that solve the linear equations (17) in a least-squares sense can then be calculated as

W_p = X^+ F_p,  W_o = X^+ F_o,    (18)

where X^+ denotes the Moore-Penrose pseudo-inverse of the system matrix X.
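A corresponding sketch of the batch estimation in Eq. (18) is given below. It reuses rbf_vector, quat_mul, quat_conj and quat_log from the previous listing; the data layout (arrays of samples) is our assumption.

```python
import numpy as np

def fit_cdmp_weights(t, p, dp, ddp, q, omega, domega,
                     c, h, tau, alpha_z, beta_z, alpha_s):
    """Batch least-squares estimation of W_p and W_o, Eqs. (14)-(18).

    t: (T,) times; p/dp/ddp/omega/domega: (T, 3); q: (T, 4) unit quaternions.
    """
    g_p, g_o = p[-1], q[-1]
    X, F_p, F_o = [], [], []
    for k in range(len(t)):
        s = np.exp(-alpha_s * t[k] / tau)          # phase from Eq. (6), s(0) = 1
        X.append(rbf_vector(s, c, h))
        # Second-order form of Eqs. (2)-(5) with nu(s) = 1, Eqs. (14)-(15).
        F_p.append(tau**2 * ddp[k] + alpha_z * tau * dp[k]
                   - alpha_z * beta_z * (g_p - p[k]))
        eq = 2.0 * quat_log(quat_mul(g_o, quat_conj(q[k])))
        F_o.append(tau**2 * domega[k] + alpha_z * tau * omega[k]
                   - alpha_z * beta_z * eq)
    X = np.vstack(X)
    W_p = np.linalg.pinv(X) @ np.vstack(F_p)       # Eq. (18)
    W_o = np.linalg.pinv(X) @ np.vstack(F_o)
    return W_p, W_o
```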

Construction of Constant Speed Cartesian DMPs
For incremental kinesthetic guidance, it is advantageous if the trajectory has a constant speed profile; see Section 3 for the explanation of why this is useful. In this section, we provide a construction that transforms the demonstrated trajectory into a Cartesian DMP with a constant speed profile. For this purpose, we first re-sample the trajectory at constant arc length steps. A similar idea was exploited by Gašpar et al. [23] to develop arc length dynamic movement primitives (AL-DMPs), which are advantageous for statistical skill learning. Here, instead of changing the DMP formulation, we replace the originally demonstrated trajectory with a new trajectory with a constant speed profile. We refer to the resulting CDMP trajectory as Constant Speed Cartesian DMP (CS-CDMP). The arc length parameter for a position and orientation trajectory is defined as

l(t) = ∫_0^t ( ‖ṗ(ς)‖ + ζ ‖ω(ς)‖ ) dς.    (19)

The overall arc length of the Cartesian space DMP defined by data (1) is given by L = l(t_T). Note that unlike in [23], l contains, in addition to the standard arc length of the position trajectory, also the angle length of the orientation trajectory. ζ is a positive scalar used to scale the angles, which is necessary because of the different units. To estimate the arc length of a Cartesian space DMP, we integrate the Cartesian space DMP at a high frequency (usually higher than the servo rate of the robot controller) and obtain a new set of discrete measurements {p_k, q_k}. The arc length can then be estimated as

l_k = l_{k−1} + ‖p_k − p_{k−1}‖ + 2 ζ ‖log(q_k ∗ q̄_{k−1})‖,  l_0 = 0.    (20)

We obtain the pairs {t_k, l_k}, k = 1, ..., T′, where l_{T′} = L and T′ > T. Assuming that l_k > l_{k−1}, ∀k, we can interpolate time as a function of arc length, t(l), 0 ≤ l ≤ L, e.g. using cubic spline interpolation.
Next, we select the desired constant arc length sampling step Δl > 0 and an integer T̃ > 0 so that L = T̃ Δl. By using the previously calculated interpolation function t(l), we can compute the times

t_k = t(kΔl),  k = 0, ..., T̃,    (21)

at which the original DMP reaches the constant arc length steps. By integrating the original Cartesian space DMP to the newly calculated times t_k, we obtain a new set of positions and orientations

{ p̃_k, q̃_k },  k = 0, ..., T̃,    (22)

which is sampled at the constant arc length step Δl. The next step is to replace the originally demonstrated speed profile l̇(t) with the constant speed profile l̇_d. This can be achieved everywhere except at both ends of the trajectory, where the velocities and accelerations must be zero to ensure that the robot is at rest at the beginning and the end of the motion. We first determine the arc length L_1 = k̃ Δl at which the robot should reach the desired speed l̇_d. The number of initial steps k̃ and the desired speed l̇_d must be specified by the user. We then specify the arc length parameterization for the initial interval using a fifth order polynomial l_b with the following boundary conditions

l_b(0) = 0,  l̇_b(0) = 0,  l̈_b(0) = 0,  l_b(T_1) = L_1,  l̇_b(T_1) = l̇_d,  l̈_b(T_1) = 0,    (23)

where T_1 is a user provided time in which the acceleration phase of the motion should finish. Similarly, the arc length for the final interval is parametrized with a fifth order polynomial l_e with the boundary conditions

l_e(T_2 − T_1) = L − L_1,  l̇_e(T_2 − T_1) = l̇_d,  l̈_e(T_2 − T_1) = 0,  l_e(T_2) = L,  l̇_e(T_2) = 0,  l̈_e(T_2) = 0,    (24)

where T_2 = 2 T_1 + (L − 2 L_1)/l̇_d denotes the total duration of the new trajectory. The complete arc length is parameterized as

l(t) = l_b(t) for 0 ≤ t ≤ T_1,  l(t) = L_1 + l̇_d (t − T_1) for T_1 < t < T_2 − T_1,  l(t) = l_e(t) for T_2 − T_1 ≤ t ≤ T_2.    (25)

Note that the above parametrization ensures that the speed profile is constant on the interval T_1 ≤ t ≤ T_2 − T_1 and that the robot continuously accelerates and decelerates to the desired speed and acceleration at the beginning and the end of the motion.
In the final step we compute a new Cartesian space DMP with the provided speed profile. For this purpose we need to invert

l(t̃_k) = k Δl,  k = 0, ..., T̃.    (26)

It is trivial to invert (25) for L_1 ≤ kΔl ≤ L − L_1, since on this interval the arc length is a linear function of time. For 0 ≤ kΔl ≤ L_1 and L − L_1 ≤ kΔl ≤ L, we have to invert a fifth order polynomial. Bisection or Newton's method can be used for this purpose, as it is known that the solution lies between 0 ≤ t ≤ T_1 and T_2 − T_1 ≤ t ≤ T_2, respectively. The newly computed times t̃_k are added to (22) to form a new sampling set

G̃_d = { p̃_k, q̃_k, t̃_k },  k = 0, ..., T̃.    (27)

This set and numerically computed derivatives are then used to compute the Cartesian DMP with a constant speed profile (except at both ends). For the sake of clarity, in the rest of this paper we omit the tilde modifier when referring to the CS-CDMP computed from data (27).
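The re-sampling at constant arc length can be sketched as follows. The listing reuses the quaternion helpers from Section 2.1 and, for brevity, interpolates the recorded samples instead of re-integrating the CDMP at the new times, which is what the construction above prescribes; the names and the SciPy interpolation are our choices.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_constant_arc_length(t, p, q, zeta, T_new):
    """Re-sample a pose trajectory at constant arc-length steps, Eqs. (19)-(22).

    t: (T,) times, p: (T, 3) positions, q: (T, 4) unit quaternions,
    zeta: scaling of the orientation contribution, T_new: number of steps.
    """
    # Cumulative combined position/orientation arc length, Eq. (20).
    l = np.zeros(len(t))
    for k in range(1, len(t)):
        d_pos = np.linalg.norm(p[k] - p[k - 1])
        d_ang = 2.0 * np.linalg.norm(quat_log(quat_mul(q[k], quat_conj(q[k - 1]))))
        l[k] = l[k - 1] + d_pos + zeta * d_ang
    L = l[-1]
    # Time as a function of arc length (l is assumed strictly increasing).
    t_of_l = CubicSpline(l, t)
    l_new = np.arange(T_new + 1) * (L / T_new)     # constant step Delta l
    t_new = t_of_l(l_new)                          # Eq. (21)
    # New positions at the constant arc-length samples (linear interpolation
    # here; the paper re-integrates the original CDMP to the times t_new).
    p_new = np.column_stack([np.interp(t_new, t, p[:, i]) for i in range(3)])
    return t_new, p_new, L
```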

Reversible Cartesian Space DMPs
The speed scaling factor ν(s) has an important impact on the trajectory generated by the DMP. A positive ν(s) scales the trajectory velocity in its original direction, setting ν(s) = 0 stops the motion, and a negative ν(s) reverses the trajectory. However, reversing the trajectory in this way causes DMP stability issues. Namely, if we omit the nonlinear term f_p(s) and assume ν(s) = 1, the solution of the linear dynamic equation system (2)-(3) is given by

[ p(t)^T, z(t)^T ]^T = [ g_p^T, 0^T ]^T + e^{tA} c,  A = (1/τ) [ 0  I ; −α_z β_z I  −α_z I ],    (28)

where c is a constant vector specified in such a way that the initial conditions p(s(0)) = p_0, z(s(0)) = 0 are fulfilled. By setting α_z = 4β_z > 0, the system becomes critically damped because the matrix A has six equal negative eigenvalues λ_{1,...,6} = −α_z/(2τ) < 0. Thus the exponential part of the solution vanishes as t → ∞ and the system is stable. On the other hand, if we reverse the trajectory by setting ν(s) = −1, the trajectory is given by

[ p(t)^T, z(t)^T ]^T = [ g_p^T, 0^T ]^T + e^{−tA} c,    (29)

which means that the system is unstable because, unlike e^{tA}, e^{−tA} does not vanish as t → ∞. Consequently, even small disturbances in the DMP (e.g. the DMP integration noise) might result in unstable trajectory generation. It follows that we need to keep ν(s) positive in order to ensure the stability of the system. To avoid instabilities, we create two sets of DMPs, one for the forward direction and the other for the reverse direction, as we proposed in [21]. The parameters of the reverse DMP are computed for the policy π̄(t) = π(τ − t), where π̄ denotes the reverse policy and π the forward policy. In order to ensure smooth switching between the two DMPs at phase s, we need to initialize the internal variables according to the following formulas

p̄ = p,    (30)
q̄ = q,    (31)
z̄ = −z,    (32)
η̄ = −η,    (33)

where the overlined variables belong to the reverse DMP. Note that p and q remain unchanged at the time instance when we switch from the forward to the reverse DMP and vice versa. Iturrate et al. [29] proposed another approach for reversible DMPs, where the linear second-order differential equation system defining the fixed part of the DMP was replaced with a nonlinear logistic differential equation. Such a system is stable in both directions within the initial and goal point, but becomes unstable outside this region. As a result, all points on the trajectory must lie between the initial and goal point, which limits the practical applicability of their approach.

Incremental Refinement of the Control Policy using Kinesthetic Guidance
In this section, we describe techniques applied to modify an existing robot policy incrementally. In doing so, the operator moves the robot at any speed forward and backward along the previously learned trajectory and changes only those parts of the path where changes are necessary. We assume that the desired trajectory has been encoded as a constant-speed, reversible CS-CDMP, as presented in Sections 2.2 and 2.3.
In our framework, we do not change the policy encoded with the CS-CDMP directly. Rather, we learn an offset from the policy. This phase dependent offset, here denoted as d(s) = [d_p(s)^T, d_o(s)^T]^T ∈ R^6, is encoded as a linear combination of RBFs as defined in Eqs. (11)-(12). In this way, we avoid recomputing the forward and reverse CS-CDMP parameters in the iterations of the refining process. Further, we eliminate the influence of the DMP integration noise, which would accumulate during iterative policy improvements.
The pose sent to the robot controller is thus specified as

p_c(s) = p(s) + d_p(s),  q_c(s) = exp( d_o(s)/2 ) ∗ q(s),    (34)

where p(s) and q(s) are obtained from the CS-CDMP integration using (2)-(6). The position and orientation offsets d_p(s) and d_o(s) are calculated as a weighted sum of radial basis functions (RBFs), similarly as in (9) and (10),

d_p(s) = W_p^T x(s),    (35)
d_o(s) = W_o^T x(s).    (36)

Prior to learning, we initialize the RBF weights W_p, W_o to zero, meaning that no offsets are learned yet. For clarity, we refer to the originally demonstrated trajectory encoded with the CS-CDMP and the corresponding offset RBFs as the nominal policy π in the remainder of this paper.

Mechanism for kinesthetic guidance along the policy
To allow the operator to move the robot forward and backward along a policy π in an intuitive way and change the speed of motion (see Fig. 1), we associate the speed scaling factor ν of CS-CDMPs with a force projected onto the tangential direction of the policy.
The tangential direction of the policy is given by

t_p(s) = ṗ_c / ‖ṗ_c‖,    (37)

where ṗ_c = ṗ + ḋ_p are the velocities obtained by integrating the CS-CDMP and the offset RBF at the current phase s. The corresponding direction for the rotational motion is given by

t_o(s) = ω_r / ‖ω_r‖,    (38)

where ω_r = ω + ω_d is the robot's end-effector angular velocity. The CS-CDMP angular velocity ω is calculated by integrating (4)-(6) and taking into account the relation ω = ν(s) η / τ. Based on the tangential directions and the measured forces and torques, a new speed scaling factor is calculated as

ν̄ = sgm( μ_p F · t_p(s) + μ_o M · t_o(s) ),    (39)

where · denotes the dot product and μ_p, μ_o are positive scalars used to scale the velocities of the translational and rotational motion along the desired path. sgm() denotes a sigmoid function. In our experiments, we used the logistic function

sgm(x) = 2 / (1 + e^{−a x}) − 1.    (40)

It ensures a smooth start and stop when pushing along the policy and effectively eliminates jittery robot motion due to noisy input from the force sensor. The constant scalar a was chosen empirically and determines the softness of the transition. Generally, it should be decreased with increased force-torque sensor noise. F ∈ R^3 and M ∈ R^3 denote the Cartesian forces and torques in the robot base coordinate system. Since forces and torques are usually measured at the end-effector, the following transformation is used to express them in the robot base coordinate system

F = R(q) F_m,    (41)
M = R(q) M_m,    (42)

where F_m ∈ R^3 and M_m ∈ R^3 are the measured forces and torques at the end-effector, q is the unit quaternion describing the current orientation of the robot's end-effector, and R(q) is the corresponding rotation matrix. If ν̄ becomes negative, we switch the direction of motion along the desired path specified by the policy π. Instead of following the policy specified by π, we start using the reverse policy π̄ with the initialization provided by Eqs. (30)-(33). Thus we can always use a positive speed scaling factor for the CS-CDMP integration, i.e. ν(s) = |ν̄|. With this procedure, the guiding process becomes natural and intuitive, as the operator simply pushes the robot back and forth in the tangential directions of the path. Namely, when both F · t_p(s) → 0 and M · t_o(s) → 0, ν(s) → 0, which slows down the robot motion and eventually stops it. Using ν(s) in (2)-(6), the robot moves along the policy π(s) in the direction of the applied forces and torques, with a speed proportional to them.
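A short sketch of the guiding mechanism of Eqs. (37)-(40) follows. The symmetric logistic used for sgm() is our assumption about the exact sigmoid form: it is zero at zero input, saturates at ±1, and its slope near zero is set by a.

```python
import numpy as np

def speed_scaling_from_wrench(F, M, t_p, t_o, mu_p, mu_o, a):
    """Speed scaling factor from the guiding force and torque, Eqs. (37)-(40).

    F, M: force/torque in the robot base frame; t_p, t_o: unit tangents of the
    translational and rotational motion at the current phase.
    Returns |nu_bar| and a flag indicating whether to use the forward policy.
    """
    def sgm(x):
        # Symmetric logistic (assumed form): sgm(0) = 0, saturates at +/-1.
        return 2.0 / (1.0 + np.exp(-a * x)) - 1.0

    nu_bar = sgm(mu_p * float(F @ t_p) + mu_o * float(M @ t_o))   # Eq. (39)
    forward = nu_bar >= 0.0    # negative nu_bar -> switch to the reverse CS-CDMP
    return abs(nu_bar), forward
```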
Since the robot motion along the policy is modified based on the measured forces and torques, the robot should be stiff in the tangential direction of the policy. If the robot is not equipped with a force-torque sensor but is compliant and its stiffness is known, we can still implement the proposed approach by computing virtual forces and torques as

F = K_p (p_m − p_c),  M = K_o 2 log(q_m ∗ q̄_c),    (43)

where K_p ∈ R^{3×3} and K_o ∈ R^{3×3} are the position and orientation stiffness matrices, p_m, q_m denote the actual (measured) robot position and orientation, and p_c, q_c denote the commanded position and orientation calculated in (34). The virtual forces and torques can be used in (39) to calculate the speed scaling factor. However, this method requires that the robot is compliant and cannot be applied to many standard industrial robots, which are typically stiff.

Changing the Desired Path
So far, we have proposed a method that allows motion along the policy by kinesthetic guidance. To modify the desired path, the controller must also allow movements that deviate from the desired path. Two cases are considered here: a) the robot is impedance-controlled and it is possible to set the stiffness and damping matrices, and b) it is not possible to adjust the stiffness and damping matrices but the robot is equipped with a force-torque sensor.
In the first case, the robot has to be compliant in the plane attached to the current robot position and orthogonal to the trajectory direction. In the remainder of the paper, we refer to this plane as the refinement plane (see Fig. 2). A suitable choice to define such a plane is given by the Frenet-Serret formulas [30], where the tangential t, normal n, and bi-normal b coordinate axes are determined as

t = ṗ_c / ‖ṗ_c‖,    (44)
b = (ṗ_c × p̈_c) / ‖ṗ_c × p̈_c‖,    (45)
n = b × t.    (46)

The refinement plane is spanned by the normal and bi-normal axes. It is necessary to ensure that the Frenet-Serret (FS) frame does not change abruptly. Since the linear velocity ṗ_c and acceleration p̈_c are obtained from the CS-CDMP and the offset RBF, they are guaranteed to be smooth. Note that the FS frame cannot be calculated when the robot's velocity and/or acceleration are equal to zero. Therefore, we suspend the updating of the rotation matrix R_t = [t, n, b] at low speeds and accelerations until the motion becomes faster again.
To arbitrarily set the compliance along any axis of the FS frame, the stiffness matrices of the impedance controller are calculated as [31]

K_p = R_t K̂_p R_t^T,    (47)
K_o = R_t K̂_o R_t^T,    (48)

where K̂_p and K̂_o are diagonal matrices that define the compliance along the tangential, normal and bi-normal axes of the FS frame. To obtain a critically damped response of the robot, it is also necessary to modify the damping matrices accordingly,

D_p = 2 K_p^{1/2},    (49)
D_o = 2 K_o^{1/2}.    (50)

In this way, the displacement from the nominal trajectory π in the refinement tube is proportional to the applied force projected onto the refinement plane and to the chosen stiffnesses K_p and K_o. The offset d(s) ∈ R^6 from the originally demonstrated path is calculated as

d(s) = [ (p_m − p(s))^T, (2 log(q_m ∗ q̄(s)))^T ]^T,    (51)

where p_m, q_m is the measured robot pose and p(s), q(s) the CS-CDMP pose.
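The refinement-plane gains can be computed as in the following sketch: the Frenet-Serret frame is built from the commanded velocity and acceleration, and a diagonal stiffness is rotated into the base frame as in Eqs. (47)-(48). The critical-damping rule D = 2√K used here is our assumption for Eqs. (49)-(50).

```python
import numpy as np

def refinement_plane_gains(dp_c, ddp_c, k_tangent, k_plane, eps=1e-6):
    """Frenet-Serret frame and rotated stiffness/damping, Eqs. (44)-(50).

    dp_c, ddp_c: commanded velocity and acceleration (smooth, from CS-CDMP + RBF).
    Returns None when the frame is ill-defined (low speed or curvature); in
    that case the previously computed frame should be kept, as described above.
    """
    cross = np.cross(dp_c, ddp_c)
    if np.linalg.norm(dp_c) < eps or np.linalg.norm(cross) < eps:
        return None
    t = dp_c / np.linalg.norm(dp_c)                  # tangent, Eq. (44)
    b = cross / np.linalg.norm(cross)                # bi-normal, Eq. (45)
    n = np.cross(b, t)                               # normal, Eq. (46)
    R_t = np.column_stack((t, n, b))
    K_hat = np.diag([k_tangent, k_plane, k_plane])   # stiff along t, compliant in the plane
    K_p = R_t @ K_hat @ R_t.T                        # Eq. (47)
    D_p = 2.0 * R_t @ np.sqrt(K_hat) @ R_t.T         # critical damping (assumed rule)
    return R_t, K_p, D_p
```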
Incremental learning can also be applied to robots that are not compliant but are equipped with a six-axis force-torque sensor at the end-effector. In this case, we first calculate a virtual displacement vector ξ from the applied forces and torques projected onto the refinement plane as

ξ = K_g [ ((I − t_p t_p^T) F)^T, ((I − t_o t_o^T) M)^T ]^T,    (52)

where K_g ∈ R^{6×6} is a diagonal gain matrix with positive coefficients. In comparison with (34), the virtual displacement vector ξ = [ξ_p^T, ξ_o^T]^T is added to the pose sent to the robot controller,

p_c(s) = p(s) + d_p(s) + ξ_p,  q_c(s) = exp( (d_o(s) + ξ_o)/2 ) ∗ q(s).    (53)

As the robot is commanded to also account for the virtual displacement calculated in (52), we can still sample d(s) using (51). By adjusting the compliance parameters in the refinement plane, we can prevent the robot from deviating too much from the existing trajectory. With increasing displacement from the existing trajectory, the operator feels an increasing force directed towards the reference trajectory. This force prevents the operator from unintentionally modifying the path when guiding the robot along the existing trajectory. Thus, major trajectory changes can be learned gradually over several iterations of moving back and forth and updating the offset trajectory.
Papageorgiou et al. [18] proposed an alternative method where a dedicated controller realizes a penetrable virtual wall around the trajectory. In their approach, the virtual wall pushes the robot towards the reference trajectory, but the robot motion is not constrained once the operator penetrates the virtual wall. In this way, the operator can achieve major trajectory changes even in a single pass. In our framework, a similar behavior can be achieved by setting the robot's stiffness K_p (47) in the normal and bi-normal directions of the refinement plane as a sigmoid-shaped function of the displacement from the nominal path, where positive scalars K_p,1 and K_p,0 define the upper and lower stiffness bounds of the refinement tube, respectively, scalars a and b set the shape of the tube, and K_t sets the desired stiffness in the tangential direction. An example of the refinement tube and the corresponding operator forces is shown in Fig. 3. Note that when updating the compliance gains (47) and (48), the corresponding damping coefficients must also be set according to (49) and (50), respectively.
The procedure described above allows changing the desired path only on the segments where changes are necessary. The set of sampled displacements is used to compute a new displacement RBF. For this, we consider two approaches: batch regression and recursive regression.

Batch Regression Based Policy Update
This method is very similar to the classic trajectory learning by kinesthetic guidance except that here we sample the offsets caused by human guidance while the robot attempts to track the policy π . The resulting deviations from the desired trajectory are influenced by the compliance settings in the normal and binormal direction of the FS frame.
We denote by D the set of offsets (51) sampled along the CS-CDMP with the constant arc length step, just like the set (27) (for the sake of clarity, we omit the tildes),

D = { d_k },  k = 0, ..., T.    (55)

We start collecting new offsets once the robot motion deviates from the commanded trajectory (34) and we update the displacement set (55) whenever we change the direction of motion. Note that the demonstrator can change only a part of the trajectory. Let

D′ = { d′_j }    (56)

denote the new offsets collected from the time when the robot started to deviate from the nominal trajectory until the robot rejoined it. Just like (27), they are sampled with constant arc length steps, which enables us to match the data from D and D′ along the arc length instead of the phase. Next, we determine which offsets from (55) should be replaced with offsets from (56). To detect when the robot has rejoined the nominal trajectory, we monitor the deviation e of the measured robot pose from the nominal trajectory in the directions orthogonal to the tangents t_p,j and t_o,j, calculated at the j-th sample using (37) and (38). Once e is small enough, we stop accumulating corrected offsets, since the actual robot motion has rejoined the nominal trajectory. The proposed approach enables the operator to demonstrate parts of the trajectory multiple times. If a part of the trajectory has not been executed correctly, the operator can change the direction of motion by pushing the robot in the opposite direction along the trajectory. Note that the speed scaling factor ν̄ defined in (39) is positive as long as the user pushes the robot forwards along the trajectory. We switch between the forward and reverse CS-CDMP whenever ν̄ becomes negative. The offset trajectory encoded with RBFs can be reversed with a negative speed scaling factor ν̄, which does not cause any stability problems because no integration is involved.
Pseudo code for BR-based incremental learning is given as Algorithm 1. In this algorithm, the function join(·) denotes the insertion of the sampled set of offsets D′ into D, as described above.
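A stripped-down sketch of the BR update is shown below: newly recorded offsets replace the matching arc-length segment of the stored offset set, and the offset RBF weights are re-fitted in batch, analogously to Eq. (18). Index handling and the matching by arc length are simplified here; rbf_vector is the helper from Section 2.1, and all names are our own.

```python
import numpy as np

def join(D, D_new, k_start):
    """Replace a segment of the stored offsets D (sampled at constant arc
    length) with the newly recorded offsets D_new, starting at index k_start.
    This mirrors the role of join(.) in Algorithm 1; the arc-length matching
    is assumed to have been done by the caller."""
    D = D.copy()
    k_end = min(k_start + len(D_new), len(D))
    D[k_start:k_end] = D_new[:k_end - k_start]
    return D

def fit_offset_weights(D, s_samples, c, h):
    """Batch re-fit of the offset RBF weights W in R^{N x 6} from the offset
    set D, using the same least-squares solution as Eq. (18)."""
    X = np.vstack([rbf_vector(s, c, h) for s in s_samples])
    return np.linalg.pinv(X) @ D
```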
The sampling procedure presented above differs from the approach in [20], where we used the phase to match the poses on the existing trajectory with the current robot poses. Due to pushing along the trajectory, a tracking error can occur, which causes the phase of the reference trajectory to become misaligned with the current robot motion. Consequently, the sampled poses might be inserted into G_d with wrong indices, which can significantly corrupt the modified trajectory, as illustrated in Fig. 5. In this paper, we solved this problem by sampling the offsets at a constant arc length and matching the poses along the arc length. Papageorgiou et al. [18] also addressed this issue by finding the phase where the existing trajectory is closest to the current robot pose. However, unlike our approach, their procedure does not guarantee that the points are equidistant.

Recursive Regression Based Policy Update
Instead of updating the entire offset RBF at each change of direction, we can concurrently modify the weights of the offset RBF, W_i = [W_{p,i} W_{o,i}] ∈ R^{N×6}, using the recursive regression formulas

P_i = (1/λ) ( P_{i−1} − P_{i−1} x(s_j)^T x(s_j) P_{i−1} / ( λ + x(s_j) P_{i−1} x(s_j)^T ) ),    (58)
W_i = W_{i−1} + P_i x(s_j)^T e(s_j),    (59)

where i is the iteration index of the recursive regression, P_i ∈ R^{N×N} is the covariance matrix of the parameters W_i, x(s_j) ∈ R^{1×N} is a vector of Gaussian kernel functions (11), N is the number of kernel functions, and λ is the forgetting factor, which is usually set to a value close to, but not larger than, 1. s_j, s_{j−1} denote the phase at the current and previous iteration step, respectively. e(s_j) denotes the displacement between the commanded and measured robot pose at phase s_j. Here, it is crucial to estimate the phase s_j that corresponds to the measured current robot pose. Due to the robot compliance, this phase differs from the CS-CDMP and RBF phase (6) used to calculate the desired robot pose. It is calculated by projecting the measured robot pose onto the existing robot trajectory, i.e. by solving the optimization problem

s_j = arg min_{s_k} ‖ p_m − p_c(s_k) ‖,    (60)

where s_k = s_i e^{−α_s k δt/τ}, k = 0, ..., h, s_i is the current trajectory phase, δt is a suitably chosen time interval, and h denotes the search horizon. After solving this optimization problem, we calculate the displacement vector as

e(s_j) = [ (p_m − p_c(s_j))^T, (2 log(q_m ∗ q̄_c(s_j)))^T ].    (61)
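The recursive update (58)-(59), together with the covariance reset described below, can be sketched as follows. The shapes follow the text: W ∈ R^{N×6}, P ∈ R^{N×N}, x(s_j) as a 1×N row, and e(s_j) as a 1×6 row.

```python
import numpy as np

def rls_update(W, P, x, e, lam):
    """One recursive-regression step for the offset weights, Eqs. (58)-(59).

    W: (N, 6) weights, P: (N, N) covariance, x: (1, N) kernel vector at the
    estimated phase s_j, e: (1, 6) displacement between measured and commanded
    pose, lam: forgetting factor (close to, but not larger than, 1).
    """
    denom = lam + float(x @ P @ x.T)
    P = (P - (P @ x.T @ x @ P) / denom) / lam      # Eq. (58)
    W = W + P @ x.T @ e                            # Eq. (59)
    return W, P

def reset_covariance(N, gamma=100.0):
    # Reset P = gamma * I whenever the guiding direction (sign of nu_bar) changes.
    return gamma * np.eye(N)
```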
With a smaller λ, the previous measurements are forgotten more quickly; in other words, more recent samples have a greater impact on the estimated trajectory shape. On the other hand, with λ = 1 the covariance matrix P_i becomes monotonically decreasing and the recent trajectory updates have less and less influence on the estimated forcing term weights. To prevent this, it is necessary to reset the covariance matrix to its default value, P_i = γ I, γ > 0, whenever the factor ν̄ changes its sign. γ is usually set to a large value, e.g. 100.

Fig. 6 The dashed black curve shows the updated path and demonstrates how RR predicts beyond the demonstrated part (green). The operator pushes the robot in the direction of the gray arrow. The red arrow indicates the force needed to displace the robot from the existing trajectory.
When changing the reference trajectory by recursive regression, every new displacement vector e immediately affects the shape of the reference trajectory (see Fig. 6). The dynamics of change is determined by the equations of the recursive regression (58) and (59), the offset calculation using RBFs (35) and (36), and the robot dynamics. In our previous work [32] we showed that this dynamics can be approximated with a second-order filter. The overall learning process can therefore be described as a feedback loop through a second-order filter, as shown in Fig. 7.

Fig. 7 The dynamics of learning is governed by the robot system dynamics and the recursive regression (RR) dynamics. Both subsystems can be approximated by a linear second-order system.
This filter is a low-pass filter whose parameters depend on the RBF parameters. Thus, it suppresses only fast movements of the operator during trajectory refinement. In contrast, the system becomes fully compliant at slow movements.
Note that with recursive regression it is not possible to implement a penetrable virtual wall because the reference trajectory is constantly being adjusted. Instead, we can apply the sigmoid function (40) to the displacement vector e(s_j), which makes the system stiff at low forces applied in the refinement plane. In this way we prevent unintentional changes of the path. Pseudo code for RR-based incremental learning is given as Algorithm 2.

Speed Profile Learning
The speed profile is demonstrated by pushing the robot along the trajectory with the desired speed. Using either the measured forces and torques or their virtual counterparts as defined by (43), ν can be computed according to (39) and sampled along with the corresponding phase. The weights w_ν of the temporal scaling function defined in (13) are computed using regression techniques once the demonstration stops. This step can be repeated multiple times until the desired speed profile has been obtained. Usually, speed profile learning is the last phase of incremental learning, executed after the spatial part of the policy has been refined.

Stability of Learning
In the implementation of incremental learning, it is necessary to consider the closed-loop stability of the overall control scheme and how the choice of free parameters influences the behavior of the learning system. We consider four different cases, which differ according to whether the robot is compliant in the normal and bi-normal directions of the FS frame (impedance control) or is admittance-controlled with the virtual displacement vector computed by (52), and whether the parameters of the offset RBF are computed using batch regression (BR) or recursive regression (RR). These four cases are shown in Fig. 8. For the sake of simplicity, we limit our stability analysis to one positional degree of freedom. In this case, the decoupled robot dynamics is described by a second-order system [33].
Let us first assume that the robot is compliant, i.e. impedance-controlled. The block scheme for the compliant robot and BR (Fig. 8a) shows that this is an open-loop control scheme where the dynamics is determined by the dynamics of the robot. F_e denotes the external force with which the operator guides the robot. The control scheme for the compliant robot and RR (Fig. 8b) is similar; however, here the RR is inside the feedback loop together with the robot dynamics. As previously mentioned in Section 3.4, the RR dynamics can be approximated with a low-pass second-order filter [32]. Consequently, the overall dynamics can be approximated with a stable fourth-order linear system. In both arrangements, F_e changes the robot position p_m but does not affect the stability. The external force F_e changes the robot motion also in the block diagrams of Fig. 8c and 8d, but here the robot is admittance-controlled. As long as the robot is not in contact with the environment, the admittance control is in an open loop and does not affect the stability of the system. However, the contact force F_c arises once the robot establishes contact with the environment. The admittance control then forms feedback through the environment stiffness matrix K_s, as shown in the block diagrams of Fig. 8c and 8d. In these diagrams, vector p_o denotes the environment contact position. Thus, the overall system is non-linear, where the saturation block and the stiffness matrix model the contact forces F_c. Next, we analyze the behavior of the latter two nonlinear systems in simulation. The left side of Fig. 9a shows the force and position of one positional degree of freedom for the BR admittance setup with the following parameters: external force F_e = −10 N, environment stiffness K_s = 2000 N/m, robot control gain K_p = 1000 N/m, admittance gain K_g = 0.002 m/N, and control sampling frequency 100 Hz. The force plot shows that the robot starts oscillating with constant frequency and amplitude after touching the environment. The phase plot on the right side of Fig. 9a shows that the system exhibits a stable limit cycle. If we increase the sampling frequency to 300 Hz, the limit cycle vanishes, as evident from Fig. 9b. If we also increase the admittance gain to K_g = 0.005 m/N, the system starts oscillating, but the oscillations are damped and the system does not enter the limit cycle, as evident from Fig. 9c. These simulation experiments demonstrate that a high sampling frequency is necessary for proper operation of the admittance BR setup.
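The qualitative behaviour discussed above can be reproduced with a simplified 1-DOF simulation of the BR admittance loop (Fig. 8c). The effective robot mass and damping and the environment position below are our assumptions; only F_e, K_s, K_p, K_g and the sampling frequency are taken from the text.

```python
import numpy as np

def simulate_admittance_br(f_ctrl=100.0, K_g=0.002, T_end=3.0):
    """Simplified 1-DOF simulation of the BR admittance scheme (Fig. 8c)."""
    m, d = 5.0, 50.0                   # assumed effective robot mass and damping
    K_p, K_s = 1000.0, 2000.0          # robot control gain, environment stiffness
    F_e, p_env = -10.0, -0.02          # guiding force and assumed contact position
    dt = 1e-4                          # fine integration step
    dt_ctrl = 1.0 / f_ctrl             # admittance update period
    p = dp = p_ref = 0.0
    t_next = 0.0
    log = []
    for k in range(int(T_end / dt)):
        t = k * dt
        F_c = -K_s * (p - p_env) if p < p_env else 0.0   # saturated contact force
        if t >= t_next:                                  # sampled admittance update
            p_ref = K_g * (F_e + F_c)
            t_next += dt_ctrl
        ddp = (K_p * (p_ref - p) - d * dp + F_c) / m     # decoupled robot dynamics
        dp += ddp * dt
        p += dp * dt
        log.append((t, p, F_e + F_c))
    return np.array(log)

# Plotting p and F_e + F_c for different f_ctrl and K_g shows how the contact
# oscillation depends on the sampling frequency and the admittance gain.
```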
A similar behavior can be noticed in the RR-based admittance scheme shown in Fig. 8d, except that the RR transfer function additionally damps the system. For this reason, the RR-based admittance scheme does not enter a stable limit cycle at a sampling frequency of 100 Hz and admittance gain K_g = 0.002 m/N, as seen in Fig. 9d. However, if we further increase the admittance gain to K_g = 0.005 m/N, we can notice strong vibrations in a stable limit cycle, as seen in Fig. 9e.
Given the above analysis, we can conclude that a high sampling frequency (500 Hz or more) is required for incremental learning with admittance control. Lowering admittance gain K g stabilizes the system, but it also makes it less responsive, i.e., higher forces are needed to modify the robot trajectory. The RR-based algorithm is generally more resistant to the vibration caused by the limit cycle. On the other hand, the vibrations are more pronounced at excessive admittance gain K g when the RR-based algorithm is applied.

Experimental Verification
We implemented and evaluated the proposed framework on two collaborative robots: Franka Emika Panda and Universal robot UR10.
Implementation on Franka Emika Panda. On the 7 DOF collaborative robot arm Franka Emika Panda, the control algorithm was implemented as a ros_control plug-in in C++ using the libfranka library, while the LfD framework was implemented in Matlab as a ROS node. Since the Panda robot supports variable compliance around any axis, we implemented the RR- and BR-based algorithms in impedance mode. In the case of the BR-based scheme, we set high positional stiffness (3000 N/m) in the tangential direction and lower stiffness (500 N/m) in the normal and bi-normal directions of the FS frame. Accordingly, the stiffnesses of the rotational axes were set high (90 Nm/rad) for rotation around the tangent and low (10 Nm/rad) for the other rotations. For the RR-based scheme, we set high stiffness in all directions.

Implementation on Universal Robot UR10. On the 6 DOF Universal Robot UR10, the control algorithm was implemented in Simulink using the xPC Target platform, while the LfD framework was implemented in Python as a ROS node [2]. A wrist-mounted ATI Delta force-torque sensor was used in this setup to implement admittance-based incremental learning. To preserve stability during the assembly operation that involves physical contact with the environment, it was necessary to select rather low admittance gains K_g (0.00005 m/N). However, since low admittance gains result in sluggish responses to the guiding force F_e, we provided the user with the ability to increase K_g during free-space motion with a button at the robot wrist. In both BR and RR modes, the robot was stiff and the virtual displacement vector was calculated using Eq. (52).

Fig. 9 Simulation results for admittance control and different incremental learning algorithms with varying sampling frequency f and admittance gain K_g. Time plots of position p and overall force F_c + F_e are shown on the left. In the right column the corresponding phase portraits are given.

Refinement of a policy for peg-in-hole insertion task
The algorithms were experimentally verified by refining a policy for the peg-in-hole (PiH) insertion task. In this task, unavoidable positioning errors and tight tolerances between the objects require adaptation of the desired policies [24]. The existing policy resulted in a failure during insertion. Therefore, the insertion path was adapted using the BR-based algorithm described in Section 3.3 and the RR-based algorithm described in Section 3.4.
A snapshot of the adaptation process for the PiH task using the Panda robot in BR-impedance mode is shown in Fig. 10, while Fig. 11 shows the original and refined positional part of the policy. Here, the policy had to be shifted due to a positional displacement. The operator guided the robot to the start and demonstrated the required offsets in one pass. After the adaptation, the robot was able to perform the peg insertion successfully. Figures 12 and 13 show a snapshot of the adaptation process for the PiH task using the UR10 robot in RR-admittance mode and the original and refined policy, respectively. As this system is less responsive, higher forces are needed to modify the robot trajectory compared to the impedance-based control scheme. In this case, the robot was able to successfully perform the task after three refinement passes.

User Study
We assessed the performance of the proposed algorithms by conducting a user study in which the subjects taught the robot a specific movement needed in shoe sole grinding. We chose this task as it requires many modifications depending on the shape and size of the shoe. Besides, it requires a precise speed profile, which is conditioned by the grinding technology. Similar requirements arise in tasks such as glue application, arc welding, and polishing. The evaluation was performed on a Franka Emika Panda robot using the impedance-based scheme and the same setup as described in Section 4.2. During the experiment, the participants stood next to a small table with a mounted shoe model, as depicted in Fig. 14, and were instructed to move the robot in such a way that the tip of the stick moves along the edge of the shoe sole from a start to a goal point, both of which were marked visually on the shoe. There were three experimental conditions: classical kinesthetic guidance (KG), BR-based incremental learning, and RR-based incremental learning. The sequence of conditions was randomized. Each condition consisted of an initial familiarization phase and a subsequent test phase. For the two variants of incremental learning, we initialized the adaptation process with a trajectory that corresponds to another shoe size. For classical KG, there is no initial trajectory. In the classical KG condition, the subjects had to demonstrate the movement in one pass and were allowed to repeat the procedure until they considered the learned policy good enough or felt that they could not improve it further. The first pass counted as the familiarization phase and was not considered in the statistical evaluation. In both incremental learning conditions, the subjects got acquainted with the specific operation mode in the familiarization phase. They received verbal instructions on how to move along an existing path and how to apply corrections by pushing or pulling the robot. In the test phase, the subjects performed just one learning attempt but were able to incrementally correct any deviations of the learned path. Finally, they also demonstrated the speed profile. They were verbally instructed that the movement should take around 10 seconds. After each experiment, we collected a subjective assessment of the overall user experience (ease of use, kinesthetic feeling) of the learning procedure on a 1-5 rating scale.
A snapshot of the BR-based refinement process of the shoe grinding policy and a plot showing one representative operator's movement while adapting the spatial part of the policy are shown in Figs. 14 and 15a, respectively. Fig. 15b shows the corresponding intermediate policies throughout the iterations of the BR-based refinement process by this operator. An update occurs when the operator reverses the direction of teaching. Most participants needed 4-5 back-and-forth iterations to correct the policy. Note that there are no update iterations for the RR-based process, as the policy is updated continuously.
Nineteen healthy subjects participated in the study (13 male, 6 female, age: 34.63 ± 14.59 years). Prior to their participation, the subjects were informed about the experimental procedure, potential risks, the aim of the study and gave their written informed consent in accordance with the code for ethical conduct in research at Jožef Stefan Institute. This study was approved by the National Medical Ethics Committee (No. 0120-228/2020-3).
To evaluate the time-efficiency of the proposed methods, we compared the total learning time elapsed in the test phase. To evaluate the quality of learning, we calculated the deviation of both the spatial and temporal course of the learned policy from the reference ground-truth trajectory, which was carefully captured using a passive mechanical digitizer (MicroScribe G2LX6). The position and velocity errors measure the local accuracy of the trajectory and were calculated using the evaluation metrics proposed in [34].
We investigated the effects with a one-way repeated measures ANOVA. The level of statistical significance was 0.05 for all tests. The learning method had a significant effect on position error [F(2,18)]. The plots in Fig. 16 show the means and standard errors (SEM) of the different measures for the three learning methods.
Based on the subjects' previous experience with operating robots using kinesthetic teaching, we split the subjects into two groups. However, a separate two-way ANOVA shows no difference between the groups of expert and novice users for any of the aspects [0.1 < F(1,56)]. The main result of the user study confirms that incremental policy learning by kinesthetic guidance can effectively tackle the limitations of classical kinesthetic guidance. As the user can only control a limited number of joints simultaneously during kinesthetic guidance, it is difficult to ensure the accuracy of the demonstrations [10]. Constraining the robot's movement along the nominal trajectory and allowing gradual refinements not only improves the accuracy, but also shortens the learning time, especially for complex policies. The statistically non-significant differences between BR- and RR-based learning in the user study can be justified by the moderate changes required to adapt to the new shoe sole, as BR- and RR-based learning behave similarly when changes are applied incrementally. Further, the lack of a statistical difference between the groups of novice and expert users can be attributed to the intuitiveness of the kinesthetic teaching paradigm and to the familiarization phase.

Fig. 16 Plots show four quality and efficiency measures (position error, velocity error, learning time, and user experience rating) for all participants when using classical kinesthetic guidance (KG), the BR-based or the RR-based method for policy refinement. Stars denote pairs with statistically significant differences at p < 0.05. Diamonds show pairs without statistically significant differences.

Conclusion
In this paper, we presented a framework for incremental adaptation of complex robot trajectories using kinesthetic guidance. It is based on speed-scaled, constant arc-length, reversible Cartesian space DMPs that enable integration forward and backward along the trajectory. The update mechanism is based on two techniques. Batch regression temporarily stores modified trajectory segments and updates the policy parameters when the direction of kinesthetic teaching is reversed. In contrast, the recursive regression technique updates the policy parameters directly within the control loop, which avoids storing the modified trajectory.
There are several novelties in the presented framework. Instead of modifying the original DMP policy in the learning iterations, we learn offsets to the originally demonstrated policy and encode them separately using RBFs. This improves the long-term stability of the learning algorithm and diminishes its computational burden. Next, we introduced Constant Speed Cartesian DMPs (CS-CDMPs), which encode the spatial component of the task with a constant speed profile. This allows a proper decoupling of the spatial and temporal parts of the policy, which contributes to both more robust and more efficient learning. Among other improvements, we enhanced the kinesthetic feeling during learning using soft activation functions and improved the encoding accuracy with a new phase detection algorithm.
The proposed approach can be applied to both impedance- and admittance-controlled robots, which was experimentally verified on two robot platforms. The stability analysis revealed that a high sampling frequency and low gains are required for incremental learning with admittance control. Therefore, higher forces are needed to modify the robot trajectory compared to the impedance-based control scheme. In practice this means that more subsequent passes are needed to modify the existing trajectory. We performed a user study to evaluate the performance of the proposed incremental learning algorithms and compared them with classical kinesthetic guidance. The user study confirmed that the concept of incremental updates is more effective and more user-friendly than classical kinesthetic guidance for both experienced and novice operators. It also showed that the differences in performance and user experience between the BR-based and RR-based algorithms are small. In our opinion, the BR-based algorithm is still favorable, since it allows implementing the concept of a penetrable virtual wall and permits correcting the trajectory with larger deviations from the existing trajectory in a single step. On the other hand, the RR algorithm is less complex and easier to implement.
The proposed incremental learning can be used for learning many robotic tasks that involve contact with the environment. There are also some exceptions where our approach cannot be directly applied, e.g. when we need to modify the trajectory in the tangential direction. Changing the depth of peg insertion is such a case. Extensions that allow coping with such issues remain for our future work.

Funding The research leading to these results has received funding from the Horizon 2020 RIA Programme grant 820767 CoLLaboratE and from the program group P2-0076 Automation, robotics, and biocybernetics funded by the Slovenian Research Agency.

Materials Availability
The data that support the findings of this study are available from the corresponding author, B.N., upon reasonable request.

Declarations
Competing interests The authors have no conflict of interest to declare.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.