Introduction

In recent years, there has been growing interest in Human-Robot Interaction (HRI) due to the increasing use of robots not only in industry but also in other areas such as schools [1], homes [2], hospitals [3], and rehabilitation centres [4]. Service robotics is one area of robotics where robots have shown great promise in working near humans. Intelligent robotic agents have been deployed in hospitals [5], in domestic environments [6], and in retirement homes [7]. The presence of a robot is, in fact, a useful support in the management of daily activities [8, 9], the promotion of social inclusion [10], and the suggestion of healthy activities [11, 12]. An easy and continuous connection with other people (i.e. relatives, friends, or doctors) could promote the social inclusion of people with disabilities or elderly people and increase their quality of life [13]. Consequently, in the future, robots will concretely share environments with human beings and actively collaborate with them in specific daily tasks.

One such daily task is the handling of clothes, ranging from washing, pressing, and folding them to placing them in their designated places such as cupboards and shelves. Getting dressed is another daily task involving the handling of clothes. While accomplishing these tasks may seem effortless for the young and able-bodied, it is undoubtedly a cumbersome activity for the elderly and the disabled and thus demands assistance. Increased life expectancy, owing to the availability of better healthcare, coupled with falling fertility levels [14], has accelerated population ageing. This has directly resulted in a shortage of caregivers and therapists [15]. This shortage has prompted researchers in the field of robotics to explore new avenues and ways of letting robots take over, fully or partially, some of the assistive tasks involving cloth manipulation.

These manipulation tasks, as trivial as they may seem for humans, are extremely challenging for robots to accomplish. The challenges stem from the intrinsic deformability of clothes, which allows them, in theory, to assume an infinite number of states, each varying in appearance. The challenge is aggravated by the unpredictability of the outcome of a specific action on a piece of cloth. Thus, tracking the cloth state becomes an expensive operation once a manipulation action is executed. On top of that, effective trajectory planning and control strategies are needed to execute such manipulation in a closed loop, which remains a challenge too. In fact, an intelligent robotic agent requires a perfect synergy between state perception and control execution to ensure successful completion.

While the state estimation problem has been thoroughly studied over recent years, there still exists a gap in identifying ideal closed-loop control strategies for different cloth manipulation scenarios. Earlier work on state perception started off by employing sophisticated motion detection systems, which soon gave way to the use of inexpensive depth cameras and principles of computer vision to infer the cloth state. Many works also attempted to model the deformation of clothes to predict their future state. A comprehensive review outlining these methods was carried out by [16]. There have been works aimed at modelling manipulator movements to achieve cloth manipulation tasks such as folding or assisted dressing, but they have proved insufficient in adapting to the complex nature of the required movements, mainly due to their limited generalization abilities and the high computational costs of working in real-time scenarios. Consequently, researchers have turned to making robots learn these movements as opposed to modelling them.

Some notable works have employed the Reinforcement Learning paradigm to make robots learn the required trajectories based on trial and error, interacting with the environment (clothes) while others have tried to exploit human demonstrations in a Learning from Demonstration paradigm to impart human knowledge to robotic agents. There exists, however, a pressing need to analyse where the field currently stands. Concretely, the following questions need to be addressed:

  1. What is the current state of the art for learning-based approaches in cloth manipulation?

  2. How effective are these approaches in overcoming the limitations of traditional model-based approaches?

  3. What are the current challenges and future directions of research in this area?

To answer these questions, we have carried out this comprehensive review. Our work analyses the learning-based strategies, keeping in mind the above-mentioned questions, and groups them as follows:

  • Supervised learning (SL): is the task of learning a function that maps an input to an output based on example input–output pairs [17, 18].

  • Learning from demonstration (LfD): is to transfer new skills to a machine based on human demonstrations [19, 20].

  • Reinforcement learning (RL): is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Its focus is finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) [21, 22].
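As a minimal illustration of the exploration-exploitation balance mentioned in the RL definition above, the following Python sketch shows an ε-greedy action selector, one of the simplest such mechanisms; it is a generic textbook example of ours, not drawn from any of the surveyed works:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Balance exploration and exploitation: with probability epsilon,
    try a random action (exploration of uncharted territory); otherwise
    pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```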

The rest of the article is organized as follows: in Sect. 2, the research methodology for the comprehensive review is explained. In Sects. 3 and 4 the results and the discussions regarding the papers are shown. In Sect. 5, a summary of the review and its conclusions are presented.

Methodology: search strategy and selection criteria

This paper reviews empirical studies published between 2009 and 2019, since most of the advances in this area have occurred within that timeframe. A bibliography was developed based on searches in the Scopus, Web of Science, and Google Scholar electronic databases.

The searched terms and the combinations used are: ({dress AND robot} OR {garment AND robot} OR {dress AND manipulation} OR {garment AND manipulation} OR {dress AND grasping} OR {garment AND grasping}) AND D, where

$$D \in \left\{\text{Reinforcement learning},\ \text{Learning by demonstration},\ \text{Supervised learning}\right\}$$

For Google Scholar, we took into consideration the first ten pages of results. Reference lists of the included articles and of significant review papers were examined by the authors to include other relevant studies. After the deletion of duplicates and out-of-context papers (i.e. papers not related to robots), we identified the articles deserving a full review. Additionally, other articles were excluded (e.g. those not written in English), and a total of 76 works was selected at this stage. Then, a full-text assessment was carried out, and the final list includes 39 studies, since papers not related to control strategies were discarded; the selection process is shown in Fig. 1:

Fig. 1 Selection process of relevant papers

Results

Application overview

Interest in service robots involved in dressing tasks has grown, and we collected the papers concerning dressing, dividing them according to the task the robot performs. In particular, of the selected papers, 8 (20.51%) were published before 2015 and 31 (79.49%) were published within the past five years (Table 1).

Table 1 List of papers

In Fig. 2, the robot-assisted dressing process is described. The first step is cloth detection, followed by cloth classification and manipulation planning. The human position is then tracked to prevent the robot from hurting the patient during the dressing task. Finally, the tasks already accomplished by robots are shown in green, while the ones that should be investigated in the future are crossed out in red.

Fig. 2 Robot-assisted dressing process

Fig. 3 a Robot folding clothes using an SL strategy [23]

The tasks accomplished by robots in several papers include putting a t-shirt or a jacket on the user's arm, putting a t-shirt over a person's head, putting a shoe on the user's feet, and putting on trousers. Tasks that should be accomplished in the future are, for example, folding a complex shape or ironing a garment.

Cloth folding/untangling/covering

In this section, the papers concerning cloth folding, untangling, or covering are evaluated and divided into subsections according to the control approach applied.

Supervised learning

In SL, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). An SL algorithm analyses the training data and produces an inferred function, which can be used for mapping new examples [17]. Bersch et al. [24] used a DL approach with a PR2 robot for cloth manipulation, specifically for laundry folding. The quality of each grasp pose was evaluated using a function that calculates a score based on a set of geometric features, and the score function was automatically trained using an SVM. Another SL strategy was developed by Lui et al. [25], who used a learning algorithm based on max-margin learning to manipulate deformable objects such as ropes with a PR2. Starting from a point cloud obtained from an RGB-D camera, the authors designed appropriate features that their learning algorithm uses to first infer the rope's knot structure and then choose an appropriate manipulation action to untangle the knots. Concerning humanoid robots, Yang et al. [23] also used DL to let a humanoid robot acquire folding skills. Their approach provided a real-time user interface with a monitor and a first-person perspective through a head-mounted display. Through this interface, teleoperation was used for collecting task operating data, especially for tasks that are difficult to program with conventional methods. A DL model was also utilized in the proposed approach: a deep convolutional autoencoder extracted image features and reconstructed images, and a fully connected deep time-delay neural network learnt the dynamics of the robot task process from the extracted image features and motion angle signals (Fig. 3).
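As a rough illustration of the SVM-scored grasp selection described for [24], the following Python sketch trains a classifier on synthetic geometric features and uses its estimated success probability as a grasp score; the features, data, and sizes are placeholders of our own, not those of the original work:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic geometric features for candidate grasp poses
# (e.g. distance to cloth border, local curvature, height above table).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = grasp succeeded

svm = SVC(kernel="rbf", probability=True).fit(X, y)

def grasp_score(features):
    """Score a candidate grasp pose by the SVM's success probability."""
    return svm.predict_proba(features.reshape(1, -1))[0, 1]

candidates = rng.normal(size=(10, 3))
best = max(candidates, key=grasp_score)  # pick the highest-scoring pose
```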

Tanaka et al. [14] used a particular NN called the Encode-Manipulate-Decode Network (EMD Net) for cloth folding. The EMD Net is essentially a 3D convolutional auto-encoder (providing the encoder and decoder modules) with a fully connected network (the manipulation module) inserted at the bottleneck layer.
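The following PyTorch sketch shows the general shape of such an encode-manipulate-decode architecture; all layer sizes, the voxel resolution, and the action dimensionality are illustrative assumptions, not the actual configuration of [14]:

```python
import torch
import torch.nn as nn

class EMDNet(nn.Module):
    """Sketch of an Encode-Manipulate-Decode network: a 3D convolutional
    auto-encoder with a fully connected 'manipulation' module inserted
    at the bottleneck. Sizes are illustrative, not those of [14]."""

    def __init__(self, action_dim=6, latent=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8 * 8, latent), nn.ReLU(),
        )
        # Manipulation module: predicts the latent state after an action.
        self.manipulate = nn.Sequential(
            nn.Linear(latent + action_dim, latent), nn.ReLU(),
            nn.Linear(latent, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 32 * 8 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8, 8)),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, voxels, action):
        z = self.encoder(voxels)                     # encode cloth state
        z_next = self.manipulate(torch.cat([z, action], dim=1))
        return self.decoder(z_next)                  # predicted next state

# A 32^3 voxel grid of the cloth and one 6-D pick-and-place action.
pred = EMDNet()(torch.rand(1, 1, 32, 32, 32), torch.rand(1, 6))
```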

Furthermore, Hu et al. [26] studied limitations of the user's movements (modelled with the Gaussian Process Latent Variable Model) and related them to the online update of the dressing trajectory. The authors validated their idea by letting the robot fold towels.

A different approach was used by Jia et al. [27], who adopted a random forest approach. They used imitation data, consisting of visual features and control signals, to learn a random forest controller that maps the visual features observed by an RGB-D camera to optimal control signals of a robotic manipulator. The controller parameters are learnt in two steps: online dataset sampling and controller optimization. The dataset is generated from an expert (a ground-truth hard-coded control algorithm in their case, but it can also be a human) performing the manipulation task; RGB-D images are collected from a camera and then transformed into a low-dimensional feature space by computing HOW features [27]. The random forest imitation learning controller parameters are learnt in an online fashion: a set of cloth simulation trajectories is first generated, and during each time step of these trajectories the expert is queried for an optimal control action. This action is combined with the action proposed by the random forest controller and fed into the simulator to generate a new observation. The process is repeated until imitation learning has converged to an optimal solution. The authors validated their approach by folding towels.
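A minimal sketch of this online dataset-aggregation loop, with a scikit-learn random forest standing in for the learned controller; `sim`, `expert`, and `how_features` are hypothetical stand-ins for the cloth simulator, ground-truth controller, and feature extractor of [27]:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def learn_rf_controller(sim, expert, horizon=50, iters=10):
    """Roll out the current controller, query the expert at every step,
    and retrain on the growing dataset, so that states visited by the
    learner appear in the training distribution."""
    X, y = [], []
    rf = None
    for _ in range(iters):
        obs = sim.reset()
        for _ in range(horizon):
            feat = sim.how_features(obs)   # HOW visual features
            a_expert = expert(obs)         # expert's optimal action
            X.append(feat)
            y.append(a_expert)
            # Mix the expert action with the learner's proposal before
            # stepping the simulator, as described above.
            a = a_expert if rf is None else 0.5 * (a_expert + rf.predict([feat])[0])
            obs = sim.step(a)
        rf = RandomForestRegressor(n_estimators=100).fit(np.array(X), np.array(y))
    return rf
```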

Finally, Corona et al. [28] used a hierarchy of three CNNs with different levels of specialization to grasp a garment and fold it using a WAM robot. First, one robot arm grasps the garment from any point and shows it to an RGB-D camera, and the cloth is recognized using the first CNN. Then, the visibility and locations of two reference grasping points are identified using the second CNN. Finally, the second grasping point is located with a third, more specialized, CNN.

Learning from demonstration

Learning from demonstration is conceived to transfer assistive skills from non-expert users to the robot. It can be achieved through kinaesthetic teaching or a motion capture system, so that demonstrations of the task executed in several situations can be used to adapt rapidly to new situations. For these reasons, LfD is widely used for robotic manipulation tasks such as assistive dressing and towel and rope folding.

Sannapaneni et al. [29] proposed an algorithm that folds cloth using the Amrita Dual Anthropomorphic Manipulator (ADAM). Cloth coordinates (composed of four points) are extracted using depth images and are used to classify the cloth shape as trousers or a t-shirt. The algorithm uses the marker coordinates together with the stored cloth dimensions and type; the markers define four pick-and-place points, which are then used to fold the cloth through simple geometric calculations (see the sketch below). The method was implemented for cloth folding, but its limitation is that it can only be used for clothes of a specific shape. To develop more complex assistive dressing algorithms, well-known LfD techniques such as Dynamic Movement Primitives (DMPs) and hidden Markov models (HMMs), in combination with traditional methods, have been applied.
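As an example of the kind of simple geometric calculation involved, the sketch below computes pick-and-place points for folding a rectangular cloth in half from its four corner coordinates; the corner ordering and the half-fold task are our own illustrative assumptions, not the exact procedure of [29]:

```python
import numpy as np

def half_fold_pick_place(corners):
    """Given four cloth corner coordinates (ordered top-left, top-right,
    bottom-right, bottom-left), compute pick and place points for
    folding the cloth in half, bottom edge onto top edge."""
    tl, tr, br, bl = (np.asarray(c, dtype=float) for c in corners)
    picks = [bl, br]    # grasp the two bottom corners
    places = [tl, tr]   # lay them onto the matching top corners
    return list(zip(picks, places))

# Example: a 30 x 40 cm towel lying flat on the table (coordinates in cm).
for pick, place in half_fold_pick_place([(0, 40), (30, 40), (30, 0), (0, 0)]):
    print(f"pick {pick} -> place {place}")
```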

Reinforcement learning

Balaguer et al. [30, 31] were among the first groups of researchers to formulate cloth assistance problems as an RL problem. They combined imitation and RL to learn a control policy for two independent manipulators, working collaboratively, to achieve a towel folding task. Imitation training data was acquired by a motion capture system tracking reflective markers placed on the towel while a human performed momentum folds—the kind of fold where the force applied to the grasping points of the towel is used to give momentum to the towel and lay half of it flat on the table. Rewards were computed as the exponential of the negative smallest error between an observation and the training samples. This error was calculated by the Iterative Closest Point (ICP) algorithm. PoWER [32], a state-of-the-art RL algorithm, was used to learn a policy based on the human samples.
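A minimal sketch of this reward, assuming a mean nearest-neighbour distance as a stand-in for the full ICP alignment error of [30, 31]:

```python
import numpy as np
from scipy.spatial import cKDTree

def reward(observation, training_samples):
    """Exponential of the negative smallest registration error between
    the observed marker cloud and the human training samples."""
    def cloud_error(obs, sample):
        d, _ = cKDTree(sample).query(obs)  # closest-point distances
        return d.mean()
    err = min(cloud_error(observation, s) for s in training_samples)
    return np.exp(-err)
```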

Yaqiang et al. [33] also targeted a t-shirt folding task while the shirt lies on the chest of a person. They taught the general motion expected of a dual-arm robot through demonstration: a human demonstration of the expected folding behaviour is captured by a 3D range image capture system, and coloured markers placed on the shirt help recognize the state and shape of the cloth. Since the final state of the cloth is explicitly defined by the marker positions, the problem is reduced to a search problem, and the PILCO algorithm [34] is used for policy search.

Finally, Wu et al. [35] proposed a conditional learning approach for learning to fold deformable objects with improved sample complexity.

Putting a cloth on user’s arm

In this part, the papers concerning putting a cloth on the user's arm are evaluated and divided into subsections according to the control strategy applied.

Supervised learning

Zhang et al. [36] proposed the offline learning of a cloth dynamics model incorporating reliable motion capture data, and applied this model to the online tracking of the human-cloth relationship using a depth sensor. The authors tested the approach using a robot that puts a cloth on the user's arm. Furthermore, Chance et al. [37] used a Support Vector Machine (SVM) to dress a jacket onto a mannequin or human participants, considering several combinations of user pose and clothing type. In detail, their SVM method involved searching for an optimal hyperplane that separates the data by class, optimized by finding the largest margin at the boundaries.

Moreover, Stria et al. [38] used an SVM for the classification of garment categories, focusing particularly on putting a shirt on a user's arm.

Erickson et al. [39] used a fully connected NN that estimated the local pose of a human limb in real time. A key benefit of this sensing method is that it can sense the limb through opaque materials, including fabrics and wet cloth, enabling a robot that can assist a person during dressing and bathing tasks. The authors tested their approach by putting a hospital gown on the user's arm.

Finally, Gao et al. [40] used a random forest approach. They presented an end-to-end approach to building user-specific models for home-environment humanoid robots to provide personalised dressing assistance (a robot puts a cloth on the user's arm). By mounting a depth camera on top of the head of a Baxter humanoid robot, they recognised the upper-body pose of users from a single depth image using randomised decision forests. From sequences of upper-body movements, the movement space of each upper-body joint is modelled as a mixture of Gaussians learned by an expectation-maximization (EM) algorithm. The experimental results showed that their method of modelling the upper-body joint movements of users, combined with real-time upper-body pose recognition, enables a humanoid robot to provide personalised dressing assistance and has potential use in rehabilitation robotics and long-term human-robot interaction.
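A minimal sketch, assuming synthetic joint positions, of how such a movement-space model can be fitted with an EM-trained Gaussian mixture in scikit-learn (the component count and data are illustrative, not the configuration of [40]):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical recorded elbow-joint positions over a dressing sequence
# (x, y, z in metres); in [40] one such model is fit per upper-body joint.
joint_positions = np.random.default_rng(1).normal(
    loc=[0.4, 0.1, 1.1], scale=0.05, size=(500, 3))

# EM-fitted mixture of Gaussians modelling this joint's movement space.
gmm = GaussianMixture(n_components=3).fit(joint_positions)

# Low log-likelihood flags poses outside the user's usual movement
# space, which can be used to personalise the dressing trajectory.
print(gmm.score_samples(joint_positions[:5]))
```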

Learning from demonstration

Pignat et al. [41] proposed a different approach that encodes a joint distribution in a hidden semi-Markov model (HSMM) for adaptive dressing assistance. The parameters of this model, which represents a sequence of complex behaviours, were learned from human demonstration data using an EM algorithm. This method provided a solution for movement primitives (MPs), which usually encode only motor commands, and it improved robot behaviour by allowing it to be controlled both time-dependently and time-independently. Another HMM [42] method was used to classify the time series of forces in robot-assisted dressing [43]. Raw forces and the movement of the end-effector in the x and z directions, collected from 12 human participants, were used as the dataset, and HMMs were used for pattern recognition of the forces. The performance of the HMMs was validated using univariate and bivariate models with the force in the x-direction. The limitation of these methods (DMP and HMM) was that the workspace in which the robot can move to assist dressing was inadequate compared to human body movement. In addition, demonstrations for a specific task with a one-to-one relationship had the restriction that motor commands were always linked to a unique perception distribution. To overcome this problem, a combination of each demonstration with the point cloud of the scene was developed for towel folding manipulation. First, the method recorded the demonstrated pose and force trajectories; the authors found that five demonstrations were sufficient for achieving generalization. The point cloud of the scene was retrieved using a Kinect depth sensor at the beginning of each demonstration, and the thin-plate spline robust point matching (TPS-RPM) algorithm [56] was used to match each demonstration to the current point cloud scene. After the demonstrations, a mean trajectory and a sequence of time-varying feedback gains were extracted, and the gains were learned using a joint Gaussian distribution. This method is beneficial for dressing from few demonstrations, and the point cloud of the scene allows new situations to be recognized well, but it still needs optimal gains to optimize the task.
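The following sketch illustrates HMM-based classification of force time series in the spirit of [43], fitting one Gaussian HMM per outcome class with the hmmlearn library and labelling new sequences by likelihood; the class names and state count are our assumptions, not those of the original study:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_force_hmms(sequences_by_class, n_states=4):
    """Fit one Gaussian HMM per outcome class (e.g. 'success',
    'snagged') on force/end-effector time series."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                # stacked force samples
        lengths = [len(s) for s in seqs]   # per-sequence lengths
        models[label] = GaussianHMM(n_components=n_states).fit(X, lengths)
    return models

def classify(models, sequence):
    """Label a new force sequence by the highest-likelihood HMM."""
    return max(models, key=lambda lbl: models[lbl].score(sequence))
```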

Kapusta et al. [44] provided evidence for the value of data-driven haptic perception for robot-assisted dressing through a carefully controlled experiment. To design an informative and replicable experiment, they deliberately focused on a representative sub-task of dressing with a commonly used article of clothing, and they tested their approach by wearing a hospital gown on the user's arm.

Reinforcement learning

Clegg et al. [45] approached the dressing problem differently by viewing the long-horizon task of dressing as a sequence of smaller subtasks. They argued that learning to dress is challenging because humans rely heavily on haptic information and the task itself is a prolonged sequence of motions that are very costly to learn together, especially in the right order. They therefore proposed to learn a separate policy for each subtask and introduced a policy sequencing algorithm that matches the output state distribution of one subtask to the input state distribution of the next, while the transitions between subtasks are managed by a state machine. To deal with the high-dimensional observation space typically associated with dressing tasks, they defined their observation space as a 163-dimensional vector which includes information on the human's joint angles, garment feature locations (e.g. a sleeve opening), haptics (contact forces between human and cloth), surface information (on the inner and outer surfaces of the garment), and a task vector. The reward function is then defined as the weighted sum of a progress reward (the extent to which a limb is dressed), a deformation penalty (penalizing undesired cloth deformation), a geodesic reward, a reward for moving the end effector in the direction of the task vector, and another reward that attracts the character to a target position (see the sketch below). With these definitions of reward and state, which are queried from a dressing simulation, the Trust Region Policy Optimization (TRPO) algorithm [46] was used to update the policy parameters, represented by a neural network. They validated their approach by putting a hospital gown on a virtual user. The same authors presented a DRL-based approach for modelling collaborative strategies for robot-assisted dressing tasks in simulation. Their approach applied co-optimization to enable distinct robot and human policies to explore the space of joint solutions to maximize a shared reward. In addition, they presented a strategy for modelling impairments in human capability and demonstrated that their approach enables a robot, unaware of the exact capability of the human, to assist with dressing tasks.
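A minimal sketch of such a weighted-sum reward; the weights and signature are illustrative placeholders of ours, not the values used in [45]:

```python
def dressing_reward(progress, deformation, geodesic, task_align,
                    target_dist, w=(5.0, -1.0, 1.0, 2.0, -0.5)):
    """Weighted sum of the reward terms described above: progress
    reward, deformation penalty, geodesic reward, task-vector
    alignment reward, and a target-position term."""
    terms = (progress, deformation, geodesic, task_align, target_dist)
    return sum(wi * ti for wi, ti in zip(w, terms))
```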

Other methods

Chance et al. [47] created strategies for an assistive robot to support dressing using a compliant robotic arm on a mannequin. A tracking system is used to find the arm position of the mannequin, and it supports trajectory planning using waypoints. Torque feedback and sensor tag data provide failure detection, and speech commands allow correction of detected dressing errors. The authors tested the proposed method on ten different poses of the mannequin and showed that assistive dressing tasks can be developed without complex learning algorithms. Furthermore, the method has the advantage of using a small number of low-cost sensors, which can be used to sense unplanned movement in smooth trajectories. The limitation of this strategy is the lack of force feedback from the mannequin, which is important to have since people could be hurt by the robot. They validated their approach by putting a t-shirt and a jumper on the user's arm.

Erickson et al. [48] showed how task-specific LSTMs can estimate force magnitudes along a human limb for two simulated dressing tasks. At each time step, their LSTM networks took a 9-dimensional input vector consisting of the force and torque applied to the end effector by the garment and the velocity of the end effector. The networks then output a force map at each time step consisting of hundreds of inferred force magnitudes across the person's body. Their work was tested on a simulated robot that puts a shirt on a virtual user's arm.
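A PyTorch sketch of this kind of sequence-to-force-map network; the hidden size and map resolution are illustrative assumptions, not the configuration of [48]:

```python
import torch
import torch.nn as nn

class ForceMapLSTM(nn.Module):
    """At each time step, map a 9-D vector (end-effector force, torque,
    and velocity) to a map of inferred force magnitudes along the limb."""

    def __init__(self, n_map_points=400, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=9, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_map_points)

    def forward(self, seq):    # seq: (batch, time, 9)
        h, _ = self.lstm(seq)
        return self.head(h)    # (batch, time, n_map_points)

# One 100-step episode of force/torque/velocity readings.
force_maps = ForceMapLSTM()(torch.rand(1, 100, 9))
```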

Putting a cloth on user’s head

In this section, the papers concerning putting a cloth on the user's head are evaluated and divided into subsections according to the control strategy applied.

Supervised learning

Koganti et al. [49] proposed a data-efficient representation to encode task-specific motor skills of the robot, using Bayesian non-parametric latent variable models to learn a dynamics model of the human-cloth relationship and using this model as a prior for robust tracking in real time. They reduced their policy search space by first learning a low-dimensional latent space using the BGPLVM [44]. A dataset of successful clothing assistance trajectories was then used to train a latent space that encodes the motor skills. Each trajectory was transformed into a sequence of points in the latent space, forming latent-space trajectories, followed by a policy search using the PoWER algorithm [32]. The authors validated their idea by putting a t-shirt on a person. The same authors learnt the underlying cloth dynamics using the shared Gaussian Process Latent Variable Model (shared GP-LVM), incorporating accurate state information obtained from a motion capture system into the dynamics model. The shared GP-LVM provides a probabilistic framework to infer the accurate cloth state from noisy depth sensor readings. The experimental results showed that the shared GP-LVM was able to learn reliable motion models of the T-shirt state for robotic clothing assistance tasks. They also demonstrated three key factors that contribute to the performance of the trained dynamics model. The advantage of using the GP-LVM is that a corresponding latent-space manifold can be learned for any representation used in the observation spaces.
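A minimal sketch, using the GPy library and random stand-in data, of learning a low-dimensional latent space of cloth states with a Bayesian GP-LVM; the sizes and data are illustrative, and the exact model configuration of [49] may differ:

```python
import numpy as np
import GPy

# Hypothetical high-dimensional cloth-state observations, e.g. flattened
# 3-D positions of tracked T-shirt points over a dressing trajectory.
Y = np.random.default_rng(2).normal(size=(100, 60))

# Learn a 2-D latent space for the cloth state with a Bayesian GP-LVM.
model = GPy.models.BayesianGPLVM(Y, input_dim=2, num_inducing=20)
model.optimize(messages=False)

# Posterior mean of the latent points: a low-dimensional trajectory in
# which policy search (e.g. with PoWER) becomes far cheaper.
latent_trajectory = model.X.mean
```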

Saxena et al. [50] also used SL for grasping point detection and for garment recognition; the challenge of their work was to use the Kinect camera near the garment to test the algorithm under occluded views of the object. The authors tested their approach by putting a t-shirt on a person.

Learning from demonstration

Joshi et al. [51] (Fig. 4) presented a framework for robotic clothing assistance using DMPs on a Baxter robot. The authors divided the dressing task into three phases (reaching, arm dressing, and body dressing), each requiring different skills. The reaching phase moves the robot arm to a specific location without collision and can thus be achieved with a point-to-point trajectory, while the arm dressing phase brings the garment to the elbow position. To reach that position, DMPs, which support global trajectory modification, were used; DMP parameters can be acquired from kinaesthetic demonstration and support generating a trajectory globally through the DMP start and goal parameters. Compared to reaching the elbow position, generating a trajectory to the torso position is more complicated, so the authors introduced the Bayesian Gaussian Process Latent Variable Model (BGPLVM) for the body dressing phase. They applied the BGPLVM to encode complicated motor skills, generalizing the trajectory in latent space and modifying it locally. The authors validated their idea using a manipulator that puts a t-shirt on a person.
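A minimal one-dimensional DMP sketch showing the property exploited here: the trajectory is generated by a spring-damper system shaped by a learned forcing term, so changing the goal parameter adapts the whole trajectory globally (gains and basis settings are illustrative, not those of [51]):

```python
import numpy as np

def dmp_rollout(y0, goal, weights, centers, widths, tau=1.0, dt=0.01,
                alpha=25.0, beta=6.25, alpha_x=3.0):
    """Roll out a 1-D Dynamic Movement Primitive: a spring-damper
    system pulled towards `goal`, shaped by an RBF forcing term."""
    y, dy, x = y0, 0.0, 1.0
    path = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)           # RBF basis
        f = x * (goal - y0) * psi.dot(weights) / psi.sum()   # forcing term
        ddy = alpha * (beta * (goal - y) - dy) + f           # transformed system
        dy += ddy * dt / tau
        y += dy * dt / tau
        x += -alpha_x * x * dt / tau                         # canonical system
        path.append(y)
    return np.array(path)

# Ten basis functions; zero weights give a plain point-to-point reach,
# and learned weights reproduce the demonstrated trajectory shape.
path = dmp_rollout(0.0, 1.0, np.zeros(10), np.linspace(0, 1, 10),
                   np.full(10, 25.0))
```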

Fig. 4 LfD example where a Baxter robot is dressing a man with a T-shirt [51]

Reinforcement learning

Koganti et al. [52] used a depth sensor to extract and filter point clouds of the t-shirt collar and sleeve, which were detected by a colour extraction method. Once retrieved, both point clouds were approximated with an ellipse, and the topological relationship between cloth and body was then computed in real time. They also modified the reward function to calculate the Mahalanobis distance between the current and target states, to account for the different scales of the state variables. The authors tested their model using a robot that puts a t-shirt over the head of a person. Twardon et al. [53], instead, equipped a dual-arm robot with anthropomorphic hands and made it learn to put a knit cap on a styrofoam head. They modelled the head as an ellipsoid using point cloud data and constructed a head-centric policy space in which the policy search takes place. The policy was defined in this space as parameterized end-effector trajectories (parameterized as B-splines) from the back of the head (back pole) to its front (front pole). They then defined an objective function which gives the robot a fixed reward for successful task completion while encouraging it to find a trade-off between minimizing the risk of early failure and establishing contact between the fabric and the head. This setting allowed the authors to use a gradient-free direct policy search, the Active-CMA-ES algorithm [80], to find the optimal policy by minimizing the objective function.
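A sketch of such a gradient-free policy search with the pycma library, optimizing B-spline control-point parameters against a placeholder objective; the dimensionality and the objective itself are our assumptions, not those of [53]:

```python
import numpy as np
import cma  # pycma

def objective(spline_params):
    """Hypothetical stand-in for the objective of [53]: execute the
    cap-pulling trajectory parameterized by B-spline control points and
    return a cost (failure risk minus fabric-head contact bonus)."""
    return float(np.sum(spline_params ** 2))  # placeholder cost

# Gradient-free direct policy search with (active) CMA-ES over the
# trajectory's control-point parameters.
es = cma.CMAEvolutionStrategy(np.zeros(12), 0.5, {"CMA_active": True})
while not es.stop():
    params = es.ask()
    es.tell(params, [objective(np.asarray(p)) for p in params])
best_params = es.result.xbest
```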

Furthermore, Tamei et al. [54] presented a novel learning system for an anthropomorphic dual-arm robot to perform the clothing assistance task. The key ideas of their system were to apply a reinforcement learning method to cope with posture variation of the assisted person, and to define a low-dimensional state representation utilizing the topological relationship between the assisted person and the non-rigid material. With their experimental system for T-shirt clothing assistance, including an anthropomorphic dual-arm robot and a soft mannequin, they demonstrated that the robot quickly learns to modify its arm motion to put the mannequin's head into a T-shirt.

Additionally, Matsubara et al. [55] and Shinoara et al. [56] proposed a novel framework for learning motor skills for interaction with non-rigid materials by RL. Their framework focuses on the topological relationship between the configuration of the robot and the non-rigid material. They constructed an experimental setting with an anthropomorphic dual-arm robot and a tailor-made T-shirt, and both applied the method to make the robot perform the motor task of putting on a T-shirt.

Other methods

Klee et al. [57] focused on the motion interaction between the robot and the person. The authors proposed a solution involving manipulator motions and user repositioning requests. Specifically, the solution allows the robot and the user to take turns moving in the same space and is cognizant of the user's limitations. To accomplish this, a vision module monitors the human's motion, determines whether they are following the repositioning requests, and infers mobility limitations when they cannot. The learned constraints are used during future dressing episodes to personalize the repositioning requests. Their contributions include a turn-taking approach to human-robot coordination for the dressing problem and a vision module capable of learning user limitations. They validated their approach using a robot that puts a hat on the user's head.

Putting a shoe on the user's feet

In this section, the papers concerning putting a shoe on the user's feet are evaluated and divided into subsections according to the control strategy applied.

Learning from demonstration

Canal et al. [58] defined a method to guide a planner to choose the actions preferred by the user. The user model was included in the planning domain as predicates, and the actions' associated costs depend on them, the costliest actions being those that do not satisfy the user model. Moreover, they used a stochastic planner with NID rules that contemplate the possibility of different action outcomes and failures. The initial user model was inferred by asking the user two simple questions related to his/her confidence and comfort, and a Fuzzy Inference System (FIS) was then used to translate the answers into planning predicates. To make the planner adapt to changes in user behaviour and to cope with wrongly inferred user models, each rule's probabilities and costs were updated: first, an initial refinement was performed to favour the inferred user model; then, after each task completion, the satisfaction of the user was used to refine each rule's cost, and the outcome of each action was used to refine the success probabilities. This separates the user model from the action outcomes, as user satisfaction should not be measured only by the success of the actions, which may fail due to events unrelated to the user's preferences. Moreover, the system was able to plan with task-related actions as well as interaction actions, asking the user to move when needed and informing them about the next action when this increased its success rate. They showed how the system was able to adapt to changes in user behaviour, and how the use of feedback to update the action costs with a decreasing m-estimate produced more stable behaviour and faster convergence to the preferred solution.

Putting an item on the user's leg

In this part, the papers concerning putting an item on the user's leg are evaluated and divided into subsections according to the control strategy applied.

Other methods

Yamazaki et al. [59] focused on a different task: the actions by which a robot can pull a pair of trousers along the subject's legs. These actions are frequently requested by people requiring dressing assistance and are potentially automatable. The authors implemented the dressing procedure on a life-sized humanoid robot: estimating the shape of the legs from images captured by a three-dimensional range camera, they proposed a method of modifying the trajectory from a basic one estimated from statistical human-body data.

Multiple tasks

In this section, the papers concerning multiple tasks are evaluated and divided into subsections according to the control strategy applied.

Learning from demonstration

Lee et al. [5] presented an approach for generalizing force-based demonstrations of deformable object manipulation skills to novel situations. Their method uses non-linear geometric warping based on point cloud registration to adapt the demonstrations to a novel test scene, and then learns appropriate feedback gains to trade off position and force goals in a manner consistent with the data, providing for variable impedance control. Their results showed that including forces in the manipulation tasks allows for significantly greater generalization than purely kinematic execution: knots could be tightened more tightly in ropes with greater length variation and could be tied to a pipe without slipping off, towels of varying geometries could be stretched and laid flat, and whiteboards could be erased effectively. They chose their tasks to include both phases that were determined primarily by pose, such as positioning the gripper to grasp the rope, and phases that were primarily force-driven, such as tightening the knot. Performing such tasks kinematically is unreliable, because some parts are defined primarily by the force exerted on the object, while others require precise positioning. Automatically determining whether force or pose is important at each phase is essential for effectively generalizing demonstrations of such tasks. The authors validated their work using a robot that tied a knot, folded a towel, erased a whiteboard, and tied a rope to a pipe.

Reinforcement learning

Tsurumine et al. [60] (see Fig. 5a, b) proposed two DRL algorithms, the deep policy network and the duelling deep policy network, which combine smooth policy updates with the automatic feature extraction of deep neural networks to enhance sample efficiency and learning stability with fewer samples. To exploit the nature of smooth policy updates, they used dynamic policy programming [61], which adds the Kullback-Leibler divergence between the current policy π and a baseline policy π̄ to the reward function, minimizing the difference between the current and baseline policies while maximizing the expected reward (a sketch of this reward shaping is given below). A novel DDQN-inspired architecture was also presented that learned separate value and advantage functions and used human demonstrations to drastically reduce the exploration space of the RL agent. The state was defined as raw RGB images, which are mapped to optimal actions by the neural network. The reported results indicated stable and sample-efficient learning for cloth manipulation tasks such as folding a t-shirt and flipping a handkerchief when compared to the deep Q-learning (DQN) [62] algorithm, while simultaneously earning a higher total reward.
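In this scheme, the shaped reward can be written as

$$\tilde{r}(s,a) = r(s,a) - \eta\,\mathrm{KL}\!\left(\pi(\cdot \mid s)\,\|\,\bar{\pi}(\cdot \mid s)\right)$$

where η weighs the KL term (our notation; the exact parameterization in [60, 61] may differ). Maximizing the expected shaped reward thus maximizes the return while keeping the current policy π close to the baseline policy π̄.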

Fig. 5 a RL example of a robot folding a T-shirt [57] and b RL example of network implementation with folding steps of the T-shirt [57]

Matas et al. [63], instead, proposed a task-agnostic algorithm based on deep RL which bypasses the need to explicitly model cloth behaviour and does not require reward shaping to converge. The agent was able to learn three long-horizon tasks: folding a towel up to a tape mark, folding a face towel diagonally, and draping a small towel over a hanger. Training was seeded with 20 demonstrations and happened entirely in simulation, with a couple of adaptations to account for imperfections in the simulator's deformable-body support, and with domain randomization to enable easy transfer of the policy. The learning algorithm incorporated nine improvements proposed in the recent literature, and the authors presented ablation studies to understand the role of these improvements.

Discussion

As described in the Results section, the three learning-based control approaches have many advantages for cloth manipulation and dressing assistance. However, they still have disadvantages that need to be addressed, so the pros and cons of these approaches are described in Table 2. We also analyse the state of the art to provide a list of hints regarding control approaches for dressing tasks. Future research efforts should aim to overcome the limitations of the existing works, as summarized in Table 3, which identifies several areas that should be analysed in future work, such as datasets, multisensory approaches, perception, manipulation, and the simulation and experimental phases. Moreover, legal and social aspects should be taken into consideration to build an efficient behavioural model for future robots.

Table 2 Pros and cons of SL, LfD, and RL
Table 3 Challenges and opportunity/Weaknesses

On the other hand, it should be noted that research and development in this field is following a positive trend, with many examples of concrete experimentation whose code is available to users [30, 44, 56]. This is very useful for readers who want to replicate the results and develop new code.

Dataset

A crucial aspect of dressing a person using a robot, or more technically speaking using machine learning techniques, is the acquisition of a dataset of clothes, and this remains a limitation. In the state of the art, we can find several apparel datasets such as the DeepFashion dataset or the Fashion-MNIST dataset; the issue is that these datasets are very small compared to other existing datasets outside fashion, such as the MNIST dataset. Furthermore, in [51], the authors noted that their approach (dressing a sleeveless t-shirt) should be extended to other clothing articles such as pants, jackets, and so on, for each type of clothing article. The same argument was made in [28], where the authors explained that in future work they would design a system that can easily be scaled to work with more types of garments with few modifications, expanding their dataset. The main problem with a small dataset is that it leads to misclassification of clothes when a new item is used in the experimental phase, because the robot is not able to recognize it.

Sensor technology

Sensor technology plays an important role in the interaction between humans and robots, and many sensors are being employed to infer the cloth state in cloth manipulation and assisted dressing tasks. In the field analysed, sensor information is often used separately, while a multisensory approach is important to improve the accuracy of the system. Zhang et al. [36], for example, aim in their future work to combine multi-modal information, including gripper positions, force information, and depth images, into a probabilistic framework to obtain real-time estimates of the arm pose during the dressing process.

Perception

A second ability that should be analysed in depth is perception. At present, perception mostly relies on vision for the detection of clothes, which gives the robot a very limited view. There are a few works, such as Yuan et al. [73], where information from tactile sensors is used for perception. Stria et al. [38] underlined the importance of not relying only on vision and stated that they plan to develop advanced models of clothing that take physical properties into account when unfolding a towel. They also pointed out that they plan to detect and model special parts of clothing such as buttons, pockets, or collars, which provide additional information about the garment configuration.

Furthermore, in [25], the garment handled by the robot is placed in a specific position, which represents a very limited scenario. The authors stated that active perception is needed for identifying the rope's knot structure, e.g. turning the rope over or introducing additional slack to perceive it better. Moreover, they pointed out that tactile sensing could also improve perception. Saxena et al. [50] proposed adding views of the cloth using a hand-mounted camera, or placing more cameras around the grasping scene, to avoid occlusion and obtain an improved representation of the item.

Manipulation

There are many issues still to be solved concerning manipulation. The main problem found in dressing tasks is that clothing assistance is developed only for specific scenarios.

Another important aspect is the number of experiments and of people involved in the trials.

Tamei et al. [54] also underlined the importance of validating the task of putting a t-shirt over a mannequin's head with more participants. They stated that testing their dressing method with different participants is important to obtain more information about the person's experience of the interaction with the robot and thereby improve it.

Furthermore, another issue that should be considered is manipulating not just one object at a time, but collecting several clothes at the same time, even if some parts of them are occluded [28]. Collecting more garments would speed up the dressing assistance process.

Manipulation in situations of occlusion is another problem that should be solved, as stated in [36].

Another important issue is that some methods cannot learn to manipulate new objects exclusively by watching human demonstrations, since performing a manipulation task requires a model that can effectively predict the motion of the object, and this model is learned from the robot's own experience. Yang et al. [23] underlined this concept, stating that their future focus is to apply their model to unknown items quickly. This limitation could be overcome simply by collecting data from many object manipulation scenarios.

Simulation and experimental phases

In the simulation phase, an issue that should be overcome is the lack of support for deformable objects in most robotic simulators. Among the widely used simulators, Gazebo [74] does not offer this capability, and only PyBullet [75] implements some rudimentary and experimental functionality for simulating deformable objects. Solving this problem would let researchers create accurate models of deformable-object grasping that can help the experimental phase succeed. One problem is that in many real scenarios the robot is tested on a mannequin. Working with a mannequin is problematic, since the researchers cannot obtain feedback about the intensity of the forces applied by the robot [54]; if the robot instead interacts with a person, it can receive this kind of feedback. A second issue is the importance of having several scenarios for validating the behaviour of a robot [44]. In the experiments of that paper, a specific scenario was used: an armrest supported the participant's upper arm, and the participant's forearm was initially aligned with the robot's motion. In other dressing tasks, body parts could have greater freedom to move with respect to the robot, resulting in more variability in the forces measured by the robot. Likewise, a robot might hold a garment in place while the person's body moves, which might also increase the variability of the measured forces [44]. Yang et al. [23] also underlined the importance of having several scenarios, stating that they expect the accuracy and adaptability to various environments, as well as the robot's task execution speed, to improve in the future.

Another important issue, which emerges in [57], is the autonomy of the robot with respect to human control. If a system is autonomous, task completion time is reduced and the robot can achieve better performance in dressing tasks. Moreover, two papers point out the importance of the number [54] and the feedback of participants for increasing the accuracy of the model. Other elements that should be considered during the experimental phase are the following: trying not just a single network but several of them, comparing them to find the one that boosts the accuracy of the dressing task, and improving the planning algorithms [20]. In the latter case, algorithms that consider the limitations of a robot's arms as well as uncertainty in perception would also improve performance and the safety of the people working with the robot. Several works consider only a single dressing task, a single grasped object, or specific configurations [55]; these limitations should be overcome in the future to obtain models that are as generic as possible. Moreover, addressing tasks such as turning socks inside out and applying bandages [56] could be a new step in this field. Another issue is the conversion from mesh to voxel representations: for example, given a cloth folded neatly in half, it can be nearly impossible to distinguish on which side the fold lies by looking at the voxel representation alone. The authors found that such ambiguity can be greatly reduced by adopting a coloured voxel representation that marks the cloth's hems; however, in practical applications this would require visual recognition of the hems in a pre-processing step. Canal et al. [58] stated the importance of long-term adaptation of the system, explaining that in the future long-term adaptation should be analysed carefully, along with the inclusion of more actions and preferences and the possibility of automatically learning the actions together with the preferences.

Legal and safety aspects

At present, few rules can be found in the legal or safety fields of social robotics. In [76], expert opinions from different international workshops explore the ethical, legal, and social (ELS) issues associated with social robots, but many questions remain open. Several extensions to cope with patient safety must be made in works such as [55], where it is pointed out that this topic should be analysed in depth. Finally, the creation of regulations for social robots, like those created for drones in past years, could be an important step towards the possibility of using robots in crowded environments [77].

Conclusions

This paper focuses on the control approaches that service robots use for dressing tasks. The current state-of-the-art of existing systems used in this field is presented to identify the pros and cons of each work with the aim to provide recommendations for future improvements.

Several issues must be solved to improve the development of robots for clothing assistance. First, datasets should be enlarged to enable better training and better results. Concerning perception, most experiments focus on one or two specific skills and have been executed under pre-defined laboratory conditions; this is far from the human ability to approach and grasp items where needed, reverse inside-out sleeves, or fold clothes [78]. Moreover, high-resolution depth sensing is needed to move towards more accurate wrinkle measurement and state estimation, but processing times are still too long for real-time applications. Furthermore, future solutions should include active vision, with the help of a robot's moving base, to obtain multiple views of the garment and generate more robust predictions [79]. Another aspect that should be taken into consideration in the future is a multisensory approach, acquiring as much information as the robot needs to accomplish its task quickly (not only vision but also force and tactile inputs).

Perceptual skills must gain in speed and accuracy and must be tightly coupled with manipulation to achieve active vision strategies to resolve uncertainties in an agile way.

Moreover, robots should have a multitasking strategy so as not to accomplish only a single task but to be useful in different ones (e.g. dressing both lower and upper limbs). Researchers should address network limitations, robot autonomy, manipulator trajectories, and the lack of support for deformable objects in most robotic simulators, and should test robots in different scenarios with different populations so as to receive feedback from the participants of the experiments. The manipulation of clothes should evolve towards algorithms that recognize unknown or occluded objects, and more attention should be paid to the forces applied by robots to a patient.

Furthermore, the legal and safety aspects of social robotics should be studied in depth, since this is still a new topic of research.

Additionally, the tasks of state estimation and tracking require further advances in versatility and uncertainty handling to effectively mimic human comprehension of cloth states and our intuitive discretization of what is a continuum of deformed states [78].

To preserve the value of each contribution, it is fundamental to integrate the suggestions provided by each work into a unified vision.

Finally, concerning the future direction of learning-based control strategies for dressing tasks, Transformers [80, 81] could be used for clothes classification. Other models that could be applied in future RL work on clothes manipulation and dressing assistance are those used in [82], while for LfD in apparel manipulation and assistance, [83] and [84] could be used in future works.

Although many improvements remain to be accomplished, the already satisfying results achieved by the authors are an excellent starting point for developing better solutions using knowledge of human cognitive and psychological structures.