Abstract
The main goal of this article is to present COACH (COrrective Advice Communicated by Humans), a new learning framework that allows non-expert humans to advise an agent while it interacts with the environment in continuous-action problems. The human feedback is given in the action domain as binary corrective signals (increase/decrease the current action magnitude), and COACH adaptively adjusts the amount of correction that a given action receives, taking state-dependent past feedback into account. COACH also manages the credit assignment problem that normally arises when actions in continuous time receive delayed corrections. The proposed framework is characterized and validated extensively on four well-known learning problems. The experimental analysis includes comparisons with other interactive learning frameworks, with classical reinforcement learning approaches, and with human teleoperators trying to solve the same learning problems by themselves. In all the reported experiments COACH outperforms the other methods in terms of learning speed and final performance. Notably, COACH has also been applied successfully to a complex real-world learning problem: ball dribbling by humanoid soccer players.
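To make the idea of binary corrective advice concrete, the following is an illustrative sketch (not the authors' exact formulation) of a COACH-style update for a linear policy: the teacher signals h ∈ {−1, +1}, and the policy parameters are nudged so that the action for the current state moves in the advised direction. The function name `coach_style_update` and the fixed step size are hypothetical simplifications; in COACH the correction magnitude is adapted from state-dependent past feedback.

```python
import numpy as np

def coach_style_update(weights, features, h, step):
    """Nudge a linear policy's action in the advised direction.

    weights:  parameters of a linear policy a(s) = weights @ features
    features: feature vector for the current state s
    h:        binary corrective signal from the human
              (+1 = increase the action, -1 = decrease it)
    step:     correction magnitude (fixed here for illustration;
              COACH adapts it from state-dependent past feedback)
    """
    # Gradient-style parameter update: after it, the action a(s)
    # has moved by exactly h * step for this state's features.
    return weights + h * step * features / (features @ features)

# Toy usage: a '+1' correction increases the action for this state.
w = np.zeros(3)
phi = np.array([1.0, 0.5, -0.5])
w = coach_style_update(w, phi, h=+1, step=0.1)
```

The normalization by `features @ features` makes the action change equal to `h * step` regardless of the feature scale, which is a common trick in gradient-style corrective updates.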
Acknowledgements
This work was partially funded by FONDECYT project 1161500 and CONICYT-PCHA/Doctorado Nacional/2015-21151488.
Appendix
Given that human feedback is a key component of the proposed learning framework, a new Hand-Gesture Recognition (HGR) interface for providing feedback to the agent is proposed. The interface detects five gestures: positive correction, negative correction, a neutral gesture used when the user does not need to provide feedback, a reward, and a punishment (see gestures in Fig. 15).
In order for the proposed system to be robust to variations in illumination, colors, and non-uniform backgrounds, it uses: (i) Gaussian Mixture Model (GMM) based Background Subtraction (BS) to detect regions of interest (ROI), i.e. hand candidates, (ii) Kalman filtering for tracking the hand candidates, (iii) Local Binary Patterns (LBP) as features for characterizing the ROIs, and (iv) SVM classifiers for the final detection of the hand gestures. The block diagram is shown in Fig. 16. The main functionalities are described in the following paragraphs:
- Detection of Regions of Interest (ROI): Movement blobs are first detected using background subtraction. Then, adjacent blobs are merged and filtered using morphological filters, and the largest blob is selected as a hand candidate and fed to the tracking system.

  In parallel, a second process applies BS to color edges: first, a binary edge image is computed, and then color information is incorporated into the edges. Afterwards, BS and area filtering are applied in the edge domain. Finally, the output of the area-filtering module is intersected with the color edges in the block "&". In order to manage occlusions properly (see Fig. 16b), the block "&" deletes the blobs associated with the occluded edges, which are labeled by BS as regions with movement (Fig. 15 left), since those edges are not present in the original image. The output is a blob with the detected moving color edges (Fig. 17 right).
- Tracking: The bounding-box parameters of the largest blob selected as a hand candidate by the previous module are used as observations by a Kalman filter, which estimates the final hand candidates by fusing the current ROI information with the prior ones. Afterwards, the image computed in the block "&" of the previous module is intersected with the Kalman-filtered bounding box. Examples of the resulting images are shown in Fig. 15.
- Feature Extraction and Classification: The image window given by the tracking module is analyzed in order to classify the captured gesture. Histograms of LBP features are computed inside the image window. Since this window is a binary image, the LBPs act as discretized measurements of the gradient, so their histograms are similar to Histograms of Oriented Gradients (HOG). The resulting feature vector is fed to five SVM classifiers, one trained for each gesture, which perform the final detection.
The dataset used for training the SVMs was built from images generated by the tracking module. Altogether, 1654 images of the five hand gestures were recorded; 60% of them were used for training and 40% for validation. The resulting classification error is 9.05%, which is considered adequate for the interface's use in the learning problems described in Section 4.
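The feature-extraction step above can be sketched as follows. This is a minimal, pure-NumPy stand-in for the interface's actual implementation, assuming the standard 8-neighbour LBP computed over the binarized hand-candidate window; the SVM stage is omitted, and the function name `lbp_histogram` is a hypothetical helper.

```python
import numpy as np

def lbp_histogram(window):
    """Compute an 8-neighbour Local Binary Pattern histogram.

    window: 2-D array (here, a binary hand-candidate window).
    Returns a 256-bin, L1-normalized histogram of LBP codes, i.e. the
    HOG-like descriptor fed to the per-gesture classifiers.
    """
    w = window.astype(np.int32)
    center = w[1:-1, 1:-1]
    # Offsets of the 8 neighbours, ordered clockwise from top-left;
    # each neighbour contributes one bit of the 8-bit LBP code.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = w[1 + dy:w.shape[0] - 1 + dy,
                      1 + dx:w.shape[1] - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

On a binary window the comparison `neighbour >= center` discretizes the local gradient, which is why the resulting histogram behaves like a HOG descriptor, as noted above.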
Celemin, C., Ruiz-del-Solar, J. An Interactive Framework for Learning Continuous Actions Policies Based on Corrective Feedback. J Intell Robot Syst 95, 77–97 (2019). https://doi.org/10.1007/s10846-018-0839-z