Abstract
Sensitive robot systems are used in various assembly and manufacturing technologies. Assembly is a vital activity that requires high-precision robotic manipulation. One challenge in high-precision assembly arises when the required task precision exceeds the robot's precision. In this research, a Deep Q-Network (DQN) is used to perform a Peg-in-Hole assembly task with a very tight clearance. Moreover, recurrence is introduced into the system via a Long Short-Term Memory (LSTM) layer to overcome drawbacks of the DQN. The LSTM layer can encode prior decisions, allowing the agent to make more informed choices. The robot's sensors are used to represent the state. Despite the tight hole clearance, the method successfully accomplished the task at hand, which was validated on a 7-DOF KUKA LBR iiwa sensitive robot. This paper focuses on the search phase. Furthermore, our approach has the advantage of working in environments that differ from the learned environment.
1 Introduction
Industrial robotics plays a key role in production, notably in assembly. Although industrial robots are currently used primarily for repetitive, dangerous, or relatively heavy operations, robotic applications are increasingly challenged to do more than simple pick-and-place activities [1, 2]. They must be able to react to their surroundings. Sensitive robot systems are therefore capable of conducting force- or torque-controlled applications, which are used to achieve the previously mentioned contact with the environment. Although there is no universal definition of the term sensitivity, the DIN 1319 norm on measurement technology defines sensitivity as the change in the value of the output variable of a measuring instrument in relation to the causal change in the value of the input variable [3]. Special control strategies are required in the case of physical contact with the environment, since pure position control, as used in part manipulation, is no longer sufficient. Relying on force control alone is likewise insufficient, so it makes sense to employ a hybrid force/position control [4, 5]. Depending on the task, it must therefore be decided which of the translational and rotational degrees of freedom are position controlled and which are force controlled [6]. The Peg-in-Hole assembly is an example of a robotic task that requires direct physical contact with the surrounding environment [7]. It has been extensively researched in both 2-D [8, 9] and 3-D environments [10, 11], and a variety of techniques for solving it have been presented [8,9,10,11,12,13,14,15]. Conventional online programming methods have been widely used to teach robots precise industrial processes and assembly activities: a teach pendant is used to guide the robot to the desired positions while each movement is recorded. This strategy is time-consuming and difficult to adapt to new environments.
Another approach is offline programming (simulation) [9, 12]. While it has many advantages in terms of downtime, it is difficult to simulate a precise model of the actual environment due to environmental variance, and it is inefficient for industrial activities in which the required precision exceeds the robot's accuracy. Due to the limitations of these techniques, a new skill acquisition technique has been proposed [11, 15], in which the robot learns to perform the high-precision mating task using reinforcement learning [11].
2 State of the Art
A variety of techniques for tackling Peg-in-Hole assembly challenges have been suggested [8,9,10,11,12,13,14,15]. This section reviews some of these strategies. Gullapalli et al. [8] investigated a 2-D Peg-in-Hole insertion task, focusing on associative reinforcement learning to learn reactive control strategies in the presence of uncertainty and noise, with a 0.8 mm gap between peg and hole. A Zebra Zero robot with a wrist force sensor and position encoders was used, and the evaluation was conducted over 500 sequential training runs. Hovland et al. [15] proposed skill learning by human demonstration, implemented with a hidden Markov model. Nuttin et al. [12] ran a simulation with a CAD-based contact force simulator. Their results show that the insertion is considered successful if the force level or the elapsed time surpasses a particular threshold. Their approach focuses solely on the insertion and uses reactive control with reinforcement, in which the learning process is divided into two phases. The first phase is a controller consisting of two networks: a policy network and an exploration network. The second phase is an actor-critic algorithm, in which the actor computes the action policy and the critic computes the Q-value. Yun [9] imitated the human arm using passive compliance and learning. He used a MATLAB simulation to solve a 2-D Peg-in-Hole task with a 3-DOF manipulator, focusing on the search phase only; the accuracy was 0.5 mm, and the training was done with a gap of 10 mm. The main goal of that research was to demonstrate the significance of passive compliance in association with reinforcement learning. We use integrated torque sensors with the deep learning algorithm, unlike Abdullah et al.
[14], who used a vision system combined with force/torque sensors to achieve automatic assembly by imitating human operating steps; vision systems, however, are limited by changes in illumination that may cause measurement errors. Also, unlike Inoue et al.'s [11] strategy, in which the robot's movement is triggered by a force condition in the x and y directions, in our approach the robot's motion is a discrete displacement action in the x or y direction. A motion resulting from a force condition raises the difficulty that the force condition may never be reached due to the physical interaction between the robot and the environment (e.g. the stick-slip effect), eventually resulting in a theoretically infinite motion. Furthermore, in contrast to the aforementioned approaches, the Peg-in-Hole task had not previously been conducted with such a narrow hole clearance, and some of these approaches were only confirmed in simulation, which is not as exact as the real world and adds to the challenge of adjusting to real-world variance. Moreover, our approach adapts better to variations in both hole location and environmental settings. It can also take actions based on a prior state trajectory rather than only the current state, and it can compensate for sensor delays.
3 Problem Formulation and Task Description
As previously stated, when the required level of precision of an assembly task surpasses the robot's precision, it is difficult to perform Peg-in-Hole assembly tasks, and even more challenging to perform them using force-controlled robotic manipulation. Our approach to solving the Peg-in-Hole task employs a recurrent neural network trained with reinforcement learning using skill acquisition techniques [11, 13]. The first learned skill, known as the search phase, aligns the peg center within the clearance zone of the hole center. A successful search phase is followed by the insertion phase, in which the robot corrects the orientational misalignment. This paper focuses solely on the search phase. This research is conducted with a clearance of 30 \(\upmu \)m using a robot with a repeatability of 0.14 mm and a positional inaccuracy of a few millimeters.
4 Reinforcement Learning
Reinforcement learning (RL) is an agent-in-the-loop learning approach in which an agent learns by performing actions on an environment and receiving a reward (\(r_{t}\)) and an updated state (\(s_{t}\)) of the environment as a result of those actions. The aim is to learn an optimal action policy for the agent that maximizes the eventual cumulative reward (\(R_{t}\)) shown in Eq. (1), where (\(\gamma \)) is the discount factor, (\(r_{t}\)) is the current reward generated by performing action (\(a_{t}\)), and (t) denotes the step number. The learned action policy is the probability of selecting an action from the set of possible actions in the current state [11, 16].
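Since the equation referenced as Eq. (1) is not reproduced in this version, its standard form, reconstructed to be consistent with the symbols defined above, is:

```latex
\[
R_{t} = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}
\]
```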
Deep Q-Learning
Q-Learning is a model-free, off-policy RL technique. Model-free techniques do not require a model of the environment. Off-policy techniques learn the optimal action policy implicitly by learning the optimal Q-value function. The Q-value function at a given state-action pair (s,a) measures the desirability of taking action (a) in state (s), as illustrated in Eq. (2).
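The Q-value function referenced as Eq. (2) can be reconstructed in its standard form (the original equation image is not reproduced here): the expected return when taking action (a) in state (s) and following the policy thereafter,

```latex
\[
Q(s, a) = \mathbb{E}\left[\, R_{t} \,\middle|\, s_{t} = s,\ a_{t} = a \,\right]
\]
```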
Q-Learning employs the \(\epsilon \)-greedy policy as its behavior policy, in which the agent chooses a random action with probability (\(\epsilon \)) and the action that maximizes the Q-value for the (s,a) pair with probability (1-\(\epsilon \)) (see Eq. (3)). In this paper, exploration and exploitation are not set to specific percentages; instead, the exploration rate decays linearly per episode, as shown in Eq. (4).
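A minimal sketch of \(\epsilon \)-greedy action selection with a linear per-episode decay, consistent with the description above; the initial rate, final rate, and decay step are illustrative assumptions, not the paper's hyperparameters:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the
    action with the highest Q-value (Eq. (3))."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay_per_episode=0.01):
    """Linear per-episode decay of the exploration rate (Eq. (4)).
    The numeric rates here are assumptions for illustration."""
    return max(eps_end, eps_start - decay_per_episode * episode)
```
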
The simplest form of Q-Learning is the tabular form, which uses an iterative Bellman-based update rule, as seen in Eq. (5). Tabular Q-Learning computes the Q-value function for every (s,a) pair in the problem space, which makes it unsuitable for the assembly task at hand due to the complexity and variety of the environment. To overcome the drawbacks of the tabular formulation, DQN was introduced in [16], in which a neural network is employed as a function approximator of the Q-value of an (s,a) pair.
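For reference, the tabular Bellman-based update that the text contrasts with DQN can be sketched as follows; the learning rate and discount factor are illustrative assumptions:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning update (the Bellman-based rule of Eq. (5)):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```

A table like this grows with every distinct (s,a) pair, which is why a continuous, sensor-driven state space such as the one used here requires a function approximator instead.
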
Deep Recurrent Q-Learning
While Deep Q-Learning can learn action policies for problems with large state spaces, it struggles to learn sequential problems where action choice is based on a truncated trajectory of prior states and actions. This challenge urged the use of another DQN variant which has a memory to encode previous trajectories. In this paper, a Deep Recurrent Q-Network (DRQN) is utilized as a suitable DQN variant. DRQN was introduced in [17] to solve the RL problem in partially observable markov decision process (POMDP). DRQN utilizes long-short term memory (LSTM) layers to add recurrency to the network architecture. The LSTM layer can encode previous (s,a) trajectories providing enhanced information for learning the Q-values. In addition, the recurrency can account for sensor and communication delays.
Action and Learning Loops
The Deep Recurrent Q-Learning algorithm is illustrated in Fig. 1. The algorithm can be divided into two parallel loops: the action loop (green) and the learning loop (yellow). The action loop is responsible for choosing the agent's action by feeding the current environment state through a policy network. The policy network estimates the Q-value function over the current state and the set of available actions. Based on the \(\epsilon \)-greedy exploration rate, the agent's action is either the action with the highest Q-value or a randomly sampled action, as illustrated in Eq. (3). At each step, the \((s_{t},r_{t},a_{t},s_{t+1})\) experience is saved in a replay memory. After a predefined number of episodes, the agent starts learning from randomly sampled experience batches. Each experience batch is a sequence of steps of a defined length from a randomly sampled episode. The target network is an additional network serving as a temporarily fixed target for the optimization of the Bellman Eq. (5). The weights of the target network are copied from the policy network after a number of steps. The policy network estimates the Q-value of the \((s_{t},a_{t})\) pair, while the target network estimates the maximum Q-value achievable in \((s_{t+1})\). The output from both networks is used to compute the proposed loss function in Eq. (6). Gradient descent with backpropagation of the loss is used to update the policy network, as illustrated in Eq. (7).
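The episode-wise replay memory described above, which stores experiences per episode and samples fixed-length step sequences for the recurrent network, can be sketched as follows; the capacity and class name are illustrative assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s, a, r, s_next) steps grouped by episode and samples
    contiguous fixed-length sequences, matching the sequential batches
    the DRQN learning loop requires. Capacity is an assumption."""
    def __init__(self, capacity=100):
        self.episodes = deque(maxlen=capacity)

    def push_episode(self, steps):
        self.episodes.append(list(steps))

    def sample_sequence(self, seq_len):
        # Pick a random episode long enough, then a contiguous window of steps.
        candidates = [e for e in self.episodes if len(e) >= seq_len]
        ep = random.choice(candidates)
        start = random.randrange(len(ep) - seq_len + 1)
        return ep[start:start + seq_len]
```

Sampling contiguous windows (rather than independent steps) preserves the temporal order the LSTM layer needs to encode the prior (s,a) trajectory.
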
5 Search Skill Learning Approach
This paper focuses on the search skill, which is discussed in the following subsections in more detail. Fig. 2 illustrates how the learning process is carried out.
Initial Position
Each episode starts with the peg in a random position. The polar coordinates of the initial position are determined by a predefined radius from the hole’s center and a randomly sampled angle between 0 and 2\(\pi \) (see Fig. 3). The advantage of utilizing such an initialization method is that it maintains the initial distance to the hole center while searching the full task space.
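The polar initialization described above amounts to sampling a point on a circle of fixed radius around the hole center; a minimal sketch (function and parameter names are assumptions):

```python
import math
import random

def initial_position(hole_center, radius):
    """Sample a start position at a fixed radius from the hole center with
    a uniformly random angle in [0, 2*pi), so the initial distance to the
    hole is constant while the full task space is explored over episodes."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    x = hole_center[0] + radius * math.cos(theta)
    y = hole_center[1] + radius * math.sin(theta)
    return x, y
```
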
State
At each time step, the reinforcement learning (RL) agent receives a new state sensed by the robot (see Fig. 2, lower arrows), which consists of the forces in x, y, and z (\(F_{x}, F_{y},F_{z}\)), the moments around x and y (\(M_{x},M_{y}\)), and the rounded positions in x and y (\(\tilde{P_{x}},\tilde{P_{y}}\)), as seen in Eq. (8). In order to provide enough robustness against positional inaccuracy, it was assumed that the hole and the peg were not precisely positioned. \(\tilde{P_{x}},\tilde{P_{y}}\) are computed using the grid indicated in Fig. 4, where C is the margin of the positional error. This approach provides auxiliary inputs to the network, which can aid the acceleration of learning convergence.
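One plausible reading of the rounding grid, sketched under the assumption that each coordinate is snapped to the nearest multiple of the error margin C (the exact grid rule is defined by Fig. 4, which is not reproduced here):

```python
def rounded_position(p, c):
    """Snap a measured coordinate p to a grid of cell size c, where c is
    assumed to be the margin of the positional error. This is a sketch of
    the grid-based rounding, not the paper's exact formula."""
    return round(p / c) * c
```

Rounding the positions this way keeps the state informative about the coarse peg location without letting the network overfit to millimeter-level position readings that are less accurate than the task clearance.
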
Action
A deep neural network (the policy network) is utilized in the current system to estimate a Q-value function for each (s,a) pair, which subsequently generates an action index for the robot controller. The action index is used to select one of four discrete actions (see Fig. 2, upper arrows), each of which applies a constant force in the z-direction (\(F_{z}^{d}\)). According to Eq. (9), the agent alters its desired position in x or y \(\left( \pm \,d_{x}^{d}, \pm \,d_{y}^{d} \,\right) \). For all four discrete actions, the orientation of the peg (\(R _{x}^{d},R_{y}^{d}\)) is set to zero throughout the search phase. The advantage of maintaining a constant and continuous force in the z-direction is that when the search algorithm finds the hole, the peg height drops by a fraction of a millimeter, which serves as the success criterion for the search phase.
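The mapping from action index to robot command can be sketched as follows; the step size, force magnitude, and action ordering are illustrative assumptions, not the paper's values:

```python
def action_to_command(index, d=0.1, fz=5.0):
    """Map a discrete action index (0..3) to a desired displacement in
    x or y, with a constant downward force fz and zero peg rotation,
    as in the four-action search scheme. d and fz are assumed values."""
    dx, dy = [(d, 0.0), (-d, 0.0), (0.0, d), (0.0, -d)][index]
    return {"dx": dx, "dy": dy, "fz": fz, "rx": 0.0, "ry": 0.0}
```
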
Reward
A reward function is used to evaluate how much an agent is rewarded or punished for performing an action in the current state. In the proposed methodology, the reward (\(r_{t}\)) is calculated after the completion of each episode. The reward zones for our search task are illustrated in Fig. 5. First, the inner circle (green zone) indicates that the peg has either reached the goal position or that the maximum number of steps (\(k_{max}\)) per episode has been reached with the peg close to the goal. Inside the second circle (white zone), the peg is at a distance less than the initial distance (\(d_{o}\)) and receives a reward of zero. When the robot moves away from the starting position toward the boundaries of the safety limits (yellow zone), the agent receives a negative reward. Finally, the outer square (red zone) is the working space barrier, which indicates that the peg is violating the safety restrictions (D) and receives the largest negative reward.
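The zone structure of Fig. 5 can be sketched as a distance-based reward function; the numeric reward values and the circular goal region are assumptions for illustration:

```python
import math

def zone_reward(p, hole, d0, d_safe, goal_radius,
                r_goal=1.0, r_near=-0.2, r_violation=-1.0):
    """Sketch of the reward zones: green (goal) inside goal_radius,
    white (zero) closer than the initial distance d0, yellow (negative)
    drifting toward the safety limits, red (largest penalty) outside
    the square safety boundary d_safe. Reward magnitudes are assumed."""
    if max(abs(p[0] - hole[0]), abs(p[1] - hole[1])) > d_safe:
        return r_violation            # red zone: safety limits violated
    dist = math.hypot(p[0] - hole[0], p[1] - hole[1])
    if dist <= goal_radius:
        return r_goal                 # green zone: hole found
    if dist < d0:
        return 0.0                    # white zone: closer than start
    return r_near                     # yellow zone: drifting outward
```
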
6 Implementation and Validation
A KUKA LBR iiwa, a sensitive robot arm with an open kinematic chain and integrated sensors, is used for this work. The integrated torque sensors are based on strain gauges in each of the robot's joints and enable the determination of external forces and torques acting on the robot. Force-controlled robot applications are therefore possible when combined with the control approach discussed. The peg and block used in this study are made of corrosion-resistant stainless steel, which is well suited for this purpose due to the continuous force exerted during the experiments. The clearance between the peg and the hole is 30 \(\upmu \)m. The experimental setup is displayed in Fig. 6. As mentioned before, such an assembly is performed with the assistance of artificial intelligence, as the task accuracy exceeds the robot's precision. According to KUKA, the position repeatability of the LBR iiwa is ± 0.15 mm [18]. This was verified in accordance with DIN EN ISO 9283:1998 using a high-precision laser tracker (API R50-Radian) with an accuracy of \(\pm 10 \upmu {\text {m}} + 5 \upmu {\text {m}}/{\text {m}}\). The measured repeatability was 0.14 mm, which equates to around five times the clearance between the peg and the hole. In order to ensure data flow between the DRQN and the robot, Message Queuing Telemetry Transport (MQTT) was used. MQTT is a bidirectional network protocol based on the client-server principle rather than the end-to-end connection paradigm of many other network protocols. Messages are not sent directly to clients; rather, communication is event-based and follows the publish-subscribe paradigm [19]. In order to validate our approach, we conducted experiments by running 200 learning episodes followed by several test runs. To achieve near-optimal hyperparameter values, a few tests were conducted by keeping all variables constant and adjusting one at a time.
Throughout the training, the agent was able to identify the hole 130 times out of a total of 200. Two test trials were conducted, in which the agent located the hole 18 times out of 21 and 27 times out of 31, for an overall success rate of 86.5%, and as shown in Fig. 7c, the loss decreases during the training process. The loss curve also indicates a well-chosen learning rate. Fig. 7a shows that the peg strives to stay close to the hole position and only drifts further away a few times. The experiments revealed that a sparse reward function (Fig. 7b) is not the best fit for the search challenge, and that denser reward functions should be investigated. Fig. 7d shows trajectory excerpts for two cases in which the hole was identified (success) and one case in which the defined limits were exceeded (failure).
7 Conclusion and Future Work
This research demonstrated and validated the success of our proposed DRQN-based strategy in addressing a high-precision Peg-in-Hole assembly task using a 7-DOF sensitive robot with integrated sensors. The employed approach was successful in completing the search phase. It was also shown that integrating recurrence into a reinforcement learning system via an LSTM layer overcomes DQN's drawbacks: the LSTM layer was able to encode previously taken decisions, allowing the agent to make better-informed decisions and to overcome sensor delays. In the future, the approach will be extended to the insertion phase, and the network architecture will be improved, including tuning the hyperparameters, in order to reach an overall success rate of 100%. In addition, we plan to evaluate continuous action space techniques such as DDPG, DPPO, or NEAT, which could potentially enhance the performance.
References
Vogel-Heuser, B., Bauernhansl, T., Ten Hompel, M.: Handbuch Industrie 4.0 Bd. 1. Springer, Berlin (2017)
Vogel-Heuser, B., Bauernhansl, T., Ten Hompel, M.: Handbuch Industrie 4.0 Bd. 2. Springer, Berlin (2017)
DIN 1319, Grundlagen der Messtechnik: Begriffe für Messmittel. (2005)
Lynch, K.M., Park, F.C.: Modern robotics. Cambridge University Press (2017)
Siciliano, B., Khatib, O., Kröger, T. (eds.): Springer handbook of robotics, vol. 200. Springer, Berlin (2008)
Winkler, I.A.: Sensorgeführte Bewegungen stationärer Roboter. (2016)
Park, H., Park, J., Lee, D., Park, J., Baeg, M., Bae, J.: Compliance-based robotic peg-in-hole assembly strategy without force feedback. IEEE Trans. Ind. Electron. 64(8), 6299–6309 (August 2017). https://doi.org/10.1109/TIE.2017.2682002
Gullapalli, V., Grupen, R.A., Barto, A.G.: Learning reactive admittance control. In ICRA, pp. 1475–1480. (May 1992)
Yun, S.K.: Compliant manipulation for peg-in-hole: is passive compliance a key to learn contact motion? In 2008 IEEE International Conference on Robotics and Automation, pp. 1647–1652. IEEE. (May 2008)
Gubbi, S., Kolathaya, S., Amrutur, B.: Imitation learning for high precision peg-in-hole tasks. In 2020 6th International Conference on Control, Automation and Robotics (ICCAR), pp. 368–372. IEEE. (April 2020)
Inoue, T., De Magistris, G., Munawar, A., Yokoya, T., Tachibana, R.: Deep reinforcement learning for high precision assembly tasks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 819–825. IEEE. (September 2017)
Nuttin, M., Van Brussel, H.: Learning the peg-into-hole assembly operation with a connectionist reinforcement technique. Comput Ind 33(1), 101–109 (1997)
Sharma, K., Shirwalkar, V., Pal, P.K.: Intelligent and environment-independent peg-in-hole search strategies. In 2013 International Conference on Control, Automation, Robotics and Embedded Systems (CARE), pp. 1–6. IEEE. (December 2013)
Abdullah, M.W., et al.: An approach for peg-in-hole assembling using intuitive search algorithm based on human behavior and carried by sensors guided industrial robot. IFAC-PapersOnLine 48(3), 1476–1481 (2015)
Hovland, G.E., Sikka, P., McCarragher, B.J.: Skill acquisition from human demonstration using a hidden Markov model. In Proceedings of IEEE International Conference on Robotics and Automation, vol. 3, pp. 2706–2711. IEEE. (April 1996)
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Hausknecht, M., Stone, P.: Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series. (September 2015)
KUKA AG: Data sheet LBR iiwa. (2022)
MQTT Version 3.1.1.: http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/errata01/os/mqtt-v3.1.1-errata01-os-complete.html#_Introduction
Acknowledgements
The research is funded by the Interreg V A Großregion within Robotix-Academy project (no 002-4-09-001).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Afifi, N.A., Schneider, M., Kanso, A., Müller, R. (2023). High Precision Peg-in-Hole Assembly Approach Based on Sensitive Robotics and Deep Recurrent Q-Learning. In: Schüppstuhl, T., Tracht, K., Fleischer, J. (eds) Annals of Scientific Society for Assembly, Handling and Industrial Robotics 2022. MHI 2022. Springer, Cham. https://doi.org/10.1007/978-3-031-10071-0_1
Print ISBN: 978-3-031-10070-3
Online ISBN: 978-3-031-10071-0