
1 Introduction

Industrial robotics plays a key role in production, notably in assembly. Although industrial robots are currently used primarily for repetitive, dangerous, or relatively heavy operations, robotic applications are increasingly challenged to do more than simple pick-and-place activities [1, 2]: they must be able to react to their surroundings. Sensitive robot systems are capable of conducting force- or torque-controlled applications, which are used to achieve the aforementioned contact with the environment. Although there is no universal definition of the term sensitivity, the measurement technology norm DIN 1319 defines sensitivity as the change in the value of the output variable of a measuring instrument in relation to the causal change in the value of the input variable [3]. Special control strategies are required in the case of physical contact with the environment, since pure position control, as utilized in part manipulation, is no longer sufficient. Relying on force control alone is likewise insufficient, so it makes sense to employ a hybrid force/position control [4, 5]. Depending on the task, it is therefore necessary to decide which of the translational and rotational degrees of freedom are position controlled and which are force controlled [6].

The Peg-in-Hole assembly is an example of a robotic task that requires direct physical contact with the surrounding environment [7]. It has been extensively researched in both 2-D [8, 9] and 3-D environments [10, 11], and a variety of techniques for solving it have been presented [8,9,10,11,12,13,14,15]. Conventional online programming methods have been widely utilized to teach robots precise industrial processes and assembly activities: a teach pendant is used to guide the robot to the desired positions while recording each movement. This strategy is time consuming and difficult to adapt to new environments. Another approach is offline programming (simulation) [9, 12]; while it offers many advantages in terms of downtime, it is difficult to simulate the actual environment precisely due to environmental variance, and it is inefficient in industrial activities when the required precision exceeds the robot accuracy. Due to the limitations of these techniques, skill acquisition techniques have been proposed [11, 15], in which the robot learns to perform the high-precision mating task using reinforcement learning [11].

2 State of the Art

A variety of techniques for tackling Peg-in-Hole assembly challenges have been suggested [8,9,10,11,12,13,14,15]; this section reviews some of these strategies. Gullapalli et al. [8] investigated a 2-D Peg-in-Hole insertion task, employing associative reinforcement learning to learn reactive control strategies in the presence of uncertainty and noise, with a 0.8 mm gap between peg and hole. A Zebra Zero robot with a wrist force sensor and position encoders was used, and the evaluation was conducted over 500 sequential training runs. Hovland et al. [15] proposed skill learning by human demonstration based on a hidden Markov model. Nuttin et al. [12] ran a simulation with a CAD-based contact force simulator; their results show that the insertion is considered successful once the force level or elapsed time surpasses a particular threshold. Their approach focuses solely on the insertion and uses reactive control with reinforcement, dividing the learning process into two phases: the first is a controller consisting of two networks, a policy network and an exploration network; the second is an actor-critic algorithm, in which the actor calculates the action policy and the critic computes the Q-value. Yun [9] imitated the human arm using passive compliance and learning. He used a MATLAB simulation to solve a 2-D Peg-in-Hole task with a 3-DOF manipulator, focusing on the search phase only; the accuracy is 0.5 mm and the training was done on a gap of 10 mm. The main goal of that research is to demonstrate the significance of passive compliance in combination with reinforcement learning. We use integrated torque sensors with a deep learning algorithm, unlike Abdullah et al. [14], who combined a vision system with force/torque sensors to achieve automatic assembly by imitating human operating steps; vision systems are limited by changes in illumination that may cause measurement errors. Also, unlike Inoue et al.'s [11] strategy, in which the robot's movement is triggered by a force condition in the x and y directions, in our approach the robot's motion is a discrete displacement action in the x or y direction, because a motion resulting from a force condition raises the difficulty that such a force condition may never be reached due to the physical interaction between the robot and the environment (e.g. the stick-slip effect), eventually resulting in a theoretically infinite motion. Furthermore, in contrast to the aforementioned approaches, the Peg-in-Hole task had not been conducted with a very narrow hole clearance, and some of these approaches were only verified in simulation, which is not as exact as the real world, adding to the challenge of adjusting to real-world variance. Moreover, our approach adapts better to variations in both hole location and environmental settings, is able to take actions based on a trajectory of prior states rather than just the current state, and can compensate for sensor delays.

3 Problem Formulation and Task Description

As previously stated, it is difficult to perform Peg-in-Hole assembly tasks when the required level of precision surpasses the robot's precision, and it is even more challenging to perform them using force-controlled robotic manipulation. Our approach to solving the Peg-in-Hole task employs a recurrent neural network trained with reinforcement learning using skill acquisition techniques [11, 13]. The first learned skill is the search phase, in which the peg seeks to align its center within the clearance zone of the hole center. A successful search phase is followed by the insertion phase, in which the robot corrects the orientational misalignment. This paper focuses solely on the search phase. The research is conducted with a clearance of 30 \(\upmu \)m using a robot with a repeatability of 0.14 mm and a positional inaccuracy of a few millimeters.

4 Reinforcement Learning

Reinforcement learning (RL) is an agent-in-the-loop learning approach in which an agent learns by performing actions on an environment and receiving a reward (\(r_{t}\)) and an updated state (\(s_{t}\)) of the environment as a result of those actions. The aim is to learn an optimal action policy for the agent that maximizes the eventual cumulative reward (\(R_{t}\)) shown in Eq. (1), where (\(\gamma \)) indicates the discount factor, (\(r_{t}\)) is the current reward generated from performing action (\(a_{t}\)), and (t) denotes the step number. The learned action policy is the probability of selecting an action from a set of possible actions in the current state [11, 16].

$$\begin{aligned} R_{t} = r_{t} + \gamma \, r_{t+1} + \gamma ^{2} \, r_{t+2} + \cdots + \gamma ^{T} \, r_{t+T} = r_{t} + \gamma \, R_{t+1} \end{aligned}$$
(1)
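To make the recursion in Eq. (1) concrete, the following minimal Python sketch (the function name and the reward sequence are illustrative, not part of the original implementation) computes the discounted return of a finite episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for a finite episode,
    iterating backwards over the recorded rewards."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R  # recursive form of Eq. (1)
    return R

# Example: three-step episode rewarded only at the end
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```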

Deep Q-Learning

Q-Learning is a model-free, off-policy RL technique. Model-free techniques do not require an environment model. Off-policy techniques learn the optimal action policy implicitly by learning the optimal Q-value function. The Q-value function at a given state-action pair (s,a) is a measure of the desirability of taking action (a) in state (s), as illustrated in Eq. (2).

$$\begin{aligned} Q^{\pi }(s,a) = \mathbb {E}\left[ \sum _{k=0}^{T}\gamma ^{k}\,r_{t+k} \,|\, s_{t} = s, a_{t} = a \right] \end{aligned}$$
(2)

Q-Learning employs the \(\epsilon \)-greedy policy as its behavior policy, in which an agent chooses a random action with probability (\(\epsilon \)) and chooses the action that maximizes the Q-value for the (s,a) pair with probability (1-\(\epsilon \)) (see Eq. (3)). In this paper, exploration and exploitation are not set to fixed percentages; instead, the exploration rate decays linearly per episode, as shown in Eq. (4).

$$\begin{aligned} a = \begin{cases} a\sim \mathrm {random}\left( A_{t} \right) , & \text {with } P= \epsilon \\ \mathrm {argmax}_{a} \, Q(s,a), & \text {with } P= 1-\epsilon \end{cases} \end{aligned}$$
(3)
$$\begin{aligned} \epsilon _{n+1} = \epsilon _{\text {initial}} - \epsilon _{\text {decay}} \cdot n \end{aligned}$$
(4)
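A minimal sketch of this behavior policy with linear decay, assuming precomputed Q-value estimates and illustrative parameter values (none of the constants below are taken from the paper), could look as follows:

```python
import random

EPS_INITIAL = 1.0   # illustrative values, not taken from the paper
EPS_DECAY = 0.005
EPS_MIN = 0.05

def epsilon(episode):
    """Linear decay per episode, Eq. (4), clipped at an assumed lower bound."""
    return max(EPS_MIN, EPS_INITIAL - EPS_DECAY * episode)

def select_action(q_values, actions, episode):
    """Epsilon-greedy action selection, Eq. (3).
    q_values: list of Q(s, a) estimates aligned with `actions`."""
    if random.random() < epsilon(episode):
        return random.choice(actions)                   # explore
    return actions[q_values.index(max(q_values))]       # exploit
```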

The simplest form of Q-Learning is the tabular form, which uses an iterative Bellman-based update rule as seen in Eq. (5). Tabular Q-Learning computes the Q-value function for every (s,a) pair in the problem space, which makes it unsuitable for the assembly task at hand due to the complexity and variety of the environment. To overcome the drawbacks of the tabular formulation, the DQN was introduced in [16], in which a neural network is employed as a function approximator for the Q-value of a (s,a) pair.

$$\begin{aligned} Q\left( s,a \right) \leftarrow Q\left( s,a \right) +\alpha \left[ \,r+ \, \gamma \, \max _{{a}'} \, Q({s}',{a}') - Q\left( s,a \right) \right] \end{aligned}$$
(5)
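For illustration, a minimal tabular implementation of the update rule in Eq. (5) could be written as follows; the dictionary-based table, the learning rate, and the discount factor are assumptions for the sketch:

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> estimated Q-value
ALPHA, GAMMA = 0.1, 0.9       # illustrative learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    """One Bellman update of the tabular Q-value, Eq. (5)."""
    best_next = max(Q[(s_next, a_)] for a_ in actions)
    td_error = r + GAMMA * best_next - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error
```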

Deep Recurrent Q-Learning

While Deep Q-Learning can learn action policies for problems with large state spaces, it struggles with sequential problems where the action choice is based on a truncated trajectory of prior states and actions. This challenge motivated the use of a DQN variant that has a memory to encode previous trajectories. In this paper, a Deep Recurrent Q-Network (DRQN) is utilized as a suitable DQN variant. The DRQN was introduced in [17] to solve the RL problem for partially observable Markov decision processes (POMDP). DRQN utilizes long short-term memory (LSTM) layers to add recurrence to the network architecture. The LSTM layer can encode previous (s,a) trajectories, providing enhanced information for learning the Q-values. In addition, the recurrence can account for sensor and communication delays.
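As an illustration of such an architecture, a recurrent Q-network could be sketched as below; the layer sizes, the hidden dimension, and the use of PyTorch are assumptions, since the paper does not specify the framework. The input size of seven matches the state of Eq. (8) and the four outputs match the discrete actions of Eq. (9).

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent Q-network sketch: an LSTM encodes the state trajectory,
    a linear head outputs one Q-value per discrete action."""
    def __init__(self, state_dim=7, n_actions=4, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, sequence_length, state_dim)
        out, hidden = self.lstm(state_seq, hidden)
        q_values = self.head(out[:, -1, :])   # Q-values at the last step
        return q_values, hidden
```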

Action and Learning Loops

The Deep Recurrent Q-Learning algorithm is illustrated in Fig. 1. The algorithm can be divided into two parallel loops: the action loop (green) and the learning loop (yellow). The action loop is responsible for choosing the agent's action: the current environment state is fed through a policy network, which estimates the Q-value function over the current state and the set of available actions. Based on the \(\epsilon \)-greedy exploration rate, the agent action is either the action with the highest Q-value or a randomly sampled action, as illustrated in Eq. (3). At each step, the experience \((s_{t},a_{t},r_{t},s_{t+1})\) is saved in a replay memory. After a predefined number of episodes, the agent starts learning from randomly sampled experience batches. Each experience batch is a sequence of steps with a defined length from a randomly sampled episode. The target network is an additional network serving as a temporarily fixed target for optimization of the Bellman Eq. (5); its weights are copied from the policy network after a number of steps. The policy network estimates the Q-value of the \((s_{t},a_{t})\) pair, while the target network estimates the maximum Q-value achievable in \((s_{t+1})\). The output of both networks is used to compute the loss function in Eq. (6). Gradient descent with backpropagation of the loss is used to update the policy network, as illustrated in Eq. (7).

Fig. 1: Action and Learning Loops

$$\begin{aligned} L_{\theta } = \frac{1}{2}\left[ \text {target} - \text {prediction} \right] ^2 = \frac{1}{2}\left[ \,r+ \, \gamma \, \max _{{a}'} \, Q_{{\theta }'}({s}',{a}') - Q_{\theta }\left( s,a \right) \right] ^2 \end{aligned}$$
(6)
$$\begin{aligned} \theta \leftarrow \theta + \alpha \left( r + \, \gamma \, \max _{{a}'} \, Q_{{\theta }'}({s}',{a}')- Q_{\theta }\left( s,a \right) \right) \nabla _{\theta }Q_{\theta }\left( s,a \right) \end{aligned}$$
(7)
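Putting Eqs. (6) and (7) together, a single learning step could be sketched as follows, assuming `policy_net` and `target_net` are instances of the DRQN sketch above; the optimizer choice, batch layout, and variable names are assumptions for illustration rather than the actual implementation:

```python
import torch
import torch.nn.functional as F

def learning_step(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One optimization step on a batch of step sequences.
    Each tensor has shape (batch, sequence_length, ...)."""
    states, actions, rewards, next_states = batch

    # Q_theta(s, a): Q-value of the action taken at the last step of the sequence
    q_pred, _ = policy_net(states)
    q_pred = q_pred.gather(1, actions[:, -1:]).squeeze(1)

    # max_a' Q_theta'(s', a') from the temporarily frozen target network
    with torch.no_grad():
        q_next, _ = target_net(next_states)
        target = rewards[:, -1] + gamma * q_next.max(dim=1).values

    loss = F.mse_loss(q_pred, target)   # Eq. (6) up to a constant factor
    optimizer.zero_grad()
    loss.backward()                     # backpropagation of the loss, Eq. (7)
    optimizer.step()
    return loss.item()
```

The target network weights would then be refreshed periodically, e.g. with `target_net.load_state_dict(policy_net.state_dict())`.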

5 Search Skill Learning Approach

This paper focuses on the search skill, which is discussed in more detail in the following subsections. Fig. 2 illustrates how the learning process is carried out.

Fig. 2: Illustration of How the Robot Learns a New Skill Using Deep RL

Initial Position

Each episode starts with the peg in a random position. The polar coordinates of the initial position are determined by a predefined radius from the hole's center and a randomly sampled angle between 0 and 2\(\pi \) (see Fig. 3). The advantage of this initialization method is that it keeps the initial distance to the hole center constant while covering the full task space.
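A minimal sketch of this initialization, assuming the hole center and the radius are known in the robot's base frame (the function and variable names are illustrative):

```python
import math
import random

def sample_initial_position(hole_x, hole_y, radius):
    """Sample a start position on a circle of fixed radius around the hole center."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    return (hole_x + radius * math.cos(angle),
            hole_y + radius * math.sin(angle))
```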

Fig. 3: Peg Initial Position Strategy

State

At each time step, the reinforcement learning (RL) agent receives a new state sensed by the robot (see Fig. 2, lower arrows), which consists of the forces in x, y, and z (\(F_{x}, F_{y},F_{z}\)), the moments around x and y (\(M_{x},M_{y}\)), and the rounded positions in x and y (\(\tilde{P_{x}},\tilde{P_{y}}\)), as seen in Eq. (8). In order to provide enough robustness against positional inaccuracy, it was assumed that the hole and the peg are not precisely positioned. \(\tilde{P_{x}},\tilde{P_{y}}\) are computed using the grid indicated in Fig. 4, where C is the margin of the positional error. This approach provides auxiliary inputs to the network, which can considerably accelerate learning convergence.

Fig. 4: Examples of Peg Position Rounding Approach Using Grid Size

$$\begin{aligned} S = \left[ F_{x}, F_{y},F_{z},M_{x},M_{y},\tilde{P_{x}},\tilde{P_{y}}\right] \end{aligned}$$
(8)
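As an illustration of how such a state vector could be assembled, the following sketch rounds the measured positions to the nearest grid cell of size C; the exact rounding rule, sensor access, and variable names are assumptions:

```python
def build_state(fx, fy, fz, mx, my, px, py, c):
    """State vector of Eq. (8): raw forces and moments plus positions
    rounded to a grid of cell size c (the positional error margin)."""
    px_rounded = round(px / c) * c
    py_rounded = round(py / c) * c
    return [fx, fy, fz, mx, my, px_rounded, py_rounded]
```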

Action

A deep neural network (the policy network) is utilized in the current system to estimate a Q-value function for each (s,a) pair, which subsequently yields an action index for the robot controller. The action index is used to select one of four discrete actions (see Fig. 2, upper arrows), each of which applies a constant force in the z-direction (\(F_{z}^{d}\)). According to Eq. (9), the agent alters its desired position in x or y by \(\left( \pm \,d_{x}^{d}, \pm \,d_{y}^{d} \,\right) \). For all four discrete actions, the orientation of the peg (\(R _{x}^{d},R_{y}^{d}\)) is set to zero throughout the search phase. The advantage of maintaining a constant and continuous force in the z-direction is that once the search algorithm finds the hole, the peg height drops by a fraction of a millimeter, which is the success criterion for the search phase.

$$\begin{aligned} a = \left[ d_{x}^{d},d_{y}^{d},F_{z}^{d},R_{x}^{d},R_{y}^{d} \right] \end{aligned}$$
(9)
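A possible mapping from the four discrete action indices to the command vector of Eq. (9) is sketched below; the step size `D_STEP` and the constant downward force `F_Z` are assumed values, not values from the paper:

```python
D_STEP = 0.1   # assumed displacement step in mm
F_Z = 5.0      # assumed constant downward force in N

# action index -> [d_x, d_y, F_z, R_x, R_y], Eq. (9)
ACTIONS = {
    0: [+D_STEP, 0.0, F_Z, 0.0, 0.0],   # step in +x
    1: [-D_STEP, 0.0, F_Z, 0.0, 0.0],   # step in -x
    2: [0.0, +D_STEP, F_Z, 0.0, 0.0],   # step in +y
    3: [0.0, -D_STEP, F_Z, 0.0, 0.0],   # step in -y
}
```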

Reward

A reward function is used to evaluate how much an agent is rewarded or punished for performing an action in the current state. In the proposed methodology, the reward (\(r_{t}\)) is assigned after the completion of each episode. The reward zones for our search task are illustrated in Fig. 5. First, the inner circle (green zone) indicates that the peg has either reached the goal position or that the maximum number of steps (\(k_{max}\)) per episode has been reached with the peg close to the goal. Inside the second circle (white zone), the peg is at a distance less than the initial distance (\(d_{o}\)) and receives a reward of zero. When the robot moves away from the starting position toward the boundaries of the safety limits (yellow zone), the agent receives a negative reward. Finally, the working space barrier is the outer square (red zone), which indicates that the peg is violating the safety restrictions (D) and receives the highest negative reward.
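A sketch of such a zone-based reward is given below; only the zone structure follows Fig. 5, while the numerical reward values and the function signature are assumptions:

```python
def zone_reward(dist_to_hole, x, y, d_initial, d_safety, goal_reached):
    """Assign a reward according to the zones of Fig. 5 (illustrative values)."""
    if abs(x) > d_safety or abs(y) > d_safety:
        return -10.0          # red zone: safety limits violated
    if goal_reached:
        return +10.0          # green zone: hole found or goal reached within k_max
    if dist_to_hole < d_initial:
        return 0.0            # white zone: closer than the initial distance
    return -1.0               # yellow zone: drifting toward the safety limits
```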

Fig. 5: Different Reward Zones

Fig. 6: Experimental Setup in Simulation (left) and in Reality (right)

6 Implementation and Validation

A KUKA LBR iiwa, a sensitive robot arm with an open kinematic chain and integrated sensors, is used for this work. The integrated torque sensors are based on strain gauges in each of the robot's joints and enable the determination of external forces and torques acting on the robot. Force-controlled robot applications are therefore possible when combined with the control approach discussed above. The peg and block used in this study are made of corrosion-resistant stainless steel, which is well suited for this purpose given the continuous force exerted during the experiments. The clearance between the peg and the hole is 30 \(\upmu \)m. The experimental setup is displayed in Fig. 6. As mentioned before, such an assembly is performed with the assistance of artificial intelligence because the task accuracy exceeds the robot precision. According to KUKA, the position repeatability of the LBR iiwa is ± 0.15 mm [18]. This was verified in accordance with DIN 9283:1998 with the help of a high-precision laser tracker (API R50-Radian) with an accuracy of \(\pm 10 \upmu {\text {m}} + 5 \upmu {\text {m}}/{\text {m}}\). The measured repeatability was 0.14 mm, which equates to around five times the clearance between the peg and the hole.

To ensure the data flow between the DRQN and the robot, Message Queuing Telemetry Transport (MQTT) is used. MQTT is a bidirectional network protocol based on the client-server principle rather than on the end-to-end connection paradigm of many other network protocols. Messages are not exchanged directly between clients; instead, communication is event-based and follows the publish-subscribe paradigm [19].

To validate our approach, we conducted experiments consisting of 200 learning episodes followed by several test runs. To obtain near-optimal hyperparameter values, a few tests were conducted in which all variables were kept constant and one was adjusted at a time. During training, the agent was able to identify the hole 130 times out of 200. Two test trials were conducted, in which the agent located the hole 18 times out of 21 and 27 times out of 31, for an overall success rate of 86.5%. As shown in Fig. 7c, the loss decreases during the training process; the loss curve also indicates a well-chosen learning rate. Fig. 7a shows that the peg strives to stay close to the hole position and only drifts further away a few times. The experiments revealed that a sparse reward function (Fig. 7b) is not the best fit for the search challenge and that denser reward functions should be investigated. Fig. 7d shows a cutout of the trajectory for two cases in which the hole was identified (success) and one case in which the defined limits were exceeded (failure).
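As an illustration of the publish-subscribe exchange described above, the following sketch uses the paho-mqtt client library; the broker address, topic names, and JSON payload format are assumptions and not taken from the actual implementation:

```python
import json
import paho.mqtt.client as mqtt

BROKER = "localhost"           # assumed broker address
STATE_TOPIC = "robot/state"    # assumed topic: robot publishes sensed states
ACTION_TOPIC = "agent/action"  # assumed topic: agent publishes action indices

def on_message(client, userdata, msg):
    state = json.loads(msg.payload)            # e.g. {"Fx": ..., "Fy": ..., ...}
    action_index = 0                           # placeholder: query the DRQN here
    client.publish(ACTION_TOPIC, json.dumps({"action": action_index}))

client = mqtt.Client()                         # paho-mqtt 1.x style client
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe(STATE_TOPIC)
client.loop_forever()                          # event-driven message handling
```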

Fig. 7: Experimental results

7 Conclusion and Future Work

This research demonstrated and validated the success of the proposed DRQN-based strategy in addressing a high-precision Peg-in-Hole assembly task using a 7-DOF sensitive robot with integrated sensors. The employed approach successfully completed the search phase. It was also shown that integrating recurrence into the reinforcement learning system via an LSTM layer overcomes DQN's drawbacks: the LSTM layer was able to encode previously taken decisions, allowing the agent to make better informed decisions and to compensate for sensor delays. In the future, the approach will be extended to the insertion phase, and the network architecture will be improved, including tuning the hyperparameters, in order to reach an overall success rate of 100%. In addition, we plan to evaluate continuous action space techniques such as DDPG, DPPO, or NEAT, which could potentially enhance performance.