
1 Introduction

Industrial robots are widely used in manufacturing systems to handle increasingly complex tasks because they provide fast and precise execution of repetitive operations [1]. Nevertheless, robots lack the flexibility and adaptability of humans, and therefore, recent robotics research has focused on human-robot collaboration (HRC), which combines precision and flexibility in manufacturing systems [2].

The places that humans share with robots are called HRC workspaces [2]. Whenever robots work alongside humans in HRC workspaces, safety concerns apply, and safety requirements are laid down in numerous standards (e.g. ISO 10218). In conventional scenarios, robots are separated from humans by dedicated equipment such as protective fences. Since HRC workspaces require humans and robots to coexist in one place, new safety concerns arise, and the former separation-based safety concepts can no longer be maintained in HRC workspaces.

To address the resulting safety problem in HRC workspaces, two major categories of measures are commonly applied [3]. The first category aims to minimize the injury risk when collisions between humans and robots cannot be avoided. Measures in this category include mechanical compliance systems (e.g., viscoelastic coverings [4] or mechanical absorption systems [5]), lightweight robot structures (e.g. [6]), and safety strategies based on collision or contact detection (e.g. [7]). Commercial robots applied in HRC workspaces usually offer one or several of these features [3]. The second category comprises measures that achieve active collision avoidance. These measures capture information about the robot motion and the human operations using vision systems or other sensing modules and, based on this information, generate alternative trajectories to avoid the forecasted collision [3]. Works on collision avoidance based on different sensors can be found in [8,9,10,11,12,13].

In addition to sensor-based approaches, deep reinforcement learning (RL) is another important approach to realize collision-free path planning in HRC workspaces. RL is a subclass of machine learning and consists of two main parts, the agent and the environment [14]. The agent receives a representation of the current state of the environment and selects an action based on a policy. Once the action is performed, the agent receives a reward. The agent aims at learning a policy that maximizes the total discounted future reward [15]. RL has been used successfully in various application fields such as solving complex games [16], job shop scheduling [17], and factory layout planning [18]. Regarding collision problems in HRC workspaces, implementations of RL can be found in a number of studies, e.g. [19,20,21].
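For illustration, the following minimal sketch shows this generic agent-environment loop; the toy environment and random policy are placeholders for illustration only and are not taken from the cited works.

```python
import random

# Minimal sketch of the agent-environment loop described above.
# The toy environment and the random policy are illustrative placeholders.

class ToyEnvironment:
    """1-D toy task: the agent must reach position 5 starting from 0."""
    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):                      # action is -1 or +1
        self.position += action
        done = self.position == 5
        reward = 10.0 if done else -1.0          # penalize every extra step
        return self.position, reward, done

def random_policy(state):
    return random.choice([-1, +1])

def run_episode(env, policy, gamma=0.99, max_steps=100):
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                   # agent selects an action based on its policy
        state, reward, done = env.step(action)   # environment returns reward and next state
        discounted_return += discount * reward   # accumulate the total discounted future reward
        discount *= gamma
        if done:
            break
    return discounted_return

print(run_episode(ToyEnvironment(), random_policy))
```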

However, in existing approaches to RL-based collision avoidance, the robots in the considered HRC workspaces are smaller than humans, and the case in which the robot is larger than the human is not considered. When the robot is small, even if a collision cannot be avoided, it mostly occurs at the human's hands or arms, which does not entail a high risk of fatal injuries. When the robot is larger than the human, however, the collision may occur at the head or torso, resulting in a higher risk of fatal injuries. Therefore, we focus on the case in which the robot is larger than the human and propose an intelligent robotic arm path planning (IRAP2) framework. The IRAP2 framework and its application case are explained in the remainder of the paper.

2 Problem Statement and Methodology

2.1 Problem Statement

In our case, we scaled the scenario down to a desktop scenario, as depicted in Fig. 1. In our desktop-level HRC workspace, the height of the robot base is 138 mm, and the lengths of the first and second connecting links are 135 mm and 147 mm, respectively. Neglecting the degree of freedom (DoF) of the attached vacuum gripper, the robot has 3 DoF, labeled J1 to J3 in Fig. 1. The movement ranges of J1, J2, and J3 are (−135° to +135°), (0° to +85°), and (−10° to +95°), respectively, and the maximum rotation speed of the joints is 320°/s. To make the human model compatible with the small robot, the height of the human is downscaled to 129 mm.
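To make these kinematic constraints explicit, they can be collected in a small configuration structure and used to validate candidate joint angles; the following sketch is purely illustrative and not part of the actual control software.

```python
# Illustrative only: the joint limits of the desktop robot described above,
# expressed as a small configuration used to validate candidate joint angles.

JOINT_LIMITS_DEG = {           # (min, max) rotation in degrees
    "J1": (-135.0, 135.0),
    "J2": (0.0, 85.0),
    "J3": (-10.0, 95.0),
}
MAX_JOINT_SPEED_DEG_S = 320.0  # maximum rotation speed of the joints

def within_limits(angles_deg):
    """Return True if every joint angle lies inside its allowed range."""
    return all(
        lo <= angles_deg[name] <= hi
        for name, (lo, hi) in JOINT_LIMITS_DEG.items()
    )

print(within_limits({"J1": 10.0, "J2": 40.0, "J3": 90.0}))   # True
print(within_limits({"J1": 140.0, "J2": 40.0, "J3": 90.0}))  # False (J1 out of range)
```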

To make the training environment as similar as possible to the real physical environment, we apply the 3D physics engine PyBullet to build a 1:1 virtual model for training the RL model in the IRAP2 framework. The virtual model consists of three parts: a virtual robot arm, a virtual human, and a virtual pick-up object, as depicted in Fig. 1. The problem is defined as finding the shortest path to pick up the blue object without colliding with the human. In this work, we consider four cases: (1) pick up the target object with no human present (as reference case); (2) pick up the object with one human standing at one specific position; (3) pick up the object with one human standing at one of two positions; and (4) pick up the object with two humans standing at two specific positions, as depicted in Fig. 1.
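A minimal sketch of how such a scene can be assembled in PyBullet is given below; the URDF file names, poses, and link indices are hypothetical placeholders for the 1:1 models described above, not the actual assets of the IRAP2 framework.

```python
import pybullet as p
import pybullet_data

# Minimal sketch of a PyBullet scene of the kind described above.
# "robot_arm.urdf" and "human_scaled.urdf" are hypothetical placeholders
# for the 1:1 models of the robot arm and the downscaled human.

p.connect(p.DIRECT)                          # headless physics server (use p.GUI to visualize)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("robot_arm.urdf", basePosition=[0, 0, 0], useFixedBase=True)    # hypothetical model
human = p.loadURDF("human_scaled.urdf", basePosition=[0.15, 0.05, 0])              # hypothetical model
target = p.loadURDF("cube_small.urdf", basePosition=[0.20, -0.10, 0.02])           # pick-up object

def gripper_distances(gripper_link=3):       # hypothetical link index of the vacuum gripper
    """Distances from the gripper to the target object and from the robot to the human."""
    gripper_pos = p.getLinkState(robot, gripper_link)[0]
    target_pos, _ = p.getBasePositionAndOrientation(target)
    d_target = sum((a - b) ** 2 for a, b in zip(gripper_pos, target_pos)) ** 0.5
    # closest distance between any robot link and the human body
    contacts = p.getClosestPoints(robot, human, distance=10.0)
    d_human = min(c[8] for c in contacts) if contacts else float("inf")
    return d_target, d_human

p.stepSimulation()
print(gripper_distances())
```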

Fig. 1. Real and virtual scenario of an HRC workspace

2.2 The IRAP2 Framework Based on DDPG

In an HRC workspace, the motion path of the robot can be regarded as a sequence of decisions and can thus be planned using the Deep Deterministic Policy Gradient (DDPG) algorithm.

Figure 2 illustrates the IRAP2 framework based on the DDPG algorithm. In the virtual 3D physics environment, the virtual robot arm is allowed to explore various positions within the described ranges and obtains a reward according to its interaction with the environment. The action, current state, next state, and reward of the virtual robot are denoted as tuples \(\{{a}_{t},{s}_{t},{s}_{t+1},{r}_{t}\}\), which are stored in the replay buffer and serve as training data for the deep neural networks. In each training iteration, a mini-batch of 64 tuples \(\{{a}_{t},{s}_{t},{s}_{t+1},{r}_{t}\}\) is sampled from the replay buffer, and a critic network with the weights \({w}_{Q}\) calculates the action value \(Q({s}_{t},{a}_{t},{w}_{Q})\), which estimates the cumulated reward of the state \({s}_{t}\) and action \({a}_{t}\). Furthermore, an actor network with the weights \({w}_{\mu }\) is used to obtain the behavioral policy \(\mu ({s}_{t},{w}_{\mu })\), which yields the action of the virtual robot for the next time step. For the stability of the training, two target networks are created for the critic and actor network, denoted as \(Q{^{\prime}}({s{^{\prime}}}_{t},{a{^{\prime}}}_{t},{w{^{\prime}}}_{Q})\) and \(\mu {^{\prime}}({s{^{\prime}}}_{t},{w{^{\prime}}}_{\mu })\), respectively. The weights of the two target networks are updated slowly in each iteration (soft update). The weights of the current critic and actor network in the DDPG algorithm are updated by minimizing the loss functions with the RMSProp optimizer. The loss functions of the actor and critic network (\(L_a\) and \(L_c\)) are expressed as follows.

$$ L_a \left( {w_\mu } \right) = \nabla_\mu Q\left( {s_t ,a_t ,w_Q } \right)\nabla_{w_\mu } \mu \left( {s_t ,w_\mu } \right) $$
(1)
$$ L_c \left( {w_Q } \right) = \left[ {r + \gamma Q^{\prime}\left( {s^{\prime}_t, \mu^{\prime}\left( {s^{\prime}_t ,w^{\prime}_\mu } \right),w^{\prime}_Q } \right) - Q\left( {s_t ,a_t ,w_Q } \right)} \right]^2 $$
(2)

In Eqs. (1) and (2), r denotes the reward, \(\gamma \) is the discount factor, and \(\nabla \) denotes the gradient. The action, state, and reward functions of the four cases are summarized in Table 1, where \({\varphi }_{1}, {\varphi }_{2},{\varphi }_{3}\) are the rotation angles of the joints J1, J2, and J3, respectively. The parameters \(\alpha , \beta ,\varepsilon ,\delta \) are scaling constants that convert the distance and reward values from the environment model in PyBullet into values suitable for training the DDPG neural networks. The parameters \({d}_{ta},{d}_{{h}_{1}},{d}_{{h}_{2}}\) are the distances of the vacuum gripper to the target object, the first human, and the second human, respectively. Finally, the parameter i is an index indicating whether the object has been picked up successfully. After the training, the optimal path of the agent can be exported to control the real robot in the HRC workspace.
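For illustration, a condensed sketch of one DDPG update step corresponding to Eqs. (1) and (2), including the soft update of the target networks and the RMSProp optimizer, is given below; the network architectures, dimensions, and hyperparameters are assumptions and not the values used in the IRAP2 framework.

```python
import copy
import torch
import torch.nn as nn

# Sketch of one DDPG update step corresponding to Eqs. (1) and (2).
# Network sizes and hyperparameters are illustrative assumptions.

STATE_DIM, ACTION_DIM, GAMMA, TAU = 6, 3, 0.99, 0.005

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.RMSprop(actor.parameters(), lr=1e-4)    # RMSProp, as in Sect. 2.2
critic_opt = torch.optim.RMSprop(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One update on a sampled mini-batch (s, a, r, s_next)."""
    # Critic update, Eq. (2): squared TD error against the target networks
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + GAMMA * critic_target(torch.cat([s_next, a_next], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update, Eq. (1): ascend the gradient of Q w.r.t. the policy parameters
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the two target networks
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p_, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p_.data)

# Example call on a random mini-batch of 64 transitions from the replay buffer
ddpg_update(torch.randn(64, STATE_DIM), torch.randn(64, ACTION_DIM),
            torch.randn(64, 1), torch.randn(64, STATE_DIM))
```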

Fig. 2. Illustration of the IRAP2 framework

Table 1. State and reward functions
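Since the exact expressions of Table 1 are not reproduced here, the following sketch merely illustrates how the quantities introduced above (\({d}_{ta},{d}_{{h}_{1}},{d}_{{h}_{2}}\), the scaling constants, and the pick-up index i) could be combined into a shaped reward; its structure and weights are assumptions for illustration only, not the reward functions of the framework.

```python
# Hypothetical reward-shaping sketch. Table 1 defines the actual state and
# reward functions of the four cases; the structure and weights below are
# assumptions for illustration only.

ALPHA, BETA, EPSILON, DELTA = 10.0, 5.0, 5.0, 100.0   # illustrative scaling constants

def shaped_reward(d_ta, d_h1=None, d_h2=None, picked_up=False, collided=False):
    """Reward approaching the target object while staying clear of the humans."""
    reward = -ALPHA * d_ta                      # closer to the target object -> higher reward
    for d_h, scale in ((d_h1, BETA), (d_h2, EPSILON)):
        if d_h is not None:
            reward -= scale / (d_h + 1e-3)      # penalize proximity to each human present
    if collided:
        reward -= DELTA                         # strong penalty for a collision
    if picked_up:                               # index i: successful pick-up
        reward += DELTA
    return reward
```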

3 Results and Discussion

3.1 Evaluation of the Training Process

The first result is the training performance. Figure 3 shows the number of steps (one step corresponds to one action of the agent) and the reward versus the training episodes. A training episode is closed as soon as the target object has been picked up; in addition, the number of steps per episode is limited to 300. Figure 3 shows that in the first case, the agent was not able to grasp the target object until about 200 episodes. From about 200 to 300 episodes, the number of steps is reduced to approximately 100, but the optimal path is not yet found. The reward plot shows that, in the first case, the agent consistently reaches the optimal grasping path after about 300 episodes. In the second case, the step and reward plots clearly show that the agent consistently reaches the optimal path after about 200 episodes. In the third and fourth cases, stable optimal path generation is not achieved until more than 500 episodes.
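The episode handling described above (closing an episode once the object is picked up, otherwise cutting it off after 300 steps) can be sketched as follows; env and agent stand for the PyBullet environment and the DDPG agent of Sect. 2.2 and are placeholders here.

```python
# Sketch of the episode loop used during training: an episode ends when the
# object is picked up or after a maximum of 300 steps. `env` and `agent` are
# placeholders for the PyBullet environment and the DDPG agent of Sect. 2.2.

MAX_STEPS_PER_EPISODE = 300

def run_training_episode(env, agent):
    state = env.reset()
    episode_reward, steps = 0.0, 0
    for steps in range(1, MAX_STEPS_PER_EPISODE + 1):
        action = agent.act(state)                        # actor network plus exploration noise
        next_state, reward, picked_up = env.step(action)
        agent.store(state, action, reward, next_state)   # append tuple to the replay buffer
        agent.update()                                   # one DDPG update on a 64-tuple mini-batch
        episode_reward += reward
        state = next_state
        if picked_up:                                    # close the episode once the object is grasped
            break
    return steps, episode_reward                         # quantities plotted in Fig. 3
```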

Comparing all cases, it can be concluded that the IRAP2 framework achieves the highest training efficiency when there is only one human standing at a fixed position (i.e., in the second case). The third and fourth cases involve more possible positions or more humans, and therefore, the agent needs more episodes. Moreover, the minimum number of steps required to pick up the object is about 30, and the maximum reward is about 150.

Fig. 3. Plots of the steps and rewards versus training episodes

3.2 The Optimal Pick-Up Path Generated by IRAP2 Framework

The optimal pick-up paths for the four cases are shown in Fig. 4. In the 3D plots (diagonal view), each purple or light-blue line frame describes the links between the robot's joints and represents one state of the agent, as outlined for the first case in Fig. 4. The 2D line plots below the 3D plots show the grasping path of the gripper from the top view. The 3D plots show that in the first case, joint J1 rotates directly counterclockwise around the z-axis, the robot's two links descend almost in a straight line, and the gripper picks up the target object directly. In the second case, the gripper is held high and moves parallel while it is close to the human (highlighted by the green box), which reflects the robot's decision to avoid a collision with the human. In the third case, the purple lines represent the robot's path when the human appears at position P1 (yellow human), while the light-blue lines represent the path when the human appears at position P2 (red human). Since the red human stands further outward (in the positive direction of the x-axis), the path of the light-blue lines also lies slightly further outward than the path of the purple lines. In the fourth case, two humans are standing in the HRC workspace at the same time, and the robot picks up the target object by swinging its arm around the two humans from the outside (in the positive direction of the x-axis) in order to avoid a collision. The 2D plots confirm that the gripper successfully avoids collisions with the humans.

Fig. 4. The optimal pick-up paths generated by the IRAP2 framework

3.3 Validation of the Robot Control

Finally, the generated optimal paths were successfully applied in the four scenarios. Figure 5 shows the control process in the second case as an example.

Fig. 5. Validation of the optimal path in the case 'pick up the object with one human standing'

In this case, it can be seen that from 0 to 2 s, the robot arm lifts upward to avoid a collision with the human. From 2 to 10 s, the robot arm moves around the human and picks up the target object. Without our algorithm, the robot would move in a straight line directly toward the target object and crash into the human model. With this validation, we successfully confirmed the feasibility of the IRAP2 framework.

4 Conclusion and Outlook

In this work, we have confirmed the feasibility of the IRAP2 framework and the DDPG algorithm for generating optimal robot paths in HRC workspaces. Moreover, in a desktop HRC workspace scenario, we studied the case in which the robot is larger than the humans and considered different working conditions. In terms of future work, it is suggested to upscale the implementation scenario of the IRAP2 framework to a real industrial HRC workspace. Moreover, in this work, all humans are assumed to be standing at one position or moving between two certain positions; in future work, more complex movement behavior of the humans must be considered. In addition, it should be mentioned that large robots are more dangerous than small robots, and the implementation of such HRC requires a higher level of safety measures. Furthermore, cases in which robot and human are of similar size should be considered in future work; such a problem may be more complex, since the robot and the human then have similar velocities and workspace dimensions, and the robot needs more accurate and faster response capabilities. Finally, future work should also focus on improving the computing efficiency of the algorithm and on comparing our approach with other existing optimization approaches such as genetic algorithms or with other RL approaches such as the normalized advantage function.