1 Introduction

Until recently, the paradigm universally adopted in industrial robotics prescribed the strict segregation of robots in protected environments, enclosed by fences or optical barriers. Only recently have the potential benefits of collaboration between humans and robots gained the attention of roboticists [10], mainly motivated by the Industry 4.0 paradigm [15]. Since many tasks still cannot be fully automated, it seems natural to let humans and robots cooperate: highly cognitive actions are undertaken by humans, while those requiring high precision and repeatability are performed by robots. A well-established body of literature has addressed the problem of safe coexistence in a shared space by introducing motion control techniques [6], based on sensors that perceive the scene and track the motion of the human operators. Corrective trajectories can be planned on-line with the aim of dodging the human while still driving the manipulator to the desired target position.

Such initial works conceived robots as something that should interfere as little as possible with the humans populating the same cell, and only opened the door to real collaboration. The ability of a properly instrumented robotic device to understand, and to some extent predict, human intentions is now considered as important as safety, and it can be achieved by providing robots with the proper cognitive capabilities. In this context, vision sensors play a crucial role, since the analysis of human motion is one of the most important features. Many results have been reported showing the increasing capability of robots to semantically interpret their human fellows. In [11] a method based on conditional random fields (CRFs) is used by the robot to anticipate its assistance. In [12] Gaussian Mixture Models (GMMs) are used to predict human reaching targets.

Collaborative assemblies are typical applications of collaborative robotics. In such scenarios, the actions of the robots influence those of the humans, and the optimization of the robotic action sequence becomes crucial. Reference [4] describes a genetic algorithm for a collaborative assembly station which minimises the assembly time and costs. In [14], a trust-based dynamic subtask allocation strategy for manufacturing assembly processes is presented. The method, which relies on a Model Predictive Control (MPC) scheme, accounts for human and robot performance levels, as well as for their bilateral trust dynamics. Taking inspiration from real-time processor scheduling policies, the authors of [9] developed a multi-agent task sequencer, where task specifications and constraints are solved using a MILP (Mixed Integer Linear Programming) algorithm, showing near-optimal task assignments and schedules. Reference [13] proposes a task assignment method, based on the exploration of possible alternatives, that enables the dynamic scheduling of tasks to available resources between humans and robots. Many alternative plans are computed in [16], where a dynamic allocation of activities is also possible, by taking into account the different capabilities of the agents (robots and humans). Finally, in [3], the scheduling problem is solved using a Generalised Stochastic Petri Net as a modelling tool. The selection of the optimal plan takes into account the amount of time for which the agents remain inactive, waiting for the activation of some tasks.

All the aforementioned works seem to model the human operators as no more than highly cognitive manipulators that can be instructed to do certain actions. In this work, timed Petri nets are adopted to model the human-robot multi-agent system; the human, however, is regarded as an uncontrollable agent, whose actions must be recognized in order to decide which complementary actions the robots should undertake. The remainder of this chapter is structured as follows: Sect. 13.2 discusses the activity recognition problem, while Sect. 13.3 presents the approach followed to predict future activities. Such predictions are crucial for applying the scheduling approach described in Sect. 13.4.

2 Recognizing the Human Actions

Recognizing the actions performed by a human is easy and natural for another human being: it is mainly done by analysing the motion of the arms. Computer vision algorithms follow the same approach. In the following, an approach based on RGB-D cameras and factor graphs is discussed. Assume that the (finite) set of possible human actions \(\mathcal {A} = \lbrace a_1, \ldots , a_m \rbrace \) is known. The aim of the algorithm proposed in this section is to detect the starting and ending time of each action, by analysing the motion performed by the operator in the recent past. RGB-D sensors can be exploited to keep track of some points of interest in the human silhouette, see the top left corner of Fig. 13.1. The retrieved signals are subdivided into many sub-windows, each having a maximum length of \(l_w\) samples. Then, a feature vector \(F^i_O = \begin{bmatrix} O^i_1&\ldots&O^i_F \end{bmatrix}\) made of F components can be extracted from the ith window, representing an indirect indication of the action \(a \in \mathcal {A}\) that was performed within the same window. A good selection of the feature set is crucial and can be made, for instance, with deep learning techniques. Clearly, a specific action a must produce characteristic values for \(F_O\). In the layout proposed in Sect. 13.4, the inter-skeletal distances as well as the distances of the operator’s wrists from some particular positions in space were empirically found to be effective.
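
The windowing and feature extraction just described can be sketched as follows. This is only a minimal illustration under stated assumptions: each sample is assumed to be a dictionary mapping a joint name to its 3D position, the features are assumed to be window-averaged distances, and the function names `split_windows` and `window_features` are hypothetical (they do not come from the chapter).

```python
import math
from itertools import combinations

def split_windows(samples, l_w):
    """Partition the samples into consecutive windows of at most l_w samples."""
    return [samples[i:i + l_w] for i in range(0, len(samples), l_w)]

def window_features(window, joints, wrists, landmarks):
    """Feature vector of one window (assumed form: window-averaged distances).

    The first components are the mean inter-skeletal distances between every
    pair of tracked joints; the remaining ones are the mean distances of each
    wrist from each landmark position of the workspace.
    """
    features = []
    for a, b in combinations(joints, 2):
        features.append(sum(math.dist(s[a], s[b]) for s in window) / len(window))
    for w in wrists:
        for p in landmarks:
            features.append(sum(math.dist(s[w], p) for s in window) / len(window))
    return features
```

Averaging over the window is just one possible choice; any other statistic of the distances could replace it without changing the rest of the pipeline.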

Fig. 13.1 An RGB-D camera can be used to keep track of the skeletal points (cyan dots in the top left corner) and compute the skeletal distances over time. The segmenting graph is built considering a segmentation hypothesis \(\rho \) and the observations are partitioned accordingly

The main problem to overcome when segmenting human actions is that the exact durations of the actions (and consequently their starting and ending time instants), as well as their total number, are not precisely known. To handle this aspect, a probabilistic framework must be considered, and the most probable sequence of actions that produced a certain macro-window of observations must be determined. This problem can be tackled with factor graphs [7], having as hidden variables the sequence of actions actually performed by the human and as evidence a battery of features \(F^{1,\ldots ,L}_O\). Vector \(\rho = \begin{bmatrix} \rho _1&\ldots&\rho _S \end{bmatrix}\) is adopted for describing the durations of human actions as well as their number. Indeed, \(\rho _i\) indicates the fraction of time spent by the human in performing the ith action in the sequence. Knowing \(\rho \), it is possible to build the underlying factor graph by connecting \(\rho _i \cdot L\) observations to the node representing the ith action, see Fig. 13.1. The potentials \(\Psi ^i_{OA}\) must express the correlation existing between the features and the actions, while those correlating the actions over time, \(\Psi ^i_{AA}\), should take into account precedence constraints among the actions or any kind of prior knowledge about the process. Since the real segmentation \(\rho ^{*}\) describing the sequence of actions is not available, the proposed approach considers many hypotheses (see Footnote 1) \(\rho ^1, \rho ^2, \ldots \), which are iteratively compared with the aim of finding the optimal one \(\hat{\rho }\), i.e. the one most in accordance with the observations retrieved from the sensors. More specifically, a genetic algorithm [8] can be efficiently adopted, assuming as a fitness function \(\mathcal {C}\) the following likelihood:

$$\begin{aligned} \mathcal {C}(\rho ) = \mathcal {L}(\rho | O^1_1, \ldots , O^L_F) = \mathbb {P}(O^1_1, \ldots , O^L_F | \rho )\mathbb {P}(\rho )_{prior} \end{aligned}$$

where \(\mathbb {P}(O^1_1, \ldots , O^L_F | \rho )\) can be computed by performing belief propagation on the factor graph pertaining to a specific hypothesis \(\rho \). Determining the optimal segmentation \(\hat{\rho }\) completes the segmentation task: the number of actions performed by the operator is assumed equal to the number of segments in \(\hat{\rho }\), and for the jth segment the marginal distribution \(\begin{bmatrix} \mathbb {P}(A_j = a_1 | O^1_1, \ldots , O^L_F , \hat{\rho })&\ldots&\mathbb {P}(A_j = a_m | O^1_1, \ldots , O^L_F , \hat{\rho }) \end{bmatrix}\) expresses in a probabilistic way which action was performed within that segment.
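
To make the role of the fitness function more concrete, the following sketch evaluates one segmentation hypothesis \(\rho \). It assumes a chain-structured factor graph, for which belief propagation reduces to a single forward sum-product pass; the potentials `psi_oa` and `psi_aa`, the prior over the first action and all function names are hypothetical placeholders, and the genetic search over the hypotheses is not shown.

```python
def segment_bounds(rho, L):
    """Map the fractions in rho (summing to 1) to index ranges over L observations."""
    bounds, start, acc = [], 0, 0.0
    for j, r in enumerate(rho):
        acc += r
        end = L if j == len(rho) - 1 else min(L, round(acc * L))
        bounds.append((start, end))
        start = end
    return bounds

def evidence(rho, obs, actions, psi_oa, psi_aa, action_prior):
    """Unnormalised P(O | rho), by a forward sum-product pass along the action chain."""
    msg = None
    for start, end in segment_bounds(rho, len(obs)):
        # Product of the observation potentials attached to this action node.
        local = {}
        for a in actions:
            p = 1.0
            for o in obs[start:end]:
                p *= psi_oa(o, a)
            local[a] = p
        if msg is None:
            msg = {a: action_prior[a] * local[a] for a in actions}
        else:
            msg = {a: local[a] * sum(msg[b] * psi_aa(b, a) for b in actions)
                   for a in actions}
    return sum(msg.values())

def fitness(rho, obs, actions, psi_oa, psi_aa, action_prior, rho_prior=lambda r: 1.0):
    """Fitness C(rho): evidence of the observations times the prior of the hypothesis."""
    return evidence(rho, obs, actions, psi_oa, psi_aa, action_prior) * rho_prior(rho)
```

A genetic algorithm would call `fitness` on every individual of the population; the per-segment marginals would additionally require a backward pass, which is omitted here for brevity.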

3 Predicting the Human Actions

Knowing the sequence of actions performed in the recent past, predictions about the future ones can be made. In particular, we are interested in evaluating the waiting time \(\tau ^a\) before a particular action \(a \in \mathcal {A}\) is seen again.

Human assembly sequences usually form quasi-repetitive patterns. In other words, the sequence of human activities can be modelled through a time series, which is the output of a certain dynamic process. Denoting by \(A_k\) the ongoing activity at discrete time instant k, the behaviour of the human co-worker can be modelled through the following discrete-time process:

$$\begin{aligned} \begin{aligned} A_{k+1}&= f\left( A_k, A_{k-1}, A_{k-2}, \dots , A_{k-n}\right) \\ t_{k+1}&= t_k + g\left( A_k\right) \end{aligned} \end{aligned}$$

where \(t_k \in \mathbb {R}^+ \cup \left\{ 0\right\} \) represents the time instant corresponding to the transition from \(A_{k-1}\) to \(A_k\) and \(g\left( a\right) = T^a > 0\) is the duration of activity \(a \in \mathcal {A}\). The stochasticity of the underlying discrete process governing the sequence of activities can be modelled by making use of a higher-order Markov model [5], which computes the predictive probability distribution associated with the next activity in the following way:

$$\begin{aligned} \begin{aligned}&\mathbb {P}\left( A_{k+1} = a \left| A_{k} = k_0, \dots , A_{k-n} = k_{n}\right. \right) \approx \\&\approx \sum _{i=0}^{n}{\lambda _i \mathbb {P}\left( A_{k+1} = a \left| A_{k-i} = k_{i}\right. \right) } \end{aligned} \end{aligned}$$

that is, a mixture model that reduces to a usual Markov chain for \(n=0\) and requires only \(m^2\left( n+1\right) \) parameters. A prediction of the probability distribution \(\hat{X}_{k+1}\) at time \(k+1\) can be computed as

$$\begin{aligned} \hat{X}_{k+1} = \sum _{i=0}^{n}{\lambda _i Q_i X_{k-i}} \end{aligned}$$

where \(X_{k+1} = \begin{bmatrix} \mathbb {P}(A_{k+1}=a_1)&\ldots&\mathbb {P}(A_{k+1}=a_m) \end{bmatrix}^T\) refers to the probability distribution describing the state of the system (see Footnote 2) at step \(k+1\), while \(Q_i\) denotes the i-steps transition probability matrix, which can be simply evaluated through counting statistics. The weights \(\lambda _i\) are estimated on-line, by minimizing the prediction errors on a window of n recent observations:

$$\begin{aligned} \min _{\lambda } \bigg ( \left\| \sum _{i=0}^{n}{\lambda _i Q_i X_{k-i}} - X_{k+1}\right\| ^2 \bigg ) \end{aligned}$$

It is not difficult to show that this leads to the solution of a quadratic programming problem in the following form:

$$\begin{aligned} \min _{\lambda }{\left\| A\lambda -b\right\| ^2} \text{ subject } \text{ to } \sum _{i=0}^{n}{\lambda _i} = 1\text{, } \text{ and } \lambda _i \ge 0 \end{aligned}$$
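
The prediction model and the estimation of its weights can be sketched as follows. The sketch assumes that activities are encoded as integers in `0..m-1`, that the observed states \(X_{k-i}\) are one-hot vectors (so that \(Q_i X_{k-i}\) is simply a column of \(Q_i\)), and that `seq` already contains the window of recent observations. The constrained least-squares problem is solved here with a plain projected gradient method, chosen only to keep the example dependency-free; any quadratic programming solver could be used instead. All function names are hypothetical.

```python
def lag_transition_matrices(seq, m, n):
    """Q[i][a][b] estimates P(A_{k+1}=a | A_{k-i}=b) by counting occurrences in seq."""
    Q = []
    for i in range(n + 1):
        counts = [[0.0] * m for _ in range(m)]
        for k in range(i, len(seq) - 1):
            counts[seq[k + 1]][seq[k - i]] += 1.0
        for b in range(m):
            col = sum(counts[a][b] for a in range(m))
            for a in range(m):
                # Columns never observed fall back to a uniform distribution.
                counts[a][b] = counts[a][b] / col if col > 0 else 1.0 / m
        Q.append(counts)
    return Q

def predict(Q, lam, history):
    """Mixture prediction; history[i] is the activity observed at step k-i."""
    m = len(Q[0])
    return [sum(lam[i] * Q[i][a][history[i]] for i in range(len(lam)))
            for a in range(m)]

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def fit_weights(Q, seq, steps=500):
    """Minimise ||A lam - b||^2 over the simplex by projected gradient descent."""
    n1, m = len(Q), len(Q[0])
    rows, rhs = [], []
    for k in range(n1 - 1, len(seq) - 1):
        for a in range(m):
            rows.append([Q[i][a][seq[k - i]] for i in range(n1)])
            rhs.append(1.0 if seq[k + 1] == a else 0.0)
    lam = [1.0 / n1] * n1
    # trace(A^T A) bounds the largest eigenvalue, so this step size is safe.
    lr = 0.5 / max(sum(x * x for r in rows for x in r), 1e-12)
    for _ in range(steps):
        res = [sum(r[i] * lam[i] for i in range(n1)) - b for r, b in zip(rows, rhs)]
        grad = [2.0 * sum(r[i] * e for r, e in zip(rows, res)) for i in range(n1)]
        lam = project_simplex([l - lr * g for l, g in zip(lam, grad)])
    return lam
```

On a sequence such as `[0, 0, 1, 1]` repeated several times, the next activity is fully determined by \(A_{k-1}\) but not by \(A_k\), so the fitted weights are expected to concentrate on the lag-one term.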

Assume that the duration of an activity \(a \in \mathcal {A}\), i.e. \(T^a\), can be modelled as a stochastic variable with a strictly positive lower bound, i.e. \(T^a \ge T_{min}^a > 0\). In order to estimate the waiting time needed for a certain activity a to show up, say \(\tau ^a\), we can combine this information with the model describing the activity sequence, Eq. (13.3). In particular, at the present continuous time instant \(\bar{t}\), given the sequence of the last activities (possibly including the currently running one) \(A_k, A_{k-1}, \ldots , A_{k-n}\), we would like to estimate the probability distribution of the waiting time for the beginning of a certain activity a, i.e. \(\mathbb {P}\left( \tau ^{a} \le t \left| A_k, A_{k-1}, \ldots , A_{k-n} \right. \right) \). The key idea is to construct a predictive reachability tree. Then, by evaluating the time spent to traverse each possible branch of the tree terminating with the desired activity \(a \in \mathcal {A}\), it is possible to estimate \(\tau ^a\). Since the reachability tree is, in principle, infinite, a prediction horizon \(\Delta T\) must be defined, meaning that the given probability will be computed up to the instant \(t = \bar{t} + \Delta T\). The probability associated with each branch can be simply computed by multiplying the probabilities of the arcs of the branch, i.e. \(p_{{branch}} = \prod _{\left( i,j\right) \in {branch}}{p_{\left( i,j\right) }}\). As for the waiting time associated with each branch, \(\tau _{branch}\), this is simply the sum of the durations \(T^a\) of the activities along the branch, i.e. \(\tau _{{branch}} = \sum _{j: \left( i,j\right) \in {branch}}{T^j}.\) The time associated with each branch is therefore a sum of stochastic variables, whose distribution is empirically approximated with a Monte Carlo numerical approach, using the statistics of recently acquired samples. Finally, given the distributions of the times associated with each branch, the overall distribution of the waiting time of the activity a can be simply computed as a weighted sum of the distributions of the waiting times associated with each branch, i.e.

$$\begin{aligned} \begin{aligned}&\mathbb {P}\left( \tau ^{a} \le t \left| A_k, A_{k-1}, \dots , A_{k-n} \right. \right) =\\&=\sum _{branch}{p_{branch} \mathbb {P}\left( \tau _{branch} \le t\right) }. \end{aligned} \end{aligned}$$
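
A possible numerical counterpart of this computation is sketched below. Rather than enumerating the reachability tree explicitly, the sketch samples random branches, which yields a Monte Carlo estimate of the same weighted sum. Several assumptions are made for illustration only: `next_dist` is a hypothetical callback returning the predicted distribution of the next activity given the history, `durations` maps each activity to a list of recently observed durations, the waiting time is measured up to the start of the target activity, and the time already elapsed in the running activity is ignored.

```python
import random

def sample_waiting_time(history, target, next_dist, durations, horizon, rng):
    """Follow one random branch; return the waiting time or None beyond the horizon."""
    hist = list(history)                 # hist[0] is the currently running activity
    t = rng.choice(durations[hist[0]])   # assumed: its full duration is still ahead
    while t <= horizon:
        probs = next_dist(hist)
        nxt = rng.choices(range(len(probs)), weights=probs)[0]
        if nxt == target:
            return t
        t += rng.choice(durations[nxt])  # empirical resampling of the duration
        hist = [nxt] + hist[:-1]
    return None

def waiting_time_cdf(history, target, next_dist, durations, horizon, t,
                     runs=2000, seed=0):
    """Estimate P(tau^a <= t | history) as the fraction of sampled branches within t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        tau = sample_waiting_time(history, target, next_dist, durations, horizon, rng)
        if tau is not None and tau <= t:
            hits += 1
    return hits / runs
```

Branches that do not reach the target within the horizon contribute no probability mass, which mirrors the truncation of the tree at \(\bar{t} + \Delta T\).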

Fig. 13.2 Example of task allocation of some assembly sequences

4 Assistive Scheduling

Detecting and predicting the human actions is crucial when planning the robotic ones. Assuming the tasks to be pre-assigned to all the agents (robots and humans), the aim of scheduling essentially becomes to control the robots in a way that complies as much as possible with the (predicted) human intentions. To this purpose, the cell is modelled as a multi-agent system by making use of timed Petri nets [1]. In particular, the robots are modelled as usual controllable agents, while humans are assumed to be uncontrollable, even though they affect the system when performing the assigned tasks. The timed Petri nets can be built by following systematic rules, starting from a description of the tasks assigned to both humans and robots as well as their precedence constraints, see Figs. 13.2 and 13.3. The robotic tasks are modelled in a canonical way with two transitions: the first starting the task and the second terminating it. For every robot, an additional action is always considered, consisting of idling for a quantum of time (orange transitions in Fig. 13.3), which postpones the start of the following action (multiple consecutive idles are also possible). The human actions are modelled with three transitions: the first one firing when the human is predicted to be ready to start the corresponding action (also referred to as the intentional transition), the second one firing as soon as the human actually starts that action, and the last one terminating it. Red places in Fig. 13.3 are those related to idle states of the agents.
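
The modelling rules above can be illustrated with a very small net. The sketch below is not the net of Fig. 13.3: it is a hypothetical example with one robot task and one human action, where the human action needs the part produced by the robot. The class and place names, the `delay` attribute and the choice of which transitions are marked as controllable are assumptions made for illustration; timing is only stored, not simulated.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    name: str
    inputs: dict                 # place name -> tokens consumed
    outputs: dict                # place name -> tokens produced
    controllable: bool = False
    delay: float = 0.0           # hypothetical firing delay of the transition

@dataclass
class PetriNet:
    marking: dict
    transitions: list = field(default_factory=list)

    def enabled(self):
        """Transitions whose input places hold enough tokens."""
        return [t for t in self.transitions
                if all(self.marking.get(p, 0) >= n for p, n in t.inputs.items())]

    def fire(self, name):
        """Fire the enabled transition with the given name and update the marking."""
        for t in self.enabled():
            if t.name == name:
                break
        else:
            raise ValueError(f"transition {name!r} is not enabled")
        for p, n in t.inputs.items():
            self.marking[p] -= n
        for p, n in t.outputs.items():
            self.marking[p] = self.marking.get(p, 0) + n

def build_example_net(quantum=1.0, robot_time=4.0, human_time=3.0):
    """One robot task (start, end, idle) feeding one human action (intent, start, end)."""
    transitions = [
        Transition("robot_start", {"robot_idle": 1}, {"robot_busy": 1}, controllable=True),
        Transition("robot_end", {"robot_busy": 1},
                   {"robot_idle": 1, "part_ready": 1}, delay=robot_time),
        Transition("robot_wait", {"robot_idle": 1}, {"robot_idle": 1},
                   controllable=True, delay=quantum),
        Transition("human_intends", {"human_idle": 1}, {"human_ready": 1}),
        Transition("human_start", {"human_ready": 1, "part_ready": 1}, {"human_busy": 1}),
        Transition("human_end", {"human_busy": 1}, {"human_idle": 1}, delay=human_time),
    ]
    return PetriNet({"robot_idle": 1, "human_idle": 1}, transitions)
```

In the initial marking the robot can either start its task or idle for one quantum, which is exactly the kind of controllable conflict the scheduler has to resolve.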

Fig. 13.3 The timed Petri nets adopted to model the co-assemblies reported in Fig. 13.2. The global net modelling the system can be obtained by superimposing the three reported ones. The transitions reported in yellow are those assumed as controllable. Orange transitions are controllable too and correspond to idling for a quantum of time. Red places are those related to the idle state of an agent of the system

Scheduling essentially consists in deciding the commands to dispatch to the robots or, in other words, in resolving the controllable conflicts. To this purpose, a model predictive approach is exploited, and the underlying Petri net is used to compare possible future evolutions of the system with the aim of finding those minimizing the inactivity times and, indirectly, maximizing the throughput. Since the system is only partially controllable, a stochastic approach is adopted and the optimal policy is recomputed every time by exploring the reachability tree of the net with a Monte Carlo simulation; details are given in [2]. The predictions about the future human activities, Sect. 13.3, are used to build the distributions modelling the intentional transitions. The pipeline of Fig. 13.4 summarizes the entire approach.
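
The idea of comparing candidate robot decisions over sampled human behaviours can be illustrated with a deliberately simplified toy problem, which is not the algorithm of [2]. Here a single robot has a set of pending tasks, each feeding a human action whose predicted starting time is random; the candidate policies are the possible execution orders, and the cost is the time the human spends waiting for the robot. The callback `sample_ready`, the cost function and all names are hypothetical.

```python
import random
from itertools import permutations

def human_waiting(order, task_time, ready_times):
    """Total time the human waits when the robot executes the tasks in this order."""
    clock, waiting = 0.0, 0.0
    for task in order:
        clock += task_time[task]                        # robot completes the task
        waiting += max(0.0, clock - ready_times[task])  # human was ready earlier
    return waiting

def best_order(tasks, task_time, sample_ready, runs=1000, seed=0):
    """Pick the execution order with the lowest average waiting over sampled scenarios."""
    rng = random.Random(seed)
    # The same sampled scenarios are reused for every candidate order.
    scenarios = [sample_ready(rng) for _ in range(runs)]

    def cost(order):
        return sum(human_waiting(order, task_time, s) for s in scenarios) / runs

    return min(permutations(tasks), key=cost)
```

In the approach of the chapter the scenarios would come from the distributions attached to the intentional transitions, and the decision would be recomputed whenever new observations become available.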

Fig. 13.4 The assistive scheduling approach

5 Results

The assistive scheduling approach described so far was applied to a realistic co-assembly, whose workspace consisted of one ABB IRB140 robot, the ABB YuMi collaborative dual-arm robot, two automatic positioners (treated like additional robots to control) and a Microsoft Kinect v2 as monitoring device, see Fig. 13.5. The aim of the collaboration was to perform the assembly of a box containing a USB pen (see Footnote 3). Twenty volunteers were enrolled for the experiments and were divided into two equally sized groups, named Group 1.A and Group 1.B. Operators in Group 1.A performed the assembly while the strategy discussed in Sect. 13.4 was adopted. For those in Group 1.B, instead, a centralized strategy was adopted for deciding the actions of all agents, treating the humans like additional robots to control.

The pictures at the top of Fig. 13.6 report the inactivity times measured during the experiments. As can be seen, the assistive approach applied for Group 1.A significantly reduces the wasted time. This is reflected in an improved throughput of the system, as can be appreciated in the bottom left picture of Fig. 13.6. It is worth recalling that the approach followed to predict the human intentions, Sect. 13.3, is data-driven and adapts to the human over time. This gives the scheduler some learning capabilities: the produced plans constantly improve the way the robots are controlled, see the bottom right corner of Fig. 13.6.

Fig. 13.5 The experimental setup with the two robots, the carts and the human operator

Fig. 13.6 Results showing the benefits of assistive scheduling (Group 1.A) with respect to a centralized approach (Group 1.B)

6 Conclusions

This work studied how to allow humans and robots to actively collaborate in performing shared tasks, like the ones involved in co-assemblies. Humans were not conceived as additional controllable agents instructed to perform actions scheduled by a centralized planner. Instead, the decision-making capabilities of the operators were exploited, since they were allowed to drive the evolution of the plant. In this context, the robotic mates were endowed with the cognitive capabilities required for both recognizing and predicting the human actions. Regarding action recognition, particular attention was paid to handling the fact that the starting and ending time instants of the actions are not known. To overcome this issue, genetic algorithms were deployed in conjunction with factor graphs. As for the prediction problem, a data-driven approach was described, using an adaptive higher-order Markov model to forecast the sequence of human actions. The scheduling of the robotic actions was performed with a scenario-based approach, which explores the reachability tree of a timed Petri net modelling the agents in the robotic cell.