1 Introduction

In multi-agent systems, the realization of cooperative behavior is a challenging issue. The design of a cooperative system using multiple agents requires a comprehensive understanding of the entire system, including the environmental structure, the capabilities of individual agents, and the task structures. However, it is difficult to investigate the characteristics of the entire system during the design phase. In addition, in a complex and dynamic environment with interactions among agents, a static system design with predetermined and limited cooperative behavior may be inefficient and inflexible. Therefore, the emergence of cooperative behavior through the autonomous learning of the agents, and the flexible adaptation of that behavior in a dynamic environment, are highly desirable.

Reinforcement learning (RL) is often used to achieve cooperative behavior through the autonomous learning of individual agents. However, to apply RL to a multi-agent system, it is necessary to prevent learning inefficiencies due to the explosion of the state table, which results from the large number of environmental states that contain other agents, as well as from the increase in the number of agents. Agents also require a learning mechanism that reflects the instability and uncertainty caused by changes in the behavior of other agents, because those agents are learning simultaneously. In particular, if a task consists of a number of subtasks, each of which must be executed by a different type of agent in a certain order within a limited time, agents must learn to balance the efficient execution of their own allocated subtasks with coordinated/cooperative behavior that supports or facilitates the subtasks of others. Such coordinated behavior may slightly reduce the performance of their own subtasks, but it improves the overall efficiency from a social perspective. Therefore, the design of an appropriate reward allocation scheme to obtain this balance is extremely important. In a small-scale static environment, it may be easy to design an appropriate reward allocation for the coordinated behavior of agents, but in a dynamic and complicated environment, it is not obvious how to design a reward allocation that encourages cooperative/coordinated behavior as well as behaviors for the agents’ own responsible tasks. This is because the contribution of the agents’ cooperative behavior is indirect and often difficult to identify, and its effect usually appears with some delay.

Recent studies have applied deep reinforcement learning (DRL) to single agents and have produced results in various fields such as robot control [12, 23] and computer game playing [16, 19]. In the research of multi-agent systems, several studies involving multi-agent deep reinforcement learning (MADRL), which applies DRL to achieve cooperative behavior among agents, have been proposed by Foerster et al. [10] and Palmer et al. [22]. In general, DRL requires a carefully developed reward design to learn appropriate action values. Therefore, for each agent to appropriately learn the action values in MADRL, the reward allocation must balance the actions necessary for performing its own subtasks and the cooperative actions for other agents. However, to the best of our knowledge, there have been only a few studies on the effect of reward allocation on the emergence of cooperative behavior in MADRL, especially for sequential executions by heterogeneous agents, and on the mechanism by which the resulting cooperative behavior improves system efficiency. The purpose of our study is to see whether each agent can learn to act while considering the positions of other learning agents in order to coordinate, rather than considering only its own task execution, when continuous and sequential executions are required for task completion. This study is motivated by our target applications, in which various types of autonomous robots need to complete their own tasks sequentially within a limited time on a construction site. For example, one robot carries a wall material, a second robot installs the carried wall, and another robot then sweeps the area to remove debris for subsequent tasks; this series of tasks should be completed as quickly as possible.

To investigate the effect of a reward distribution on the emerging cooperative behavior, we introduce the multi-task problem, which is an abstraction of tasks performed by robots (agents) on a construction site, where two types of agents with different functions have to complete their own responsible subtasks sequentially under a limited time constraint; the first agent’s effort is wasted if the time limit expires. From the perspective of the first agent, the true completion of the required task occurs after the completion of the subsequent subtask performed by one of the second agents, so there is a delay between the execution of its own subtask and the actual completion of the entire task. Furthermore, the numbers of the two types of agents differ because the times required to complete their subtasks differ. Nevertheless, they have to coordinate their executions to meet the time constraint. We already reported a preliminary method for this problem in [17], but it is based on limited experimental results and our contribution has not been fully discussed. Therefore, to clarify our contribution in this paper, we further refined the model of agents and tasks, and conducted additional experiments to confirm whether agents can learn cooperative behaviors using our method even if their own performance (number of subtasks completed) is slightly reduced in situations where different agents must coordinate in unbalanced numbers. We then added a detailed discussion of the results to clarify the characteristics of the coordinated behavior learned by the proposed method.

The contribution of this paper is threefold. First, we propose a two-stage reward allocation method that balances actions for the agents’ own responsible subtasks and cooperative actions for others. With this allocation method, the first type of agent receives an immediate first reward for its subtask and a delayed second reward for the completion of the entire task. Second, we extend the buffered experience replay, which is used to sample the training data, to fit the proposed reward allocation method. We then investigate whether cooperative behaviors can be established, and if so, we analyze how the characteristics of these behaviors change when the ratio between the first and second rewards and the timing of the rewards are varied, so that agents can achieve both efficient learning of the behavior for executing their own subtasks and learning of cooperative behavior that facilitates the execution of other agents’ subtasks.

Third, to acquire both behaviors, we propose a method in which the ratio between the first and second rewards in the two-stage allocation varies as learning progresses. In fact, when rewards were assigned only at the end, i.e., only when the two types of agents complete a task, neither type of agent could sufficiently learn the behaviors necessary for executing its own subtasks, resulting in poor performance; agents also could not learn cooperative behavior owing to the insufficient chances of learning. Conversely, if large rewards are given when their own subtasks are completed, agents improve the efficiency of executing their own subtasks but neglect cooperative behaviors, resulting in a large amount of wasted effort owing to time expiration, although they can complete many tasks by learning from a self-centered perspective. In the proposed method, agents focus more on their own subtasks in the earlier phase; although this may not induce cooperative behavior, it is essential for them to execute their own subtasks. Furthermore, the second type of agents cannot have opportunities to learn unless the first type of agents can complete their subtasks sufficiently often. After that phase, agents can gradually learn cooperative behavior by varying the ratio of the rewards. We then demonstrate experimentally that agents can learn well-balanced, robust, and stable behaviors, and can improve the performance from a social viewpoint; that is, although the efficiency of an agent’s own subtask execution may be somewhat reduced, the agent learns to take into account the movements of the other type of agents so that its own effort is not wasted, even in a system with an unbalanced number of the two types of agents. Finally, we summarize this work and describe future research.

2 Related work

There are many studies that aim to extend traditional table-based reinforcement learning from single-agent to multi-agent settings, but this extension is a challenging issue owing to the intricate interrelationships and learning instability, which are due to various factors, such as the mutual effect of simultaneous learning in independent multiple agents, the large number of states (including other agents), the interpretation of joint actions, and coordinated/cooperative tasks that may not bring immediate rewards [6, 25]. Multi-agent reinforcement learning (MARL) also addresses sequential decision-making problems in which agents interact with their surroundings; thus, a credit assignment problem with feedback mechanisms arises in reinforcing appropriate (sequences of) actions for individual agents [2, 5, 7, 8, 26, 28]. For example, under a full system reward, agents’ rewards depend on the joint action with all other agents, but it is difficult for learning agents to discern the impact of their local actions from the impact of all the other agents’ actions [30]. Meanwhile, the difference reward measures each agent’s impact on the system’s common rewards and then divides and allocates the rewards individually based on the measured impacts [1, 28].

With the recent development of DRL, MADRL has been applied to learn complex actions in multi-agent systems, and it has achieved successful results [10, 13, 22, 24]. Shao et al. [24] proposed the policy share algorithm to promote cooperative behavior among agents in the StarCraft game. Palmer et al. [22] proposed the lenient MA-DRN, in which state-action pairs have decaying temperature values that affect the leniencies toward negative policy updates using samples from the experience replay memory, so agents are likely to converge on the optimal joint policy. ElSayed-Aly et al. [9] applied the shielding framework of [3] to multi-agent learning to prevent the exploration of unsafe actions. Beal et al. [4] proposed a deep learning method to learn the long-term (seasonal) tactics of a soccer team by using fluent objectives, and they showed increased seasonal performance through simulations based on real data. Gupta, Egorov, and Kochenderfer [13] extended three single-agent deep reinforcement learning algorithms to multi-agent learning and combined them with curriculum learning methods; using a number of simple problems, they experimentally showed that curriculum learning is necessary to scale MADRL. Miyashita and Sugawara [18] clarified how coordinated structures and behaviors are affected by the agents’ input states fed to DQNs. In contrast, this paper focuses more on the reward allocation method used to learn more sophisticated coordinated behaviors for tasks consisting of sequential subtasks.

Meanwhile, as in the cases involving MARL, the credit assignment problem and the difference reward are crucial for MADRL to learn cooperative and coordinated behaviors in more complicated situations in fully distributed environments [11, 14, 15, 21]. For example, Jiang et al. [15] proposed a novel difference rewards method that can approximate the individual contribution in the collective case. He et al. [14] proposed reward functions that consider not only the individual reward but also a history-dependent reward that is a proportion of the total reward from the previous time. These studies assumed that reasonably divided rewards were allocated to individual agents at or after the time of completion of each cooperative task. However, in our sequential tasks executed by different types of agents, even if an agent gained some reward at the end of the task, it could not determine which part of the sequence of actions it performed in the past was effective, so the effective action and the actions that induced it remained ambiguous. In fact, in our experiments, simply giving rewards at the end of each task was not sufficient to reinforce the valid actions. In addition, because the first agents could not learn the behavior for executing their own subtasks, the second agents hardly had opportunities to execute their subtasks. Conversely, when the first agents received some reward right after performing their own subtasks, they learned effective behavior for their subtasks but tended not to learn the coordinated behavior that facilitates the subsequent subtasks of others.

Our proposed reward scheme, in which rewards are updated according to the execution of subsequent subtasks, addresses, from the perspective of reward assignment, how agents learn to balance the execution of their own subtasks and their coordinated behavior toward other cooperating agents. It can be considered an extension of potential-based difference reward shaping (PBDRS) [8], which builds on the potential-based reward shaping of Ng et al. [20], as used in MARL. However, in our reward shaping, the agent receives (a part of) a constant total reward consisting of the individual reward for its subtask and the contribution reward for the required entire task, and this ratio varies over the learning process to guide the agents, instead of giving additional rewards [20]. Furthermore, because there is a time difference between the completion of an agent’s own subtask and the completion of the task, the task-contribution reward is added with some delay in the replay memory.

3 Problem formulation and models

3.1 Models of environment and agents

We introduce the carry-and-installation problem, a multi-agent cooperative problem that abstracts the tasks in a construction site, where each task consists of two subtasks: a carrying subtask and an installation subtask. A carrying subtask is executed by a carrier agent, whose role is to transport a material, i.e., it loads a material in the area where materials are supplied, carries it to a location where no material has been installed, and unloads it there. Meanwhile, an installation subtask is executed by an installation agent, which looks for a location where a material has been placed by a carrier agent and installs it within the specified limited time. Thus, the materials placed by the carrier agents have an effective time of existence; they become unusable when expired and are removed from the environment. Therefore, before the placed materials are removed, the installation agents must begin executing the installation subtasks that they find; otherwise, the effort of the carrier agent is wasted. The two types of agents move around and continue these behaviors until they have finished installing the materials in the required locations.

Let \(I=\{1, \dots , n\}\) be a set of agents that consists of two types of agents; the disjoint sets of carrier agents and installation agents are denoted by \(I_{\mathit{carr}}\) and \(I_{\mathit{inst}}\), respectively. Therefore, \(I=I_{\mathit{carr}}\cup I_{\mathit{inst}}\). The environment is expressed by a \(G=N\times N\) grid, in which agents can move around. An example environment is shown in Fig. 1, where \(N=20\), black squares are carrier agents, and black pentagons are installation agents. A carrier agent carrying a material is filled with green. Both types of agents can observe the local square region specified by the observable range size V (>0); the blue squares in Fig. 1 show examples of the regions observable by the agents located at their centers, and their side lengths are \(2V+1\). The gray areas indicate the installation areas at which materials must be installed, and the green area in the center indicates the area at which materials are supplied; a carrier agent can therefore load a material there. When a carrier agent unloads and places a material on a gray cell, the cell turns yellow and then gradually changes back to gray until the material expires. A yellow cell changes to white if an installation agent works on it before expiration.
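
To make the cell dynamics concrete, the following is a minimal sketch of the cell states described above; the state names and the Python representation are ours, introduced only for illustration, and are not part of the original model.

```python
# A minimal sketch (names are ours) of the cell states in the grid environment:
# a carrier agent turns an UNINSTALLED cell into a READY cell by unloading a
# material there, and the cell reverts after mu steps unless an installation
# agent performs work on it, which turns it into a COMPLETED cell.
from enum import Enum, auto

class Cell(Enum):
    EMPTY = auto()        # other cells of the grid
    SUPPLY = auto()       # green: material supply area
    UNINSTALLED = auto()  # gray: a material must be installed here
    READY = auto()        # yellow: material placed, usable for mu time steps
    COMPLETED = auto()    # white: installation finished
```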

Figure 1: Example of environment and observable ranges

3.2 Problem formulation

The carry-and-installation problem can be represented by tuple

$$ \langle I_{\mathit{carr}},I_{\mathit{inst}},N,E,\mathcal{S},\mathcal{A}, \mathcal{T}\rangle , $$

where E is the set of all possible states of the entire environment. The state of the entire environment at time t is denoted by \(e_{t}\in E\). At this time, agent \(i\in I\) can observe its local situation \(s_{i,t}\in S_{i}\) in the local area around itself, as specified by V, so it can be considered a subset of \(e_{t}\). \(\mathcal{S}=S_{1}\times \cdots \times S_{n}\) is the product of the possible observations of agents. \(\mathcal{A}\) is the set of joint actions represented by the vectors of agents’ actions \(A_{1}\times \cdots \times A_{n}\), where \(A_{i}\) is the set of the possible actions of agent \(i\in I\). The joint action at t is denoted by \(a_{t}=(a_{1,t}, \dots , a_{n,t})\), where \(a_{t}\in \mathcal{A}\) and \(a_{i,t}\in A_{i}\). We assume that \(A_{i}=\{\mathit{up},\mathit{right}, \mathit{down}, \mathit{left}, \mathit{work}\}\) (for \(\forall i\in I\)), whose elements correspond to movements, except for the work action, whose effect differs depending on the type of agent; the details are described below.
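
As a concrete illustration, the shared action set \(A_{i}\) can be sketched as follows; the coordinate convention for the move offsets is an assumption made only for this illustration.

```python
# A minimal sketch of the action set A_i shared by both agent types; the
# (dx, dy) move offsets assume x grows to the right and y grows downward,
# an illustrative convention that is not fixed by the paper.
from enum import Enum

class Action(Enum):
    UP = (0, -1)
    RIGHT = (1, 0)
    DOWN = (0, 1)
    LEFT = (-1, 0)
    WORK = (0, 0)  # unload/place for carrier agents, install for installation agents
```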

Finally, \(\mathcal{T}\) describes all of the tasks, which are specified by \(({\mathit{SM}}, C)\), where \({\mathit{SM}}=\{\psi _{1}, \psi _{2}, \dots \}\) expresses the set of materials in the supply area, which is the green area in the example environment of Fig. 1, and \(C=\{g_{1}, \dots , g_{m}\}\) denotes the set of cells to be installed, which are expressed by the gray areas in Fig. 1. We identify each installation cell \(g_{k}\in C\) and the task for \(g_{k}\); thus, C is the set of tasks required for completion. We can assume that task \(g_{k}\) consists of sequential subtasks, \(g_{k}=(g^{1}_{k}, g^{2}_{k})\), where \(g^{1}_{k}\) is the carrying subtask and \(g^{2}_{k}\) is the installation subtask using the carried material.

When agents perform a joint action \(a_{t}\in \mathcal{A}\) in environment \(e_{t}\) at time t, the environment transitions to the next state \(e_{t+1}\in E\), and then agent \(i\in I\) receives reward \(r_{i}(e_{t},a_{t})\). Of course, agent i can observe only a part of \(e_{t}\) and only the actions \(a_{j,t}\in A_{j}\) executed by nearby agents \(j\in I\). The method of allocating the reward to the individual agents depends on the type of agent, and the details are described in Sect. 4.2. We use a distributed version of deep Q-networks (DQNs); that is, i has its own network to learn the action values and the associated policy \(\pi _{i} : S_{i} \rightarrow A_{i}\) that maximizes the rewards i receives as the result of its subtask execution and of behaviors that may lead to the completion of a task at cell \(g_{k}\in C\). In particular, we focus on the carrier agents, i.e., on the emergence of cooperative behaviors of the carrier agents that encourage the subtask executions of installation agents, as well as behaviors for the efficiency of their own carrying subtasks.

3.3 Agent model

When an epoch starts, all agents begin to move according to their own policies based on the learned Q-values, i.e., at each time t, agent \(i\in I\) performs action \(a_{i,t}=\pi _{i}(s_{i,t})\in A_{i}\) based on state \(s_{i,t}\) observed in the environment \(e_{t}\). If i chooses a move action, it moves to an adjacent cell, but the effect of the work action depends on the type of agent.

3.3.1 Carrier agent

Recall that the role of carrier agent \(i\in I_{\mathit{carr}}\) is to load a material from the material supply area, carry it, and unload it onto an uninstalled location (cell) so that an installation agent can execute its subtask (installation) using the material. We assume that when i enters the material supply area, i automatically picks up a material \(\psi ^{i}_{k}\) because this is a stand-alone action and is outside the scope of this paper. Then, it looks for an uninstalled cell. Here, we also assume that any carrier agent can carry only one material, and that the materials in the supply area are not exhausted. When i performs the work action on an uninstalled cell at time \(t_{d}\), i unloads and places \(\psi ^{i}_{k}\) on the cell, turning it into a ready-to-install cell; the placed material is denoted by \(\psi ^{i}_{k,t_{d}}\). If \(\psi ^{i}_{k,t_{d}}\) is not used by an installation agent by \(t_{d}+\mu \), it is removed and the installation cell reverts to an uninstalled cell, where μ is the usable time of the material. Note that the work action on other types of cells has no effect; nothing happens.

3.3.2 Installation agent

The role of an installation agent is to execute the installation subtask using a carried material before it is removed. Installation agent \(j\in I_{\mathit{inst}}\) executes the subtask by moving around to find a cell on which material \(\psi ^{i}_{k,t_{d}}\) is placed and performing the work action there using that material. The cell whose task is completed then changes to a completion cell (a white cell in Fig. 1). The work action of an installation agent on any other cell is likewise ignored.

The agents’ actions described above are repeated for H time steps or until all uninstalled cells become completed cells. This sequence of actions is called an epoch, and the integer \(H>0\) is the maximum epoch length. After an epoch, the environment is initialized and the next epoch starts; this is repeated \(F_{e}\) (>0) times, where \(F_{e}\) is the number of epochs. Therefore, in the carry-and-installation problem, the two types of agents learn to complete as many of the required installation tasks as possible in an epoch, while not wasting the carrier agents’ effort of transporting the materials.

4 Proposed learning method

4.1 Belief-integrated input and deep Q-network

A DQN, which usually consists of two deep neural networks called the main network and the target network, approximates the Q-values by taking the state \(s_{i,t}\) observed by agent i at t as the input to these two networks. In reinforcement learning, i learns the Q-function \(Q_{i}(s_{i,t},a_{i,t})\) so that the cumulative rewards, including future rewards, are maximized, where \(a_{i,t}\) is the action selected in state \(s_{i,t}\). To learn the Q-function using a DQN, it is necessary to update the parameter group θ that specifies the network using the received rewards (therefore, we often identify the parameter group θ with the corresponding neural network). Thus, assuming that each agent i has its own individual DQN, its network parameter group \(\theta _{i,t}\) at t is updated to reduce the value of the following loss function \(L_{i,t}(\theta _{i,t})\) by the mean-square error method using its gradient:

$$ L_{i,t}(\theta _{i,t}) = \mathbb{E}_{(s_{i},a_{i},r_{i},s_{i}')} \Bigl[ \Bigl(r_{i} + \gamma Q_{i}\bigl(s'_{i}, \mathop{\operatorname{argmax}}_{a'_{i}}Q_{i}\bigl(s'_{i},a'_{i};\theta _{i,t}\bigr);\theta _{i,t}^{-}\bigr) - Q_{i}(s_{i},a_{i};\theta _{i,t}) \Bigr)^{2}\Bigr], $$

where \(r_{i}\) is the received reward and \(\gamma \in [0,1)\) is the discount rate. This loss function follows the double DQN in [29], which uses both the main network \(\theta _{i,t}\) and the target network \(\theta _{i,t}^{-}\). Note that \(\theta _{i,t}^{-}\) is the parameter group obtained before the update of \(\theta _{i,t}\). Therefore, the agent’s action is determined by the main network \(\theta _{i,t}\), and this network is updated every η time steps using the action-value function based on the target network \(\theta _{i,t}^{-}\). Then, \(\theta _{i,t}^{-}\) is copied from \(\theta _{i,t}\) at every epoch.
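
As an illustration of this update, the following is a minimal PyTorch sketch (not the authors’ implementation) of the double-DQN target used in the loss above: the main network selects the greedy next action and the target network evaluates it. The batch layout and the use of `mse_loss` are assumptions made for the sketch.

```python
# A minimal sketch of the double-DQN loss above; the tensors in `batch` are
# assumed to be pre-collated as (states, actions, rewards, next_states).
import torch
import torch.nn.functional as F

def double_dqn_loss(main_net, target_net, batch, gamma=0.95):
    s, a, r, s_next = batch
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q_i(s, a; theta)
    with torch.no_grad():
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)     # argmax_a' Q_i(s', a'; theta)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # Q_i(s', a*; theta^-)
        target = r + gamma * q_next
    return F.mse_loss(q_sa, target)  # mean-square error of the TD target
```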

In general, agent i decides its behavior based on the observed state \(s_{i,t}\) at time t, but in this study, we assume that i decides its behavior based on the information view \(v_{i,t}\), which is a combination of the observed information \(s_{i,t}\) of agent i at time t, and the internal beliefs of i as additional information. Therefore, we define the Q-function \(Q_{i}\) and the policy \(\pi _{i}\) of i as in the following equations [18].

$$\begin{aligned} &Q_{i}: \mathcal{V}_{i}\times A_{i} \longrightarrow \mathbb{R},\quad \textrm{and} \end{aligned}$$
(1)
$$\begin{aligned} &\pi _{i}: \mathcal{V}_{i} \longrightarrow A_{i}, \end{aligned}$$
(2)

where \(\mathcal{V}_{i}\) denotes the set of all possible views for i. The details of \(v_{i,t}\) are described in Sect. 4.4.

4.2 Two-stage reward allocation

Because the reinforcement learning of an agent’s behavior focuses on maximizing the cumulative reward it receives, the design of the reward allocation to the agent is an important factor for appropriate behaviors, including cooperation and coordination. In particular, because our problem requires the sequential execution of subtasks, the carrier agent that executes the first subtask and the installation agent that executes the next subtask need to learn not only how to process their own subtasks, but also cooperative behaviors, such as how to help other agents and how to respond to help provided by other agents. For this purpose, we propose a two-stage reward allocation method with delay that allocates rewards both for the agent’s own subtask execution and for the actual task completion, so as to achieve both types of learning. We then investigate the learning speed of the agents’ subtask-processing behavior, the potential for the emergence of cooperative behaviors, and their characteristics by changing the allocation ratio of these rewards. Note that there is a time difference between the completion of an agent’s own subtask and the completion of a task accomplished by two agents; thus, the reward allocation is divided into two stages with some time difference. Therefore, in our method, each agent receives a fixed reward upon the successful completion of a task, and we consider when to give (part of) that reward, rather than how to distribute the total reward among agents, as in the credit assignment problem.

First, in our problem, carrier agent i receives a reward called the individual reward, \(r_{i}^{1}(t_{d})\) (≥0), upon the completion of the first subtask \(g^{1}_{k}\) at time \(t_{d}\). Then, when the corresponding task is finished, that is, when installation agent j completes the subsequent subtask \(g^{2}_{k}\) at time \(t_{e}\) (\(\leq t_{d}+ \mu \)), agent i separately receives a reward called the contribution reward, \(r_{i}^{2}(t_{e})\). In other words, when the subtask executed by i and the entire process for executing the task, including i’s subtask, are completed within the period specified by μ, i finally receives the reward \(r_{i}^{1}(t_{d})+r_{i}^{2}(t_{e})\) and reinforces the actions taken up to \(t_{d}\). However, when the task associated with the subtask executed by i is not completed, i receives only the reward \(r_{i}^{1}(t_{d})\). Meanwhile, installation agent j receives its reward only once; we can regard j as receiving the two-stage reward at the same time because the completion of task \(g_{k}\) coincides with the completion of j’s subtask \(g^{2}_{k}\).

We fix \(R=r_{i}^{1}(t_{d})+r_{i}^{2}(t_{e})\) and explore the variation of the learned behavior when the ratio of \(r_{i}^{1}(t_{d})\) to \(r_{i}^{2}(t_{e})\) is different. However, if this ratio remains extremely unbalanced, the optimal rewarding behavior of agents may change. Therefore, we propose to vary this ratio dynamically over the learning process. In particular, we propose the gradually decayed reward (GDR) method, where \(r_{i}^{1}(t_{d})\) is decreased as the learning progresses. More specifically, we decrease individual reward \(r_{i}^{1}\) (\(= r_{i}^{1,h}\)) in the h-th epoch by defining

$$ r_{i}^{1,h} = \max \bigl(r_{a} - \delta _{r}\cdot \lfloor {h}/{F_{r}} \rfloor , 0\bigr), $$
(3)

where \(F_{r}\) is a positive integer that decides the decreasing speed, \(r_{a}\) is the initial individual reward, and \(\delta _{r}\) is the decay reward rate for GDR. Hence, \(r_{i}^{2,h}=R-r_{i}^{1,h}\). Note that if there is no confusion, \(r_{i}^{1,h}\) and \(r_{i}^{2,h}\) may simply be written as \(r_{i}^{1}\) and \(r_{i}^{2}\). By employing GDR, we expect that the carrier agents will learn their own subtask-processing behavior in the early stages of learning and, as learning progresses, will be encouraged to learn cooperative behavior that facilitates task completion, enabling them to effectively learn both their own subtask-processing behavior and cooperative behavior for task completion. Reward \(r_{i}^{1,h}\) gradually decreases to 0 (so \(r_{i}^{2,h}\) increases to R), and the reward is then given only when task \(g_{k}\) is completed; this means that the reward will be given only when the cooperative sequential task is completed, which is consistent with the original purpose of our problem. Generally, the carrier agent executing the first subtask should not be rewarded unless the task is completed, but we set \(r_{i}^{1,h}>0\) only for the first half of the learning phase to promote the execution of the carrier agent’s subtask. In the second half and in the testing phase, \(r_{i}^{1,h} = 0\) and \(r_{i}^{2,h}=R\), which is the same as the intuitive global reward; thus, the reward is given only when the task is completed. In our problem setting, if the carrier agents cannot learn their subtasks, the installation agents will not have chances to learn either, so we initially set \(r_{i}^{1,h}>0\) to encourage the first type of agents to learn. Eventually, however, \(r_{i}^{1,h} = 0\) and the reward is consistent with the purpose of the problem. Note that the reward scheme in which \(r_{i}^{1}\) is fixed is called the fixed-ratio reward (FRR). Note also that, instead of Eq. (3), we can define the decay more generally as \(r_{i}^{1,h} = f_{r}(h)\), where \(f_{r}(h)\leq R\) is a monotonically decreasing function that satisfies the condition \(\exists h_{0} < F_{e}\) s.t. \(f_{r}(h_{0})=0\).
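
To make Eq. (3) concrete, the following is a minimal sketch of the GDR schedule; the parameter values (R, \(r_{a}\), \(\delta _{r}\), \(F_{r}\)) below are illustrative examples only, not the values used in our experiments (see Tables 2 and 3).

```python
# A minimal sketch of Eq. (3); r_a, delta_r, F_r, and R are illustrative values.
def individual_reward(h, r_a=0.5, delta_r=0.1, F_r=1000):
    """Individual reward r_i^{1,h} of a carrier agent in the h-th epoch (GDR)."""
    return max(r_a - delta_r * (h // F_r), 0.0)

def contribution_reward(h, R=1.0, **kwargs):
    """Contribution reward r_i^{2,h} = R - r_i^{1,h}."""
    return R - individual_reward(h, **kwargs)

# With these example values, r_i^{1,h} reaches 0 (and r_i^{2,h} reaches R) at epoch 5000.
assert individual_reward(5000) == 0.0
```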

4.3 Experience replay for two-stage reward allocation

Because the two-stage reward allocation method delivers rewards at different timings, we extend the buffered experience replay to adapt this reward scheme to training the DQNs. Experience replay is a method that randomly samples past experiences to eliminate over-learning caused by biased correlations. In a multi-agent system, each agent learns its behavior independently, so it must learn in a non-stationary environment that results primarily from simultaneous learning by the other agents, and experience replay is effective at eliminating the correlation between experiences to some degree. We extend the experience replay to accommodate the fact that a certain portion of the reward is received late and depends on the subsequent actions of other agents.

We denote the replay memory of agent i at time t as \(D_{i,t}\). To prevent an experience whose reward has not yet been determined from being used for learning, when carrier agent i completes its subtask \(g^{1}_{k}\) at time t, it temporarily stores the experience \(c_{i,t} = (s_{i,t},a_{i,t},r_{i,t}, s_{i,t+1})\) in the temporary waiting queue \(P_{i,t}\) instead of \(D_{i,t}\). Here, the maximum size of \(P_{i,t}\) is defined as \(M_{P}\) (\(\geq \mu >0\)), and when the temporary waiting queue exceeds this size (\(|P_{i,t}| > M_{P}\)), the data at the head of the queue are moved to \(D_{i,t}\). Most of the per-action rewards of i are 0, but when i places material \(\psi _{i,t_{d}}\) as its own subtask execution at \(t_{d}\), it receives a reward \(r_{i,t_{d}}=r_{i}^{1,h}(t_{d})\geq 0\) and adds the corresponding \(c_{i,t_{d}}\) to \(P_{i,t_{d}}\). Then, i moves on to the next carrying subtask. When installation agent j executes its subtask using \(\psi _{i,t_{d}}\) at \(t_{e}\) and the task is completed, i receives another reward \(r_{i}^{2,h}(t_{e})\) and changes the reward of \(c_{i,t_{d}}\) in \(P_{i,t_{e}}\) to \(r_{i,t_{d}}=r_{i}^{1,h}(t_{d}) + r_{i}^{2,h}(t_{e})\). Because there is a time limit μ on the use of the placed material \(\psi _{i,t_{d}}\), \(t_{e}\) must satisfy \(t_{d} + \mu \geq t_{e} > t_{d}\).

Meanwhile, if the placed material \(\psi _{i,t_{d}}\) is not used by \(t_{d} + \mu \), the reward \(r_{i,t_{d}}\) (\(=r_{i}^{1,h}(t_{d})\)) of the experience data \(c_{i,t_{d}}\) is not changed, and \(c_{i,t_{d}}\) is moved directly to \(D_{i,t_{d}+M_{P}+1}\). Because installation agent j receives two rewards simultaneously, it does not need to use the temporary waiting queue, and adds its experiences to its local replay memory \(D_{j,t}\).
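
The following is a minimal sketch (not the authors’ code) of this extended experience replay for a carrier agent, under the reading that the agent’s experiences pass through the waiting queue for \(M_{P}\) steps before entering the replay memory (consistent with the definition of \(D_{i,t}\) below); the material `key` used to match a pending experience with its later contribution reward is a hypothetical identifier introduced only for illustration.

```python
# A minimal sketch of the two-stage experience replay: experiences whose reward
# may still change wait in the temporary queue P_{i,t} before being moved to the
# replay memory D_{i,t}; `key` identifies the placed material (None otherwise).
import random
from collections import deque

class TwoStageReplay:
    def __init__(self, M_P, M_D):
        self.pending = deque()            # temporary waiting queue P_{i,t}
        self.memory = deque(maxlen=M_D)   # replay memory D_{i,t}
        self.M_P = M_P                    # M_P >= mu, the usable time of materials

    def push(self, s, a, r, s_next, key=None):
        """Add the experience of the current step; `key` marks a material placement."""
        self.pending.append([key, s, a, r, s_next])
        if len(self.pending) > self.M_P:  # oldest experience: its reward is now final
            self.memory.append(tuple(self.pending.popleft()[1:]))

    def add_contribution_reward(self, key, r2):
        """Add r_i^2(t_e) to the pending placement experience of material `key`."""
        for exp in self.pending:
            if exp[0] == key:
                exp[3] += r2              # becomes r_i^1(t_d) + r_i^2(t_e)
                break

    def sample(self, U_size):
        """Random mini-batch U(D_{i,t}) drawn from the replay memory."""
        return random.sample(list(self.memory), min(U_size, len(self.memory)))
```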

Therefore, the replay memory of carrier agent i at time t is expressed by

$$ D_{i,t}=\{c_{i,t-{M_{P}}-{M_{D}}},\ldots ,c_{i,t-{M_{P}}-1}\}, $$

where \(M_{D}\) (>0) represents the storage capacity of the replay memory. From the experience stored in this replay memory \(D_{i,t}\), agent i generates a mini-batch \(U(D_{i,t})\) consisting of \(U_{\mathit{size}}\) (>0) data by random sampling from \(D_{i,t}\), and updates the neural network parameter \(\theta _{i,t}\) to reduce the loss function \(L_{i,t}(\theta _{i,t})\) which is defined by

$$ L_{i,t}(\theta _{i,t}) = \mathbb{E}_{(s_{i},a_{i},r_{i},s_{i}')\sim U(D_{i,t})} \Bigl[ \Bigl(r_{i} + \gamma Q_{i}\bigl(s'_{i}, \mathop{\operatorname{argmax}}_{a'_{i}}Q_{i}\bigl(s'_{i},a'_{i};\theta _{i,t}\bigr);\theta _{i,t}^{-}\bigr) - Q_{i}(s_{i},a_{i};\theta _{i,t}) \Bigr)^{2}\Bigr]. $$

To minimize the loss function \(L_{i,t}(\theta _{i,t})\), we calculate its gradient \(\nabla L_{i,t}(\theta _{i,t})\) and update the parameter group \(\theta _{i,t}\) using RMSprop [27]. Note that the parameters are updated every η time steps, as mentioned above.
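
A minimal sketch of one such update is shown below, assuming the `double_dqn_loss` and `TwoStageReplay` sketches given earlier; `collate_batch` (which turns sampled tuples into tensors), the batch size, and the learning rate are illustrative assumptions.

```python
# A minimal sketch of one parameter update of theta_{i,t} with RMSprop.
import torch

def train_step(main_net, target_net, optimizer, replay, collate_batch,
               U_size=32, gamma=0.95):
    batch = collate_batch(replay.sample(U_size))   # mini-batch U(D_{i,t})
    loss = double_dqn_loss(main_net, target_net, batch, gamma)
    optimizer.zero_grad()
    loss.backward()                                # gradient of L_{i,t}(theta_{i,t})
    optimizer.step()                               # RMSprop update of the main network
    return loss.item()

# The optimizer would be created once per agent, e.g.:
# optimizer = torch.optim.RMSprop(main_net.parameters(), lr=2.5e-4)
```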

4.4 Observed states with beliefs

Agent i recognizes state \(s_{i,t}\in S_{i}\) at t by observing its local area specified by observable range size V and centered on itself. State \(s_{i,t}\) includes cells, materials, and other agents (with or without material possession) in the observable range, and is expressed as multiple matrices of \((2V+1)\times (2V+1)\).

Regardless of the type of agent, i forms view \(v_{i,t}\) from \(s_{i,t}\) and its internal information, and uses it as the input fed to i’s DQN, as in [18]. The input view \(v_{i,t}\) is represented by seven matrices of five types whose elements range in value between −1 and 1 to balance the weights of their values. Using the example situation in Fig. 1, the representations of \(v_{i,t}\) for the carrier agent and the installation agent are shown in Figs. 2 and 3, respectively. We assume that the agent has internal information about the shape and size of the environment and its own position in the environment (for example, using GPS or based on the trajectories of past actions). Then, agents generate the \(N\times N\) matrices of Figs. 2(e) and 3(e), which represent their own positions in the environment. The remaining four types of matrices reflect the observed state \(s_{i,t}\), and all matrices except the second-type matrices (Figs. 2(b) and 3(b)) are generated in the same way regardless of the agent type.

Figure 2: Structure of input to carrier agents

Figure 3: Structure of input to installation agents

Figures 2(a) and 3(a) indicate the locations of uninstalled cells in the local areas, and correspond to the first type of matrices, whose elements are 1 for the cells represented in gray and 0 for the others. Note that the blue cells in Fig. 3 show the outside of the environment that the agent cannot observe, and are always described in the matrices as −1. The third type of matrices, which correspond to Figs. 2(c) and 3(c), describe the materials placed by the carrier agents. The color depth of a yellow cell representing a material in these figures shows the time that has elapsed since it was placed, gradually changing to gray. In the matrices at t, the element for material \(\psi _{i,t_{d}}\) placed at \(t_{d}\) is \(\max((t_{d}+\mu -t)/\mu , 0)\), i.e., the remaining fraction of its usable time, which decreases as time elapses.

The fourth type of matrices specify the positions of the other agents in their observable ranges, and are illustrated in Figs. 2(d) and 3(d). Here, agents cannot identify the types of other agents, but IDs are uniquely assigned to all agents. The IDs are represented by multiple matrices; for example, if there are fewer than 27 agents, each ID is represented by three digits \((b_{1},b_{2},b_{3})\) (where \(b_{k}\in \{0,1,-1\}\) and ∃k s.t. \(b_{k}\neq0\)) and \(b_{k}\) is the element of the kth matrix at the agent’s location. \((0,0,0)\) indicates that no agent exists at the corresponding cell. Finally, Figs. 2(b) and 3(b) represent the positions of the materials being carried by carrier agents; the element of the second type of matrix is 1 at the position of an agent carrying a material, and 0 otherwise. Therefore, when the carrier agent itself has a material, the center of its matrix is 1 (see Fig. 2(b)). Furthermore, only the matrix of the carrier agent describes the material supply area in the observable range as 1 to locate it.

In addition, we assume that agent i retains its trajectory and adds the visited locations to the matrix indicating its own position, shown in Figs. 2(e) and 3(e). In this representation, the current position of i is expressed as 1, and the position visited k steps earlier is expressed as \(1\times \beta ^{k}\) (\(= \beta ^{k}\)), where \(\beta \in [0,1)\) is the decay rate of the trajectory data. In addition, if \(\beta ^{k} < \delta _{t}\), the corresponding cells are not included in the trajectory data and thus do not appear in the matrix, where \(\delta _{t}\) is the threshold used to decide the length of the trajectory to be reflected. If i moves to a cell on which its past trajectory remains, only the larger value for that cell is described in the matrix.
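
The trajectory channel can be sketched as follows; the numpy representation and the values of β and \(\delta _{t}\) are illustrative assumptions.

```python
# A minimal sketch of the decaying trajectory channel: the current cell holds 1,
# a cell visited k steps ago holds beta**k, values below delta_t are dropped,
# and revisited cells keep only the larger value.
import numpy as np

def trajectory_matrix(trajectory, N, beta=0.8, delta_t=0.05):
    """trajectory: list of (x, y) grid cells visited by the agent, most recent last."""
    mat = np.zeros((N, N))
    for k, (x, y) in enumerate(reversed(trajectory)):
        v = beta ** k
        if v < delta_t:
            break                        # older positions are not reflected
        mat[x, y] = max(mat[x, y], v)    # keep the larger value on revisited cells
    return mat
```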

4.5 Network structure

The network structure of the DQN in each agent is the combination of convolutional layers and max pooling layers, and three fully connected network layers (FCN layers), as listed in Table 1. The network is fed with \(M\times M\times 6\) data (Cnv-1.1 in Table 1) and \(N\times N\times 1\) data (Cnv-1.2 in Table 1), where \(M = 2V+1\) is the side length of the observable range. Then, the two separated networks of convolutional and max-pooling layers are merged in an FCN layer (FCN-1 in Table 1), and the output of the last FCN layer is the Q-values of the agent’s actions. Thus, the action with the maximum Q-value is selected.

Table 1 Structure of network
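
A minimal PyTorch sketch of this two-branch structure is shown below; the kernel sizes, channel counts, and hidden-layer widths are illustrative assumptions, not the exact values of Table 1.

```python
# A minimal sketch of the two-branch DQN: one convolution/max-pooling branch for
# the M x M x 6 local view (Cnv-1.1), one for the N x N x 1 position map
# (Cnv-1.2), merged into three FCN layers whose output is the action Q-values.
import torch
import torch.nn as nn

class AgentDQN(nn.Module):
    def __init__(self, M, N, n_actions=5):
        super().__init__()
        self.local_branch = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten())
        self.global_branch = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten())
        local_dim = 16 * (M // 2) ** 2
        global_dim = 8 * (N // 2) ** 2
        self.fcn = nn.Sequential(                       # FCN-1 to FCN-3
            nn.Linear(local_dim + global_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, local_view, position_map):
        h = torch.cat([self.local_branch(local_view),
                       self.global_branch(position_map)], dim=1)
        return self.fcn(h)   # Q-values; the action with the maximum value is chosen
```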

For the agent’s policy to decide actual actions, we adopt the ε-greedy strategy in which agent i acts randomly with probability \(\varepsilon _{i,t}\) and decides its action by the output of DQN with probability \(1-\varepsilon _{i,t}\) at time t. To obtain various experiences with the associated rewards by acting randomly, especially in the early stage of learning, i sets \(\varepsilon _{i}\) (\(=\varepsilon _{i,0}\)) as the initial value near 1, and decreases it exponentially; specifically, \(\varepsilon _{i,t}\) is updated as

$$ \varepsilon _{i,t} = \max \{\varepsilon _{i,t-1} * \gamma _{ \varepsilon }, \varepsilon _{l}\}, $$

where \(\varepsilon _{l}\geq 0\) is the lower bound of \(\varepsilon _{i,t}\), and \(0\leq \gamma _{\varepsilon } < 1\) is the decay rate of \(\varepsilon _{i,t}\) for random selection in the ε-greedy strategy. Note that the final \(\varepsilon _{i,t}\) of an epoch is used as the initial value in the next epoch.
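
For completeness, the following is a minimal sketch of this ε-greedy policy with exponential decay; the numerical values of \(\gamma _{\varepsilon }\) and \(\varepsilon _{l}\) are illustrative only.

```python
# A minimal sketch of epsilon-greedy action selection and the exponential decay
# epsilon_{i,t} = max(epsilon_{i,t-1} * gamma_eps, epsilon_l).
import random

def select_action(q_values, epsilon):
    """q_values: per-action Q-values output by the agent's DQN."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

def decay_epsilon(epsilon, gamma_eps=0.999, epsilon_l=0.05):
    return max(epsilon * gamma_eps, epsilon_l)
```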

5 Experiments and discussion

5.1 Experimental environment and setting

We experimentally investigated how the performance, i.e., the total numbers of completed tasks and unused materials, improved over time, and we analyzed how the value of the individual reward \(r_{i}^{1}(t)=r_{i}^{1}\) is incorporated into the behaviors of agents. As previously mentioned, we fix the total reward \(R=r_{i}^{1}(t_{d})+r_{i}^{2}(t_{e})\) for carrier agents, but the ratio of \(r_{i}^{1}(t_{d})\) (\(=r_{i}^{1,h}(t_{d})\)) is changed at every epoch in the GDR scheme. Our experimental environment is shown in Fig. 1. At the beginning of each epoch, a material supply area is placed in the center of the environment, and agents are randomly placed within the \(6 \times 6\) cell region bounded by the black dotted line, including the supply area, as shown in Fig. 1. Then, K (>0) installation areas, each consisting of \(3 \times 3\) cells, are randomly placed outside the purple dotted line such that they do not overlap, in order to maintain some distance between the supply area and the installation areas, where K is an integer. We can consider various ways to decay the reward \(r_{i}^{1}\), such as linear or exponential decay, as well as various schedules for the ε used in the ε-greedy algorithm. We conducted several preliminary experiments and chose a linear decay of the reward and an exponential decay of ε to maximize the agents’ rewards.

In order to investigate the effect of the reward scheme on the cooperation of agents, we conducted the first experiment (Exp. 1) using two scenarios: in the first, \(r_{i}^{1}(t_{d})\) and \(r_{i}^{2}(t_{e})\) (\(=R-r_{i}^{1}(t_{d})\)) were fixed under the FRR scheme, and in the second, the GDR scheme was adopted. The experimental parameter values are listed in Tables 2 and 3. We set the number of carrier agents to \(n_{C} = 8\) and the number of installation agents to \(n_{I} =4\) in all experiments. The difference between \(n_{C}\) and \(n_{I}\) is introduced by considering the cost of the carrier agents’ reciprocation between the supply area and the installation areas. We also conducted a second experiment (Exp. 2) in an environment with \(n_{I}=2\) in order to determine how well the coordinated behavior learned by agents works for extremely unbalanced agent numbers. The data shown below are derived from three experimental runs.

Table 2 Learning parameters
Table 3 Parameters and values in experiments

5.2 Performance comparison

First, we investigated the effect of the reward assignment methods on the performance in Exp. 1. We set the individual reward \(r_{i}^{1}(t_{d})\) to 0, 0.1, 0.3, or 0.5 under the FRR scheme. Note that we consider the situations using the FRR with \(r_{i}^{1}(t_{d})=0\) and 0.5 as the baseline performance because such divisions of rewards are straightforward and intuitive. Figure 4 shows the rate of task completion, where each plot is the moving average over every 50 epochs. All reward assignment schemes, excluding the case of \(r_{i}^{1}(t_{d}) = 0\), improved the performance. For \(r_{i}^{1}(t_{d}) = 0\), no agent could learn the behaviors for its own subtasks or for coordination; thus, the rate of task completion was not improved. The converged rate of task completion when \(r_{i}^{1}(t_{d}) = 0.5\) was nearly identical to the case of \(r_{i}^{1}(t_{d}) = 0.3\) under the FRR, but the learning speed with \(r_{i}^{1}(t_{d}) = 0.5\) was slightly higher. The rate of task completion under GDR converged to a rate between those under the FRR with \(r_{i}^{1}=0.3\) and \(r_{i}^{1}=0.1\). It should be noted that as the maximum epoch length H increases, the task completion rate approaches 1.0 in Fig. 4. We used a slightly shorter maximum epoch length (\(H= 600\)) to observe the difference between the methods and to provide more opportunities for learning to all agents, because there are many uninstalled cells in the environment.

Figure 4: Rate of task completion

However, when \(r_{i}^{1}(t_{d})\) in the FRR scheme is high, the optimal behavior of the carrier agents is likely to change. To verify this, we plotted the usage rate of the materials carried to the cells, i.e., the ratio of the number of subtasks performed by the installation agents to the number of subtasks performed by the carrier agents, in Fig. 5, where each plot is also the moving average over every 50 epochs. Contrary to the result shown in Fig. 4, Fig. 5 indicates that the usage rate is the highest when \(r_{i}^{1}=0.1\) and decreases as \(r_{i}^{1}\) increases under the FRR scheme. This high material usage rate may derive from certain behavior emerging in either or both types of agents to utilize the placed materials without wasting them. When GDR is applied, the usage rate converges faster and to a higher value than under the FRR with \(r_{i}^{1}=0.1\). The behaviors of the carrier and/or installation agents that reduced the waste of materials are analyzed in the next section. The low usage rate under the FRR with \(r_{i}^{1}=0\) is believed to be because the carrier agents could not learn the behavior for their own subtasks; thus, the installation agents could not learn either.

Figure 5: Usage rate of placed materials

Because we used a shorter maximum epoch length (\(H=600\)), Fig. 4 indicates that agents under the GDR scheme or the FRR scheme with \(r_{i}^{1}=0.1\) took a slightly longer time to complete all tasks. In contrast, Fig. 5 shows the usage rate of the carried and placed materials, which does not change much even if H is set longer.

5.3 Analysis of material carrying and placement behavior

The results of Exp. 1 in Sect. 5.2 indicate that there is a difference in the final efficiency depending on the value of the individual reward \(r_{i}^{1}\). This section analyzes the agents’ behavior and discusses the underlying factors that caused the difference. First, the numbers of used and unused (and thus wasted) materials placed by carrier agents under each reward scheme are shown as stacked line graphs in Fig. 6, where the materials used by installation agents are shown in green and those that were not used are shown in blue. We can see from Fig. 6 that the largest number of materials was placed by the carrier agents when \(r_{i}^{1}=0.5\), but many of those placed materials were not used by any installation agent and were thus wasted. Furthermore, under the FRR scheme, the total number of placed materials decreases as the value of \(r_{i}^{1}\) decreases, and the number of unused materials decreases as well. The waste is most pronounced when \(r_{i}^{1}=0.5\) (Fig. 6(d)) because the carrier agents could obtain sufficient reward from executing their own subtasks, regardless of their cooperative behavior toward the installation agents; therefore, a better strategy for the carrier agents was to not consider the installation agents.

Figure 6: Number of used and unused (wasted) materials during the course of learning

Conversely, under the FRR scheme, when \(r_{i}^{1}\) (>0) is small, the carrier agents can gain more rewards as a result of the completion of the installation agents’ subtasks. Therefore, as shown in Fig. 6(a), although the total number of placed materials is relatively low, the amount of wasted material is quite small. This indicates that the agents, especially the carrier agents, behaved in some coordinated or cooperative manner to encourage the installation agents to use the materials they placed. Meanwhile, when the GDR was used (Fig. 6(e)), not only did the learning converge faster, but the number of unused materials was also smaller than when \(r_{i}^{1}=0.1\) under the FRR scheme (Fig. 6(b)). Moreover, because \(r_{i}^{1,h}\) converged to 0 while the remaining reward \(r_{i}^{2,h}\) (→R) was given for the actions of material placement under the GDR scheme, the carrier agents ultimately focused only on completing the task, rather than on their own subtasks. This discussion indicates that the cooperative behavior also appeared earlier as a result of the faster learning, and was then maintained.

5.4 Coordination among agents

To investigate cooperative or coordinated behavior among agents of the same type in Exp. 1, we examined the locations at which the carrier agents executed their subtasks (i.e., placement of materials) in the environment. We then checked the surroundings when the materials were placed in order to explore the relationships between heterogeneous agents.

5.4.1 Coordination between carrier agents

First, we surveyed the locations where materials were placed between 10,000 and 13,000 epochs, and the total number of placed materials per agent is shown as heatmaps in Figs. 7(a) (under the GDR scheme) and 7(b) (under the FRR scheme with \(r_{i}^{1}=0.5\)), where the darker red cell indicates that the number of subtasks processed there was larger, and its color turns closer to white as the number of subtasks at the cell decreased. These heatmaps were generated using one experimental run, but the data of other runs also exhibited similar tendencies.

Figure 7: Number of materials placed by carrier agents, by cell

Both Figs. 7(a) and 7(b) show that the carrier agents autonomously determined their unloading areas so as to reduce the overlap of the areas for their subtask executions; this simplifies the movement of the carrier agents and also forms a cooperative organizational division of labor that avoids competition among the agents in executing the subtasks. Note that the colors of the cells in Fig. 7(b) are darker because many materials were placed, as shown in Fig. 6, but most of them were unused and thus wasted. Furthermore, although the division of labor with respect to the work location is apparent under both the FRR and GDR schemes, Fig. 7(a) indicates that the resulting organization was a mixture of agents who work over a relatively wide area (e.g., Agent 2 and Agent 8 in Fig. 7(a)) and agents who concentrate their work in specific areas (e.g., Agent 1 and Agent 5 in Fig. 7(a)). We also checked the locations where installation agents executed their subtasks and found a similar tendency, i.e., a division of labor based on location; we omit this figure because it is similar to Fig. 7.

5.4.2 Coordination between carrier and installation agents

We investigated the situations in which carrier agents placed materials to identify the coordinated behavior between the different types of agents. For this purpose, whenever a carrier agent placed a material, we recorded the number of installation agents within its observable range and the (Manhattan) distance to the closest installation agent. These data indicate the degree to which the carrier agents paid attention to nearby installation agents when executing their subtasks. Figure 8 is a cumulative bar graph of the number of installation agents in the observable range when a carrier agent placed a material, and it also shows whether the material placed at that time was used or not. Note that this figure is the sum of the experimental results from \(10{,}000\) to \(13{,}000\) epochs; the white bars represent the number of materials removed, and the yellow bars represent those used by the installation agents.

Figure 8: Number of installation agents in observable range

Figure 8(c) indicates that under the FRR scheme with \(r_{i}^{1}=0.5\), the carrier agents frequently placed materials even when there were no installation agents in the observable range, and those materials were rarely used; this is wasteful and undesirable. However, if an installation agent existed in the observable range at the time of placement, the material was likely to be used even under this reward scheme. This means that the installation agents learned to move to the locations where the carrier agents were going to place materials. Under the FRR scheme, as the individual reward \(r_{i}^{1}\) decreased, there was a rapid decrease in the number of materials placed when there were no installation agents in the observable range (Fig. 8(a) and Fig. 8(b)). Because the number of agents and the other experimental settings are the same, we can consider that the carrier agents with small \(r_{i}^{1}\) learned to postpone their unloading executions (i.e., work) until an installation agent came close.

The carrier agents’ selfish behavior of not confirming the presence of nearby installation agents was further reduced under the GDR scheme (Fig. 8(d)). Note that because the reward scheme for the installation agents is the same regardless of the experimental setting, it is likely that they could have learned similar behaviors if they had sufficient opportunity to learn.

To further analyze the cooperative behavior of the carrier agents, we recorded the shortest distance to an installation agent at the time of each material placement between 10,000 and 13,000 epochs. Table 4 lists, for each shortest distance, the percentage (%) of material placements and the completion rate of installations using those materials, where BOR stands for beyond the observable range. Note that because \(V=3\), an agent cannot observe other agents that are more than six cells away, and there are also a few positions at distances between four and six cells at which it cannot observe other agents. As a reference, we show the completion rate under the FRR scheme with \(r_{i}^{1}=0.0\), although the absolute number of material placements was extremely low (roughly 1/500 to 1/200 of the other cases); thus, the installation agents had only a few opportunities to learn.

Table 4 Distance to the installation agents and associated task completion rate in Exp. 1

First, Table 4 shows that the completion rate of tasks is not strongly related to whether GDR or the FRR with various \(r_{i}^{1}\) is used, but rather depends on the shortest distance to an installation agent. In particular, because the usable time of placed materials was set to \(\mu = 6\), the completion rate is almost zero (less than 0.01) when the shortest distance is greater than 6, or when there is no installation agent within the observable range. This indicates that the installation agents had learned to almost the same extent after 10,000 epochs, except under the FRR scheme with \(r_{i}^{1}=0\) due to insufficient learning (not shown).

Based on the rates of material placement in Table 4, we also see that as \(r_{i}^{1}\) increased, the carrier agents tended to place materials more frequently even if no installation agent was nearby. This indicates that under the FRR scheme with \(r_{i}^{1}=0.1\), a carrier agent is likely to wait for an installation agent to approach before executing its own subtask so as to increase the completion rate, whereas with a larger \(r_{i}^{1}\), the carrier agents were indifferent to their surroundings when they arrived at the locations where materials should be placed.

Table 4 indicates that this tendency was even more apparent under the GDR scheme, which suggests that firm coordinated behavior emerged in the carrier agents. This also reduced the number of unused materials, which is a preferred feature of our target problem and can also reduce energy consumption. Meanwhile, waiting for another agent to come close is disadvantageous for the carrier agents in the sense that they cannot increase the number of subtasks they execute. In fact, under the FRR scheme with \(r_{i}^{1}=0.5\), the carrier agents executed a very large number of subtasks and gained more rewards. This also increased the learning opportunities for the installation agents, although there were many wasted materials. In addition, the completion rate increased slightly as \(r_{i}^{1}\) increased (Table 4); for example, when the shortest distance is 5, it increased as 0.32 (\(r_{i}^{1}=0.1\)), 0.38 (\(r_{i}^{1}=0.3\)), and 0.45 (\(r_{i}^{1}=0.5\)), and it was 0.30 under the GDR scheme. This indicates that the installation agents had more chances to be trained for the cases in which the shortest distance is relatively long. Under the GDR scheme, however, the carrier agents placed the materials when an installation agent came close; this coordinated behavior could compensate for the lower completion rate, resulting in fewer unused materials.

5.5 Coordination between unbalanced heterogeneous agents

Finally, in Exp. 2, we checked whether the agents under the GDR scheme could learn and maintain the same or similar coordinated behaviors between heterogeneous agents when the number of installation agents was reduced to two (\(n_{I}=2\)). The purpose of Exp. 2 is to verify whether the coordinated behavior learned under the GDR scheme is firm and stable, i.e., whether the carrier agents can wait for the approach or arrival of the installation agents for a longer period of time owing to the smaller number of installation agents. Naturally, such a long wait may reduce the rewards of the carrier agents; thus, without learning firm cooperative behavior, agents are likely to focus only on accomplishing their own subtasks rather than on actions that lead to social and communal contributions, which would produce a large number of unused materials.

In Exp. 1, the performance under the GDR scheme and under the FRR scheme with \(r_{i}^{1}=0.1\) was roughly the same, but this was not the case in Exp. 2. Figure 9 plots the usage rate of the materials placed on the uninstalled cells. Each plot is the moving average over every 50 epochs, as in Fig. 5. Interestingly, this figure highlights a significant difference between the proposed GDR scheme and the baseline FRR method. When all agents adopted the GDR scheme, the usage rate of the placed materials in Exp. 2 was almost identical to that in Exp. 1, and thus there was no alteration in their coordinated behavior. However, under the FRR scheme with \(r_{i}^{1}=0.5\), 0.3, and 0.1, the usage rate converged to approximately 0.38, 0.43, and 0.70, respectively, which was considerably lower than the corresponding rates of approximately 0.58, 0.69, and 0.81 in Exp. 1. Note also that agents under the FRR scheme with \(r_{i}^{1}=0.0\) could not learn the behaviors for their own subtasks, as in Exp. 1.

Fig. 9 Usage rate of placed materials in Exp. 2

This tendency is also indicated in Table 5, which lists the percentage (%) of placements at each shortest distance and the completion rate of installations that successfully used the placed materials in Exp. 2, in the same format as Table 4. Because the rates of material placement under the FRR scheme in this table, especially in the BOR column, increased significantly relative to those in Table 4, the carrier agents under the FRR scheme apparently ignored the locations of the installation agents; hence, the number of unused materials increased. Under the GDR scheme, in contrast, the carrier agents still performed their subtasks in the same way, aware of whether an installation agent was approaching. Thus, the carrier agents could still take the desired coordinated behaviors, even though the number of installation agents decreased and the waiting time appeared to increase.

Table 5 Distance to the installation agents and associated task completion rate in Exp. 2

6 Discussion

We now summarize and discuss the results of our experiments. We found that the agents developed different behaviors depending on the reward ratios under the GDR and FRR schemes of the two-stage reward allocation method (with delay), in which rewards were allocated along with the agents’ actions in the problem of sequential cooperative tasks. In our problem, besides learning the behaviors necessary for executing their own subtasks, agents need to autonomously learn cooperative behaviors that facilitate the subsequent subtasks of different types of agents in order to increase the probability of completing the cooperative tasks. Under the FRR scheme, when the individual reward \(r_{i}^{1}\) given to the carrier agents is small and the contribution reward for task completion \(r_{i}^{2}\) is relatively large, the carrier agents are likely to learn the cooperative behavior that encourages the installation agents to execute their subtasks, so that the unloaded/placed materials are used with high probability. When the individual reward was made extremely small (i.e., \(r_{i}^{1}=0\)), the carrier agents could not receive any reward for their own subtasks, leaving few opportunities to learn the value of each action; thus, none of the agents could sufficiently learn the actions for their subtasks. Nevertheless, considering the requirements of our application problem, we believe it is natural for the environment to give rewards only after the entire task is completed.

On the other hand, when the individual reward \(r_{i}^{1}\) is large, the carrier agents could gain sufficient reward from executing their own subtasks alone, so learning for their own subtasks converged rapidly. However, they did not learn enough coordinated behavior for the installation agents, and many of the carried and placed materials were wasted. Under the GDR scheme, the carrier agents learned the behaviors for their own subtasks to some extent in the earlier stage of learning, and then gradually shifted their weight toward learning the coordinated behavior that supports the subsequent subtasks of the installation agents. From the viewpoint of the installation agents, because the carrier agents placed a large number of materials from the early stage of learning, they also had many opportunities to learn their own behaviors, i.e., finding a placed material and working with it, and they then adapted to the coordinated behavior of the carrier agents. Therefore, both the usage rate of the materials placed by the carrier agents and the task completion rate could be higher under the GDR scheme.
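To make the difference between the two schemes concrete, the following minimal sketch (our own illustration, not code from the experiments) contrasts the fixed split of the FRR scheme with a gradually decaying individual reward under the GDR scheme. The total budget R, the initial value r1_init, the linear decay, and the horizon decay_epochs are assumptions for illustration; the actual schedule may differ, for example by depending on the success rate of the agent's own subtask.

# Minimal sketch of the two-stage reward allocation (illustrative assumptions only).

R = 1.0  # assumed total per-task reward budget, R = r1 + r2

def frr_split(r1_fixed: float) -> tuple[float, float]:
    """FRR: the individual/contribution split is fixed throughout learning."""
    return r1_fixed, R - r1_fixed

def gdr_split(epoch: int, r1_init: float = 0.5, decay_epochs: int = 10_000) -> tuple[float, float]:
    """GDR: the individual reward r1 decays toward 0 as learning progresses,
    shifting the weight to the delayed contribution reward r2 (assumed linear decay)."""
    r1 = max(0.0, r1_init * (1.0 - epoch / decay_epochs))
    return r1, R - r1

def carrier_reward(r1: float, r2: float, placed: bool, task_completed: bool) -> float:
    """Two-stage allocation: r1 immediately for the agent's own subtask,
    r2 only (and with delay) when the whole sequential task is completed."""
    return (r1 if placed else 0.0) + (r2 if task_completed else 0.0)

# Example: at epoch 8,000 a GDR carrier earns mostly the contribution reward.
r1, r2 = gdr_split(epoch=8_000)
print(carrier_reward(r1, r2, placed=True, task_completed=True))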

Next, we focus on the coordinated behavior among agents of the same type. The agents divided the environment into several areas of responsibility in which to execute their own subtasks so as not to overlap with other agents, which prevented conflicts among them. This coordinated behavior was observed regardless of the ratio of the first individual reward \(r_{i}^{1}\) to the second contribution reward \(r_{i}^{2}\), provided the agents had sufficient opportunities for learning.

Further investigation showed that even when the individual reward \(r_{i}^{1}\) was positive, the behavior of the carrier agents differed significantly depending on its value. When \(r_{i}^{1}\) was large, the carrier agents tended to unload and place the materials regardless of whether an installation agent, their cooperative partner, was within the observable range. In contrast, under the GDR scheme or the FRR scheme with a small \(r_{i}^{1}\) in Exp. 1, the carrier agents tended to wait until an installation agent entered the observable range before placing the materials. This implies a trade-off between coordinated behavior and behavior that pursues only the completion of the agent’s own subtask: in the former cases, a carrier agent waited until at least one installation agent came close, which reduced the number of unused materials but slightly lowered its own efficiency.

However, the results of Exp. 2 show that even under the FRR scheme with a small individual reward \(r_{i}^{1}\), the agents’ coordinated behavior was easily broken and could not be fully maintained. Because the number of installation agents is smaller, it takes longer for one of them to approach a carrier agent, and the carrier agents do not keep waiting for this to happen: if \(r_{i}^{1}>0\), a carrier agent under the FRR scheme is still rewarded slightly even when the task is not completed, so its strategy shifts and the cooperative behavior is somewhat disrupted. In fact, even if \(r_{i}^{1}+r_{i}^{2}\) is fixed, as long as \(r_{i}^{1}>0\), the strategy of keeping the carrier agent waiting for an installation agent is not optimal, and this phenomenon becomes more pronounced as the expected waiting time increases. Such behavior deviates from our objective of learning to waste less material in the applications we envision. However, it is clear from Exp. 1 that agents could not learn at all when \(r_{i}^{1}=0\). In contrast, under the proposed two-stage reward allocation method with delay, i.e., the GDR scheme, \(r_{i}^{1}\) eventually converges to 0, and the coordinated behavior learned by the agents is consistent with our target applications.
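To see why, consider a simplified one-shot comparison (our own back-of-the-envelope illustration, assuming a discount factor \(\gamma<1\) and a fixed budget \(r_{i}^{1}+r_{i}^{2}\)) between a carrier agent that places the material immediately and one that waits \(w\) steps for an installation agent to come within range:

\[
\underbrace{r_{i}^{1} + p_{\mathrm{now}}\,\gamma^{d}\,r_{i}^{2}}_{\text{place immediately}}
\quad\text{vs.}\quad
\underbrace{\gamma^{w}\bigl(r_{i}^{1} + p_{\mathrm{wait}}\,\gamma^{d'}\,r_{i}^{2}\bigr)}_{\text{wait }w\text{ steps}},
\]

where \(p_{\mathrm{now}} < p_{\mathrm{wait}}\) denote the probabilities that the placed material is actually used and \(d, d'\) the remaining delays until completion. As the expected waiting time \(w\) grows, the right-hand side is damped by \(\gamma^{w}\), so an immediate \(r_{i}^{1}>0\) eventually dominates even though completion is unlikely; only as \(r_{i}^{1}\) approaches 0, as it does under the GDR scheme, is the carrier agent’s value driven solely by completing the sequential task.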

As expected, placing the material when an installation agent came close better guaranteed the execution of the subsequent subtask. This is partly because it is easier for the installation agent to reach the material, but more so because the two agents’ observable ranges overlap less when they are far apart; even if the carrier agent placed the material while the installation agent was within its observable range, the installation agent may still have been moving toward another material that was outside the carrier agent’s observable range. This can occur even after sufficient learning and may thus produce a large amount of unused material. In contrast, if the distance is 1 (closest), the difference between their observable ranges is small, and they are likely to set the same destination for their work actions. However, the installation agent needs to learn not only to locate the material but also to approach the agent carrying it. For this reason, the input information on the agent carrying the material, shown in Fig. 3(b), is essential.
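As a rough illustration of this overlap argument, the following sketch assumes that each agent observes a square window of side \(2V+1=7\) cells centered on itself (our reading of the observation model; the exact shape may differ) and computes how much of one agent's view is shared with another agent a given offset away:

# Overlap of two square observation windows of side 2V+1 (illustrative assumption).

V = 3                 # observation range used in the experiments
SIDE = 2 * V + 1      # assumed window side length (7 cells)

def shared_view(dx: int, dy: int) -> float:
    """Fraction of one agent's window that is also inside the other agent's window,
    for agents offset by (dx, dy) cells."""
    overlap_x = max(0, SIDE - abs(dx))
    overlap_y = max(0, SIDE - abs(dy))
    return (overlap_x * overlap_y) / (SIDE * SIDE)

if __name__ == "__main__":
    for d in (1, 3, 5, 7):
        print(f"offset ({d}, 0): {shared_view(d, 0):.2f} of the view is shared")
    # offset (1, 0) -> 0.86: adjacent agents see almost the same cells;
    # offset (7, 0) -> 0.00: distant agents see disjoint regions.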

We believe that the proposed two-stage reward allocation method with delay (the GDR scheme) is applicable to sequential tasks with a time limit that consist of three or more subtasks. In other words, agent i should receive one reward after completing its own subtask and another when all the subtasks in the sequence are completed. Moreover, in our environment, agents could develop behavior for their own subtasks and some degree of cooperative behavior for others even when \(r_{i}^{1}=0.1\) was fixed (the FRR scheme). However, the appropriate value of \(r_{i}^{1}\) presumably depends on the characteristics of the environment. Under the GDR scheme, on the other hand, it is relatively easy for agents to adjust the parameters by considering the success rate of their own subtasks and then to vary the ratio so as to learn cooperative behavior. We leave these topics for future work.
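As a sketch of the generalization to longer sequences mentioned above (an assumed form on our part, since our experiments involve only two subtask types), the per-agent reward would keep the same two-stage structure: an individual reward when agent i completes its own subtask, and a shared, delayed contribution reward only when the final subtask of the sequence is completed within the time limit.

# Assumed K-subtask form of the two-stage allocation (illustration only).
from typing import Sequence

def chain_rewards(done: Sequence[bool], r1: Sequence[float], r2: float,
                  within_time_limit: bool) -> list[float]:
    """Per-agent rewards for a sequence of K subtasks.
    Agent k receives r1[k] when its own subtask k is done, and every agent
    additionally receives the delayed r2 only if the whole chain is completed
    within the time limit."""
    chain_completed = all(done) and within_time_limit
    return [(r1[k] if done[k] else 0.0) + (r2 if chain_completed else 0.0)
            for k in range(len(done))]

# Three subtask types executed in sequence (assumed values); under the GDR
# scheme each r1[k] would decay toward 0 as learning progresses.
print(chain_rewards(done=[True, True, False], r1=[0.3, 0.3, 0.3], r2=0.7,
                    within_time_limit=True))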

When learning agents are deployed in our target applications, collision is a problem that must be solved. In many cases, collision avoidance is achieved by combining learned policies with rules, such as stop commands triggered by infrared sensors or other devices. This paper, however, focuses on the emergence of cooperative behavior through autonomous learning and thus omits such rule-based functions, which are necessary for real applications. In future work, we would like to develop construction robots that combine learned policies with these rule-based functions.

7 Conclusion

This paper introduced a sequential problem that abstracts the application of a robotic installation system on a construction site, and proposed a two-stage reward allocation method with delay to achieve both efficient execution of agents’ own subtasks and cooperative behaviors that facilitate the subtasks of other types of agents. We then investigated the effect of the reward allocation scheme on the learning of both behaviors. The experimental results showed that when the first reward for executing an agent’s own subtask was high, the agent learned the behavior for its own subtask efficiently but did not consider the agents with whom it should cooperate; it tended to neglect the coordinated behavior that enhances task completion. In contrast, when the first reward was low, the agent could learn cooperative behavior, but the learning of its own subtask took time to converge, and this delay reduced the opportunities for the subsequent agents to learn their subtasks. Based on these results, we proposed the GDR reward allocation method and showed that it could improve system performance by enabling both the efficient learning of the behaviors necessary for an agent’s own subtask and the coordination that facilitates the performance of subsequent agents. From the results of both experiments, we found that the coordinated behavior developed under the GDR scheme is robust and stable in the sense that it is maintained even when the numbers of the different types of coordinating agents are unbalanced.

We did not discuss the convergence speed of learning in depth. This is because our purpose was to learn balanced behavior, and for this purpose, it was necessary for the first agent to learn sufficiently in the first half of training and then gradually shift the weight of the reward. Analyzing the learning speed and related factors is necessary, but it is beyond the scope of this paper and would complicate it; we would like to address it in a future paper. Furthermore, we aim to analyze in more detail the relationship between the rate of decrease of the first reward \(r_{i}^{1}\) under the GDR scheme and the emerging behavior of the agents. We will also investigate the effect of reward allocation on cooperative behavior in more complex task configurations, such as problems with sequential execution of subtasks in multiple stages and problems in which negative rewards (punishments) exist. In addition, we plan to investigate the effects of the number of agents for each subtask, especially an unbalanced number of agents.