Reinforcement learning in a continuum of agents

Abstract

We present a decision-making framework for modeling the collective behavior of large groups of cooperatively interacting agents, based on a continuum description of the agents’ joint state. The continuum model is derived from an agent-based system of locally coupled stochastic differential equations, taking into account that each agent in the group is only partially informed about the global system state. The usefulness of the proposed framework is twofold: (i) for multi-agent scenarios, it provides a computational approach to handling large-scale distributed decision-making problems and learning decentralized control policies; (ii) for single-agent systems, it offers an alternative approximation scheme for evaluating expectations of state distributions. We demonstrate our framework on a variant of the Kuramoto model using a variety of distributed control tasks, such as positioning and aggregation. As part of our experiments, we compare the effectiveness of the controllers learned by the continuum model and by agent-based systems of different sizes, and we analyze how the degree of observability in the system affects the learning process.

Change history

  • 07 April 2018

    The original version of this article unfortunately contained a mistake. The presentation of Equation (21) was incorrect. The corrected equation is given below.

Notes

  2. Note that we use the terms decision-making and control interchangeably in this work.

  3. Note that both these exploration types are different from the exploration in policy space, which we discuss in detail in Sect. 4.

  4. Note that the value in Eq. (11) is based on a global definition of reward. We can easily switch to a “local” (i.e., agent-based) value computation by choosing \(R^G\) as in Eq. (10), which is in accordance with the definition of private value in Šošić et al. (2017).

  5. This function is not to be confused with the probability density function of a single agent’s state (see Sect. 3.4), which, in contrast to the object defined here, is a deterministic quantity.

  6. Recall that the continuum model requires only one system roll-out (see Sect. 3.3).

References

  • Abelson, H., Allen, D., Coore, D., Hanson, C., Homsy, G., Knight, T. F., et al. (2000). Amorphous computing. Communications of the ACM, 43(5), 74–82.

  • Aumann, R. J. (1964). Markets with a continuum of traders. Econometrica, 32(1), 39–50.

  • Beal, J. (2005). Programming an amorphous computational medium. In J. P. Banâtre, P. Fradet, J. L. Giavitto, & O. Michel (Eds.), Unconventional programming paradigms (pp. 121–136). Berlin: Springer.

  • Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.

  • Billingsley, P. (1999). Convergence of probability measures. New York: Wiley.

  • Brambilla, M., Ferrante, E., Birattari, M., & Dorigo, M. (2013). Swarm robotics: A review from the swarm engineering perspective. Swarm Intelligence, 7(1), 1–41.

  • Correll, N., & Martinoli, A. (2006). System identification of self-organizing robotic swarms. In M. Gini & R. Voyles (Eds.), Distributed autonomous robotic systems 7 (pp. 31–40). Tokyo: Springer Japan.

  • Couzin, I. D., Krause, J., James, R., Ruxton, G. D., & Franks, N. R. (2002). Collective memory and spatial sorting in animal groups. Journal of Theoretical Biology, 218(1), 1–11.

  • Crutchfield, J. P., & Mitchell, M. (1995). The evolution of emergent computation. Proceedings of the National Academy of Sciences, 92(23), 10742–10746.

  • Dean, D. S. (1996). Langevin equation for the density of a system of interacting Langevin processes. Journal of Physics A: Mathematical and General, 29(24), L613.

  • Deisenroth, M. P., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2), 1–142.

  • Doucet, A., Godsill, S., & Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3), 197–208.

  • Dubkov, A., & Spagnolo, B. (2005). Generalized Wiener process and Kolmogorov’s equation for diffusion induced by non-Gaussian noise source. Fluctuation and Noise Letters, 5(02), L267–L274.

  • Ermentrout, G. B., & Edelstein-Keshet, L. (1993). Cellular automata approaches to biological modeling. Journal of Theoretical Biology, 160(1), 97–133.

  • Fornberg, B., & Flyer, N. (2015). Solving PDEs with radial basis functions. Acta Numerica, 24, 215–258.

  • Freitas, R. A. (2005). Current status of nanomedicine and medical nanorobotics. Journal of Computational and Theoretical Nanoscience, 2(1), 1–25.

  • Grondman, I., Busoniu, L., Lopes, G. A. D., & Babuska, R. (2012). A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 1291–1307.

  • Hamann, H. (2014). Evolution of collective behaviors by minimizing surprise. In Proceedings of the 14th international conference on the synthesis and simulation of living systems (pp. 344–351). MIT Press.

  • Hamann, H., & Wörn, H. (2008). A framework of space–time continuous models for algorithm design in swarm robotics. Swarm Intelligence, 2(2), 209–239.

  • Hayes, A. T. (2002). How many robots? Group size and efficiency in collective search tasks. In H. Asama, T. Arai, T. Fukuda, & T. Hasegawa (Eds.), Distributed autonomous robotic systems 5 (pp. 289–298). Tokyo: Springer Japan.

  • Houchmandzadeh, B., & Vallade, M. (2015). Exact results for a noise-induced bistable system. Physical Review E, 91(2), 022115.

  • Hüttenrauch, M., Šošić, A., & Neumann, G. (2017). Guided deep reinforcement learning for swarm systems. In AAMAS workshop on autonomous robots and multirobot systems. arXiv:1709.06011.

  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.

  • Karatzas, I., & Shreve, S. (1998). Brownian motion and stochastic calculus. Berlin: Springer Science & Business Media.

  • Krylov, N. V. (2008). Controlled diffusion processes. Berlin: Springer Science & Business Media.

  • Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In International symposium on mathematical problems in theoretical physics (pp. 420–422). Springer.

  • Land, M., & Belew, R. K. (1995). No perfect two-state cellular automata for density classification exists. Physical Review Letters, 74(25), 5148.

  • Lasry, J.-M., & Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2(1), 229–260.

  • Lerman, K., Martinoli, A., & Galstyan, A. (2005). A review of probabilistic macroscopic models for swarm robotic systems. In Swarm robotics: SAB 2004 international workshop (pp. 143–152). Berlin: Springer.

  • Lesser, V., Ortiz, C. L., & Tambe, M. (2003). Distributed sensor networks: A multiagent perspective. Berlin: Springer Science & Business Media.

  • MacLennan, B. J. (1990). Continuous spatial automata. Technical report, University of Tennessee, Computer Science Department.

  • Macua, S. V., Chen, J., Zazo, S., & Sayed, A. H. (2015). Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5), 1260–1274.

  • Martinoli, A., Ijspeert, A. J., & Mondada, F. (1999). Understanding collective aggregation mechanisms: From probabilistic modelling to experiments with real robots. Robotics and Autonomous Systems, 29(1), 51–63.

  • Michini, B., & How, J. P. (2012). Bayesian nonparametric inverse reinforcement learning. In P. A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine learning and knowledge discovery in databases (pp. 148–163). Berlin: Springer.

  • Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research, 7, 771–791.

  • Ohkubo, J., Shnerb, N., & Kessler, D. A. (2008). Transition phenomena induced by internal noise and quasi-absorbing state. Journal of the Physical Society of Japan, 77(4), 044002.

  • Ramaswamy, S. (2010). The mechanics and statistics of active matter. Annual Review of Condensed Matter Physics, 1(1), 323–345.

  • Risken, H. (1996). Fokker–Planck equation. In H. Haken (Ed.), The Fokker–Planck equation (pp. 63–95). Berlin, Heidelberg: Springer.

  • Schweitzer, F. (2003). Brownian agents and active particles: Collective dynamics in the natural and social sciences. Berlin, Heidelberg: Springer.

  • Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4), 551–559.

  • Sipper, M. (1999). The emergence of cellular computing. Computer, 32(7), 18–26.

  • Šošić, A., KhudaBukhsh, W. R., Zoubir, A. M., & Koeppl, H. (2017). Inverse reinforcement learning in swarm systems. In Proceedings of the 16th international conference on autonomous agents and multiagent systems (pp. 1413–1421). International Foundation for Autonomous Agents and Multiagent Systems.

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.

  • Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I., & Shochet, O. (1995). Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75(6), 1226–1229.

  • Whitesides, G. M., & Grzybowski, B. (2002). Self-assembly at all scales. Science, 295(5564), 2418–2421.

Acknowledgements

H. Koeppl gratefully acknowledges support from the German Research Foundation (DFG) within the Collaborative Research Center (CRC) 1053 - MAKI.

Author information

Corresponding author

Correspondence to Adrian Šošić.

Appendix: Derivation of the continuum equation

In the following, we show how the continuum equation (15) can be derived from the agent-based system of stochastic differential equations (1) as the number of agents in the system approaches infinity. We follow the derivation in Dean (1996) and extend it with the necessary control-related objects.
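
For concreteness, such an agent-based system can be simulated directly with an Euler–Maruyama discretization before passing to the continuum limit. The sketch below is purely illustrative: the drift \(h\), the observation map \(\xi\), and the policy \(\pi_\theta\) are generic placeholders (a Gaussian neighborhood kernel, a relative-displacement observation, and a linear policy), not the specific Kuramoto-type dynamics used in the main text; the noise increments are scaled by \(\sqrt{2D\,\Delta t}\) to match the diffusion term \(D\nabla^2 f\) appearing in Eq. (24).

```python
import numpy as np

# Illustrative Euler-Maruyama simulation of an agent-based system of the form
# dX_i = h(X_i, pi_theta(xi(X_i, X))) dt + dW_i, with noise intensity D.
# All model ingredients below are placeholders, not the paper's concrete choices.

def xi(x_i, X, bandwidth=1.0):
    # placeholder observation: kernel-weighted mean displacement to the other
    # agents, mirroring the structure of the observation field in Eq. (28)
    k = np.exp(-0.5 * ((X - x_i) / bandwidth) ** 2)
    return np.sum(k * (X - x_i)) / np.sum(k)

def pi_theta(y, theta=0.5):
    # placeholder linear policy acting on the local observation
    return theta * y

def h(x, u):
    # placeholder drift: the control directly sets the drift
    return u

def simulate(N=100, T=5.0, dt=1e-2, D=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=N)   # initial agent states (1-D for brevity)
    for _ in range(int(T / dt)):
        U = np.array([pi_theta(xi(X[i], X)) for i in range(N)])
        drift = np.array([h(X[i], U[i]) for i in range(N)])
        # per-step noise variance 2*D*dt, matching the D * Laplacian term in Eq. (24)
        X = X + drift * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(N)
    return X
```

With this aggregation-type placeholder drift, each agent moves toward its kernel-weighted local center of mass while diffusing with intensity \(D\).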

Our goal is to find an expression for the temporal evolution of the global agent density \(\rho ^{(N)}(x,t)\) as \(N\rightarrow \infty \). We start with an Itô expansion of the stochastic differential equation (5),

$$\begin{aligned} \begin{aligned} {\mathrm {d}}f\big (X_i(t)\big )&= \nabla f\big (X_i(t)\big ) \cdot h\Big (X_i(t), \pi _\theta \big (\xi (X_i(t),X(t))\big )\Big ){\mathrm {d}}t \\&\phantom {=}\ + \ D\nabla ^2f\big (X_i(t)\big )\,{\mathrm {d}}t + \nabla f\big (X_i(t)\big ) \cdot {\mathrm {d}}W_i(t) ,\end{aligned} \end{aligned}$$
(24)

where \(f:\mathcal {X}\rightarrow \mathbb {R}\) is a twice-differentiable test function. Using the identity

$$\begin{aligned} f\big (X_i(t)\big ) = \int _{x\in \mathcal {X}} \rho _i(x,t)f(x) \, \mathrm {d}x ,\end{aligned}$$
(25)

which follows from the definition of the single-agent density in Eq. (14), we rewrite Eq. (24) as

$$\begin{aligned} {\mathrm {d}}f\big (X_i(t)\big )&= \int _{x\in \mathcal {X}} \bigg [ \nabla f(x) \cdot h\Big (x, \pi _\theta \big (\xi (x,X(t))\big )\Big ){\mathrm {d}}t \\&\phantom {=}\ + \ D\nabla ^2f(x)\,{\mathrm {d}}t + \nabla f(x) \cdot {\mathrm {d}}W_i(t)\bigg ] \rho _i(x,t) \, \mathrm {d}x .\end{aligned}$$

Next, we integrate this equation by parts, which yields

$$\begin{aligned} {\mathrm {d}}f\big (X_i(t)\big )&= \int _{x\in \mathcal {X}} \bigg \{-\nabla \cdot \bigg [\rho _i(x,t) h\Big (x, \pi _\theta \big (\xi (x,X(t))\big )\Big )\bigg ]{\mathrm {d}}t \\&\phantom {=}\ + D\nabla ^2\rho _i(x,t)\,{\mathrm {d}}t - \nabla \cdot \rho _i(x,t)\,{\mathrm {d}}W_i(t) \bigg \} f(x) \, \mathrm {d}x .\end{aligned}$$

On the other hand, identity (25) also implies that

$$\begin{aligned} {\mathrm {d}}f\big (X_i(t)\big ) = \int _{x\in \mathcal {X}} {\mathrm {d}}\rho _i(x,t)\,f(x) \, \mathrm {d}x .\end{aligned}$$

Comparing both equations, we conclude that

$$\begin{aligned} {\mathrm {d}}\rho _i(x,t)&= -\nabla \cdot \bigg [\rho _i(x,t) h\Big (x, \pi _\theta \big (\xi (x,X(t))\big )\Big )\bigg ]{\mathrm {d}}t \nonumber \\&\phantom {=}\ + D\nabla ^2\rho _i(x,t)\,{\mathrm {d}}t - \nabla \cdot \rho _i(x,t)\,{\mathrm {d}}W_i(t) .\end{aligned}$$

In order to obtain an expression for the global density, we sum up all agent-based increments, which gives

$$\begin{aligned} {\mathrm {d}}\rho ^{(N)}(x,t)&= \frac{1}{N} \sum _{i=1}^N {\mathrm {d}}\rho _i(x,t) \nonumber \\&= -\nabla \cdot \Big [\rho ^{(N)}(x,t) h\big (x, \bar{u}^{(N)}(x,t)\big )\Big ]{\mathrm {d}}t \nonumber \\&\phantom {=}\ + D\nabla ^2\rho ^{(N)}(x,t)\,{\mathrm {d}}t - \nabla \cdot \frac{1}{N} \sum _{i=1}^N \rho _i(x,t)\,{\mathrm {d}}W_i(t) , \end{aligned}$$
(26)

where we introduced the finite-size control field \(\bar{u}^{(N)}(x,t)\),

$$\begin{aligned} \bar{u}^{(N)}(x,t) :=\pi _\theta \big (\overline{y}^{(N)}(x,t)\big ) , \end{aligned}$$
(27)

and the underlying observation field \(\overline{y}^{(N)}(x,t)\),

$$\begin{aligned} \overline{y}^{(N)}(x,t)&:=\xi \big (x,X(t)\big ) = \frac{\int _\mathcal {X} \rho ^{(N)}(y,t)g(x,y)k(x,y) \, \mathrm {d}y}{\int _\mathcal {X} \rho ^{(N)}(y',t)k(x,y') \, \mathrm {d}y'} , \end{aligned}$$
(28)

as replacements for the agent-based control and observation signals, \(\{u_i(t)\}\) and \(\{Y_i(t)\}\), respectively. Note that the latter equation follows directly from Eq. (3) using the definition of the N-agent density in Eq. (13).
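
For intuition, once the density \(\rho^{(N)}(x,t)\) is represented on a spatial grid, the observation field in Eq. (28) and the control field in Eq. (27) can be evaluated by simple numerical quadrature. The following sketch assumes illustrative one-dimensional choices for the kernels \(g\) and \(k\) and for the policy \(\pi_\theta\); the concrete functions used in the article are those defined in the main text and are not reproduced here.

```python
import numpy as np

# Evaluate the observation field (Eq. 28) and control field (Eq. 27) on a 1-D grid.
# The kernels g, k and the policy pi_theta are illustrative placeholders.

def observation_field(grid, rho, g, k):
    # grid: (M,) spatial points; rho: (M,) density values on the grid
    dx = grid[1] - grid[0]
    y_bar = np.empty_like(rho)
    for m, x in enumerate(grid):
        num = np.trapz(rho * g(x, grid) * k(x, grid), dx=dx)
        den = np.trapz(rho * k(x, grid), dx=dx)
        y_bar[m] = num / den
    return y_bar

def control_field(grid, rho, g, k, pi_theta):
    return pi_theta(observation_field(grid, rho, g, k))

# illustrative placeholder choices
k = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # interaction kernel
g = lambda x, y: y - x                         # relative-position observation
pi_theta = lambda y: 0.5 * y                   # linear policy

grid = np.linspace(-3.0, 3.0, 121)
rho = np.exp(-0.5 * grid ** 2)
rho /= np.trapz(rho, grid)                     # normalized density estimate
u_bar = control_field(grid, rho, g, k, pi_theta)
```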

As shown by Dean (1996), the cumulative influence of the agent-dependent noise terms in Eq. (26) can be described by a statistically equivalent, agent-independent field of noise processes \(\overline{W}(x,t)\) with correlation function

$$\begin{aligned} {\mathbb {E}}\left[ \overline{W}_m(x,t)\overline{W}_n\left( y,t'\right) \right] = 2D\delta _{m,n}\delta _{x,y}\min \left( t,t'\right) ,\end{aligned}$$

where \(\overline{W}_m(x,t)\) denotes the \(m\text {th}\) component of the field at position x and time t. Equation (26) then simplifies to

$$\begin{aligned} {\mathrm {d}}\rho ^{(N)}(x,t)&= -\nabla \cdot \Big [\rho ^{(N)}(x,t) h\big (x, \bar{u}^{(N)}(x,t)\big )\Big ]{\mathrm {d}}t\\&\phantom {=}\ + D\nabla ^2\rho ^{(N)}(x,t)\,{\mathrm {d}}t + \nabla \cdot \left[ \frac{1}{N}\sqrt{\rho ^{(N)}(x,t)}\,{\mathrm {d}}\overline{W}(x,t) \right] .\end{aligned}$$

In the limit \(N\rightarrow \infty \), the stochastic component of this differential equation vanishes and we obtain our final convection–diffusion dynamics for the continuum density \(\rho (x,t)\),

$$\begin{aligned} \frac{\partial \rho (x,t)}{\partial t} = -\nabla \cdot \Big [\rho (x,t) h\big (x, \bar{u}(x,t)\big )\Big ] + D\nabla ^2\rho (x,t) ,\end{aligned}$$

where the continuum control field \(\bar{u}(x,t)\) and the underlying continuum observation field \(\overline{y}(x,t)\) are defined as in Eqs. (27) and (28), respectively, with \(\rho ^{(N)}(x,t)\) replaced by \(\rho (x,t)\).
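
As a minimal illustration of how this convection–diffusion equation can be integrated in time, the sketch below performs one explicit finite-difference Euler step on a one-dimensional periodic grid. This naive scheme is shown only for exposition (an explicit step requires \(\Delta t\) to be small relative to \(\Delta x^2/D\) for stability) and is not necessarily the discretization used in the article’s experiments; radial basis function methods (cf. Fornberg and Flyer 2015 in the references) are one alternative.

```python
import numpy as np

# One explicit Euler step of  d rho/dt = -d/dx [ rho * h(x, u_bar) ] + D * d^2 rho/dx^2
# on a 1-D periodic grid. Placeholder drift h; u_bar is the control field on the grid.

def continuum_step(rho, u_bar, grid, D, dt, h=lambda x, u: u):
    dx = grid[1] - grid[0]
    flux = rho * h(grid, u_bar)                                      # rho * h(x, u_bar(x))
    div_flux = (np.roll(flux, -1) - np.roll(flux, 1)) / (2.0 * dx)   # central difference
    laplacian = (np.roll(rho, -1) - 2.0 * rho + np.roll(rho, 1)) / dx ** 2
    return rho + dt * (-div_flux + D * laplacian)
```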

Cite this article

Šošić, A., Zoubir, A.M. & Koeppl, H. Reinforcement learning in a continuum of agents. Swarm Intell 12, 23–51 (2018). https://doi.org/10.1007/s11721-017-0142-9
