Can bounded and self-interested agents be teammates? Application to planning in ad hoc teams

Abstract

Planning for ad hoc teamwork is challenging because it involves agents collaborating without any prior coordination or communication. The focus is on principled methods for a single agent to cooperate with others. This motivates investigating the ad hoc teamwork problem in the context of self-interested decision-making frameworks. Agents engaged in individual decision making in multiagent settings must reason about other agents’ actions, which may in turn involve reasoning about others. An established approximation that operationalizes this approach is to bound the infinite nesting from below by introducing level 0 models. For the purposes of this study, individual, self-interested decision making in multiagent settings is modeled using interactive dynamic influence diagrams (I-DIDs). These are graphical models with the benefit that they naturally offer a factored representation of the problem, allowing agents to ascribe dynamic models to others and reason about them. We demonstrate that an implication of bounded, finitely-nested reasoning by a self-interested agent is that optimal team solutions may not be obtained in cooperative settings when the agent is part of a team. We address this limitation by including models at level 0 whose solutions involve reinforcement learning. We show how the learning is integrated into planning in the context of I-DIDs. This facilitates optimal teammate behavior, and we demonstrate its applicability to ad hoc teamwork on several problem domains and configurations.


Notes

  1.

    Netus, a GUI-based software tool for designing I-DIDs, is freely available from http://tinyurl.com/mwrtlvg.

  2.

    The policy shown in Fig. 11a is also obtained when agent j is modeled using a level-1 I-DID and models i at level 0.


Author information


Corresponding author

Correspondence to Prashant Doshi.

Appendix

We show I-DIDs for three problem domains—BP, Grid, and MABC. For the sake of clarity, we limit the illustration to two-agent settings in which the level-1 agent i considers two level-0 models for the other agent j. The two IDs differ in their beliefs over the physical states. The conditional probability distributions (CPDs) of all the nodes in each I-DID are specified according to the problem described earlier in Sect. 5.

Multiagent box pushing

Fig. 19

(a) Level-1 I-ID of agent i in the multiagent BP domain. (b) Two level-0 IDs of agent j whose decision nodes are mapped to the chance nodes, \(A^1_j, A^2_j\), in (a), as indicated by the dotted arrows. The two IDs differ in their distributions over the chance node, Position&Orientation

First, an abstract representation of the level-1 I-ID for BP is shown in Fig. 19. The physical state specifies the joint position and orientation of both the agents. We represent this composite state space by a chance node labeled Position&Orientation. Each agent may sense the presence of a wall, other agent, a box, or an empty field in the direction it is facing. These observations are modeled by another chance node SenseFacing. We may unroll the I-ID in Fig. 19 into an I-DID spanning two time slices as shown in Fig. 20. The model node, \(M^t_{j,0}\), contains the different DIDs that are expanded from the level-0 IDs in Fig. 19b.
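As a concrete illustration of the SenseFacing observation described above, the sketch below has the agent inspect the single cell it faces and report a wall, the other agent, a box, or an empty field. The grid encoding (a dict from coordinates to cell contents, with out-of-grid cells treated as walls) and the deterministic sensing are our assumptions for exposition, not the paper's specification.

```python
# Hypothetical sketch of the SenseFacing observation in BP: the agent looks
# one cell ahead in its facing direction. The grid encoding ("#" for a box,
# "j" for the other agent, "." for empty) is an assumption of this sketch.

MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def sense_facing(grid, pos, orient):
    """grid: dict (x, y) -> cell content; cells outside the dict are walls."""
    dx, dy = MOVES[orient]
    cell = (pos[0] + dx, pos[1] + dy)
    if cell not in grid:
        return "wall"
    return {"#": "box", "j": "agent", ".": "empty"}[grid[cell]]
```

In a full I-DID this deterministic mapping would be replaced by the CPD of the SenseFacing chance node, which can also encode observation noise.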

Fig. 20

An abstract two time-slice level-1 I-DID for agent i in BP. The model node contains level-0 DIDs of agent j that expand the IDs

We may further exploit the structure of the problem by factoring the state space into position and orientation indexed by each participating agent. We draw additional benefits from also factoring the action space because some actions impact only certain factors of the state. For example, each agent may choose to perform one of four possible actions: turn left (TL), turn right (TR), move forward (MF), and stay (ST). The turn actions impact only the agent’s orientation, while the move and stay actions impact only the agent’s position in the grid. We illustrate this factorization of the chance nodes Position&Orientation and \(A_j\), and the decision node \(A_i\), in Fig. 21.
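The factored action semantics can be sketched in code: turn actions touch only the orientation factor, while move and stay touch only the position factor. The grid size, deterministic dynamics, and coordinate encoding below are illustrative assumptions, not the paper's parameterization.

```python
# Illustrative sketch of the factored BP transition for one agent.
# Turn actions (TL, TR) update only the orientation factor; move forward (MF)
# and stay (ST) update only the position factor. Walls block movement.

DIRS = ["N", "E", "S", "W"]                       # orientations, clockwise
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def step(pos, orient, action, width=3, height=3):
    """Apply one factored action to a single agent's (position, orientation)."""
    x, y = pos
    if action == "TL":                            # orientation factor only
        return pos, DIRS[(DIRS.index(orient) - 1) % 4]
    if action == "TR":
        return pos, DIRS[(DIRS.index(orient) + 1) % 4]
    if action == "MF":                            # position factor only
        dx, dy = MOVES[orient]
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:  # walls block movement
            return (nx, ny), orient
        return pos, orient
    return pos, orient                            # ST: no change
```

Because each action reads and writes only one factor, the corresponding CPDs in the factored I-DID stay small, which is exactly the benefit the factorization is meant to deliver.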

Fig. 21

Position and Orientation: per-agent factors of the composite state space in BP, abstractly represented by the node Position&Orientation. TurnActions and MoveActions: per-agent factors of the composite action space modeled by the nodes \(A_i\) and \(A_j\)

Fig. 22

A factored representation of the two time-slice level-1 I-DID for agent i in BP

Fig. 23

(a) Level-1 I-ID of agent i in the multiagent Grid domain. (b) Two level-0 IDs of agent j whose decision nodes are mapped to the chance nodes, \(A^1_j, A^2_j\), in (a), as indicated by the dotted arrows. The two IDs differ in the distribution over the chance node, GridLocation

Finally, in Fig. 22, we illustrate a fully factored representation of the level-1 I-DID (shown in Fig. 20) for agent i in BP.

Multiagent grid domain

An abstract representation of the level-1 I-ID for Grid is shown in Fig. 23. The physical states in this domain represent the joint location (x, y coordinates) of each agent in the grid. This composite space is modeled by the chance node GridLocation. We may unroll the I-ID in Fig. 23 into the corresponding I-DID spanning two time slices as shown in Fig. 24.

Fig. 24

An abstract two time-slice level-1 I-DID for agent i in Grid. The model node contains level-0 DIDs of agent j. At horizon 1, the models of j are IDs

Fig. 25

GridX and GridY: per-agent factors of the composite state space in Grid abstractly represented by the node GridLocation

In agent i’s I-DID, we assign the marginal distribution over the agents’ joint location to the conditional probability distribution (CPD) of the chance node GridLocation \(_i^t\). In the next time step, the CPD of the chance node GridLocation \(_i^{t+1}\), conditioned on GridLocation \(_i^t\), \(A^t_i\), and \(A^t_j\), is the transition function. The CPDs of the chance node GridLocation \(_i^{t+1}\), the observation node SenseWall \(^{t+1}_i\), and the utility node \(R_i\) are specified according to the problem described in Sect. 5. Finally, the CPD of the chance node \(Mod[M^{t+1}_j]\) in the model node, \(M^{t+1}_{j,l-1}\), reflects which prior model, action, and observation of j result in a model contained in the model node.
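To make the tabular transition CPD concrete, the sketch below builds P(GridLocation' | GridLocation, A_i, A_j) for a hypothetical one-dimensional corridor with a small slip probability. The cell count, the action names L/R/ST, and the noise model are assumptions made purely for exposition; only i's own action moves i's location factor in this simplified sketch, although a full CPD would also condition on j's effect.

```python
# Sketch of a tabular CPD for the GridLocation chance node: the next
# location is conditioned on the previous location and both agents' actions.
# With probability `slip` the agent stays put; otherwise it moves as intended.

def grid_cpd(loc, a_i, a_j, n_cells=3, slip=0.1):
    """Return P(loc' | loc, a_i, a_j) as a dict over next cells."""
    def move(pos, a):
        d = {"L": -1, "R": 1, "ST": 0}[a]
        return min(max(pos + d, 0), n_cells - 1)   # walls clamp movement
    intended = move(loc, a_i)          # i's action drives i's own factor;
                                       # a_j is kept in the signature because
                                       # the node conditions on it in the I-DID
    dist = {c: 0.0 for c in range(n_cells)}
    dist[intended] += 1.0 - slip       # intended transition
    dist[loc] += slip                  # slip: stay in place
    return dist
```

Each returned distribution sums to one by construction, which is the consistency requirement on any column of the CPD table.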

Fig. 26

A fully factored representation of the two time-slice level-1 I-DID for agent i in the multiagent Grid domain

Fig. 27

(a) Level-1 I-ID of agent i in the MABC domain. (b) Two level-0 IDs of agent j whose decision nodes are mapped to the chance nodes, \(A^1_j, A^2_j\), in (a), as indicated by the dotted arrows. The two IDs differ in the distribution over the chance node, BufferStatus

As in BP, we may factorize the physical state of Grid to specify the agents’ corresponding locations in terms of their x and y coordinates, as modeled by the chance nodes GridX and GridY shown in Fig. 25. On performing its action(s) at time step t, j may receive observations that detect a wall on its right, a wall on its left, or no wall on either side, as modeled in the observation node SenseWall. This is reflected in new beliefs about agent j’s position in the grid within j’s DIDs at time step \(t+1\). Consequently, the model node, \(M^{t+1}_{j,0}\), contains more models of j and i’s updated belief over j’s possible DIDs.
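The updating of i's belief over j's candidate models can be sketched as a Bayesian filter: each candidate model is reweighted by the likelihood it assigns to the observed action-observation pair. The two-model prior and the likelihood table in the usage are hypothetical numbers chosen for illustration.

```python
# Sketch of how i's belief over j's candidate models sharpens after j acts
# and observes. Models that better explain j's (action, observation) pair
# gain posterior mass; models assigning zero likelihood are ruled out.

def update_model_belief(prior, likelihoods, action, obs):
    """prior: {model: P(model)}; likelihoods: {model: P(action, obs | model)}."""
    post = {m: prior[m] * likelihoods[m].get((action, obs), 0.0) for m in prior}
    z = sum(post.values())
    if z == 0.0:                  # no model explains the data: keep the prior
        return dict(prior)
    return {m: p / z for m, p in post.items()}
```

For instance, with a uniform prior over two models where one assigns likelihood 0.8 to the observed pair and the other 0.2, the posterior shifts to 0.8 versus 0.2, mirroring how i's belief over j's DIDs concentrates across time steps.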

Figure 26 illustrates the fully factored representation of the level-1 agent i’s I-DID (in Fig. 24) for Grid.

Multi-access broadcast channel

A representation of the level-1 I-ID for MABC is shown in Fig. 27. The physical state represents the status of each agent’s (i.e., node’s) message buffer, whose size is assumed to be 1 in our setting. At the start of each time step, each node performs one of two actions: send a message (S) or wait (W). After performing an action, the node receives one of two noisy observations: collision (C) or no collision (NC), as modeled by the chance node, SenseCollision. We may unroll the I-ID in Fig. 27 into the corresponding I-DID spanning two time slices as shown in Fig. 28.
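A minimal sketch of the MABC dynamics, under our own simplifying assumptions (noise-free channel outcome, a buffer that refills with fixed probability, and a message retained in the buffer after a collision), might look as follows:

```python
# Illustrative MABC step: each node's buffer (0 = empty, 1 = full) and the
# joint send/wait actions determine the channel outcome. A collision occurs
# only when both nodes send, in which case both messages stay buffered.

import random

def mabc_step(buf_i, buf_j, a_i, a_j, p_fill=0.5, rng=random):
    """Return (new buffer of i, new buffer of j, channel outcome)."""
    sent_i = a_i == "S" and buf_i == 1
    sent_j = a_j == "S" and buf_j == 1
    if sent_i and sent_j:
        outcome = "C"                    # simultaneous sends collide
    else:
        outcome = "NC"                   # one sender succeeds; idle is quiet
    new_i = 0 if (sent_i and not sent_j) else buf_i
    new_j = 0 if (sent_j and not sent_i) else buf_j
    # an empty buffer refills with probability p_fill (assumed, not from paper)
    new_i = new_i or int(rng.random() < p_fill)
    new_j = new_j or int(rng.random() < p_fill)
    return new_i, new_j, outcome
```

In the actual I-DID, these dynamics are encoded declaratively in the CPDs of the BufferStatus and SenseCollision nodes rather than procedurally, and the C/NC observation itself is noisy.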

Fig. 28

A two time-slice level-1 I-DID for agent i in the MABC domain. The model node contains level-0 DIDs of agent j that expand the IDs shown in Fig. 27b

In agent i’s I-DID, we assign the marginal distribution over the agents’ joint buffer status to the CPD of the chance node BufferStatus \(_i^t\). In the next time step, the CPD of BufferStatus \(_i^{t+1}\) is the transition function. The CPDs of the chance node BufferStatus \(_i^{t+1}\), the observation node SenseCollision \(^{t+1}_i\), and the utility node \(R_i\) are specified according to the problem described in Sect. 5. Finally, the CPD of the chance node \(Mod[M^{t+1}_j]\) in the model node, \(M^{t+1}_{j,l-1}\), reflects which prior model, action, and observation of j result in a model contained in the model node.
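The way the Mod node's CPD grows the model space can be sketched combinatorially: each prior model of j, paired with one of j's possible actions and observations, maps to one updated model in the next time slice. The string identifiers below are hypothetical stand-ins for the actual DID structures.

```python
# Sketch of the model-space expansion encoded by the Mod node: one successor
# model per (prior model, action, observation) triple of agent j. With |M|
# prior models, |A| actions, and |O| observations, the next model node holds
# up to |M| * |A| * |O| models, which is why model-space growth dominates
# the cost of exactly solving I-DIDs.

from itertools import product

def expand_models(models, actions, observations):
    """Return the updated model set, one successor per (m, a, o) triple."""
    return {f"{m}|{a},{o}" for m, a, o in product(models, actions, observations)}
```

For MABC, two prior models of j combined with actions {S, W} and observations {C, NC} yield eight candidate models at the next step, before any pruning of behaviorally equivalent models.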


Cite this article

Chandrasekaran, M., Doshi, P., Zeng, Y. et al. Can bounded and self-interested agents be teammates? Application to planning in ad hoc teams. Auton Agent Multi-Agent Syst 31, 821–860 (2017). https://doi.org/10.1007/s10458-016-9354-4


Keywords

  • Multiagent systems
  • Ad hoc teamwork
  • Sequential decision making and planning
  • Reinforcement learning