
Semantics and algorithms for trustworthy commitment achievement under model uncertainty

Autonomous Agents and Multi-Agent Systems

Abstract

We focus on how an agent can exercise autonomy while still dependably fulfilling commitments it has made to another agent, despite uncertainty about the outcomes of its actions and about how its own objectives might evolve. Our formal semantics treats a probabilistic commitment as constraints on the actions an autonomous agent can take, rather than as promises about states of the environment it will achieve. We have developed a family of commitment-constrained (iterative) lookahead algorithms that provably respect the semantics, and that support different tradeoffs between computation and plan quality. Our empirical results confirm that our algorithms’ ability to balance (selfish) autonomy and (unselfish) dependability outperforms optimizing either alone, that our algorithms can effectively handle uncertainty about both what actions do and which states are rewarding, and that our algorithms can solve more computationally demanding problems through judicious parameter choices for how far they should look ahead and how often they should iterate.


Notes

  1. For completeness, we should note that our semantics is also the probabilistic analogue of logic-based semantics for conditional commitments (Sect. 2). A conditional commitment asserts that a state in \({\varPhi }\) will provably be reached in worlds where the specified conditions hold, but makes no promises when those conditions do not hold. As long as the agent’s actions reach a state in \({\varPhi }\) when the conditions hold, the commitment is satisfied. Analogously, a probabilistic commitment asserts that a state in \({\varPhi }\) will assuredly be reached whenever one of the “good” subset of possible histories holds (where the probability of that occurring given the policy \(\pi\) is no less than \(\rho\)), but makes no promises otherwise; the constraint itself is restated after these notes. So, again analogously, as long as the agent takes actions prescribed by \(\pi\), the commitment is met regardless of whether a state in \({\varPhi }\) is reached in a specific episode.

  2. We should point out that our earlier paper [43] that considered this Bayesian setting did not impose this constraint, instead insisting that whatever policy is adopted from this point on, when appended to the policy followed so far, would satisfy the commitment semantics if followed from the initial state. While that weaker constraint generally performed correctly, we identified corner cases where a dishonest commitment provider could exploit it to increase its local reward. The constraint we provide here (also used in our more recent non-Bayesian paper [44]) closes this loophole.

  3. Our earlier work, which was limited to reward uncertainty, exploited this [43].
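
For reference, the probabilistic commitment semantics discussed in Note 1 is the policy constraint of Eq. (3), restated here exactly as it is used in the proof of Theorem 6 in the Appendix (\(\mu _0\) is the prior over the possible MDPs k, T is the horizon, and \(\rho\) is the committed probability):

$$\begin{aligned} \mathop {\mathrm{Pr}}\limits _{k\sim \mu _0} ( S_T \in {\varPhi } \mid S_0=s_0,k;\pi ) \ge \rho . \end{aligned}$$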

References

  1. Agotnes, T., Goranko, V., & Jamroga, W. (2007). Strategic commitment and release in logics for multi-agent systems (extended abstract). Technical Report IfI-08-01, Clausthal University.

  2. Al-Saqqar, F., Bentahar, J., Sultan, K., & El-Menshawy, M. (2014). On the interaction between knowledge and social commitments in multi-agent systems. Applied Intelligence, 41(1), 235–259.

  3. Altman, E. (1999). Constrained Markov decision processes (Vol. 7). Boca Raton: CRC Press.

  4. Bannazadeh, H., & Leon-Garcia, A. (2010). A distributed probabilistic commitment control algorithm for service-oriented systems. IEEE Transactions on Network and Service Management, 7(4), 204–217.

  5. Castelfranchi, C. (1995). Commitments: From individual intentions to groups and organizations. In Proceedings of the international conference on multiagent systems (pp. 41–48).

  6. Chesani, F., Mello, P., Montali, M., & Torroni, P. (2013). Representing and monitoring social commitments using the event calculus. Autonomous Agents and Multi-Agent Systems, 27(1), 85–130.

  7. Cohen, P. R., & Levesque, H. J. (1990). Intention is choice with commitment. Artificial Intelligence, 42(2–3), 213–261.

  8. CPLEX: IBM ILOG CPLEX 12.1. https://www.ibm.com/analytics/cplex-optimizer.

  9. Dolgov, D., & Durfee, E. (2005). Stationary deterministic policies for constrained MDPs with multiple rewards, costs, and discount factors. In International joint conference on artificial intelligence (Vol. 19, pp. 1326–1331).

  10. Dolgov, D. A., & Durfee, E. H. (2004). Optimal resource allocation and policy formulation in loosely-coupled Markov decision processes. In Proceedings of the fourteenth international conference on automated planning and scheduling (pp. 315–324).

  11. Durfee, E. H., & Singh, S. (2016). On the trustworthy fulfillment of commitments. In Autonomous agents and multiagent systems: AAMAS 2016 workshops best papers (pp. 1–13). Springer Lecture Notes in Artificial Intelligence. Also in Notes of the AAMAS Workshop on Trust in Agent Societies, May 2016.

  12. Günay, A., Liu, Y., & Zhang, J. (2016). Promoca: Probabilistic modeling and analysis of agents in commitment protocols. Journal of Artificial Intelligence Research, 57, 465–508.

  13. Günay, A., Songzheng, S., Liu, Y., & Zhang, J. (2015). Automated analysis of commitment protocols using probabilistic model checking. In Twenty-ninth AAAI conference on artificial intelligence.

  14. Gurobi: Gurobi 8.1. http://www.gurobi.com/products/gurobi-optimizer.

  15. Hansen, E. A. (1998). Finite-memory control of partially observable systems. Ph.D. Thesis, University of Massachusetts Amherst.

  16. Jennings, N. R. (1993). Commitments and conventions: The foundation of coordination in multi-agent systems. The Knowledge Engineering Review, 8(3), 223–250.

  17. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2), 99–134.

  18. Maheswaran, R. T., Szekely, P., Becker, M., Fitzpatrick, S., Gati, G., Jin, J., et al. (2008). Predictability & criticality metrics for coordination in complex environments. In Proceedings of the 7th international joint conference on autonomous agents and multiagent systems (Vol. 2, pp. 647–654).

  19. Mallya, A. U., & Huhns, M. N. (2003). Commitments among agents. IEEE Internet Computing, 7(4), 90–93.

  20. MATLAB: MATLAB optimization toolbox. https://www.mathworks.com/products/optimization.html.

  21. Meneguzzi, F., Magnaguagno, M. C., Singh, M. P., Telang, P. R., & Yorke-Smith, N. (2018). Goco: Planning expressive commitment protocols. Autonomous Agents and Multi-Agent Systems, 32(4), 459–502.

  22. Meneguzzi, F., Telang, P. R., & Yorke-Smith, N. (2015). Towards planning uncertain commitment protocols. In Proceedings of the 2015 international conference on autonomous agents and multiagent systems (pp. 1681–1682).

  23. OPTI: OPTI toolbox v2.2. https://www.inverseproblem.co.nz/OPTI.

  24. Pereira, R. F., Oren, N., & Meneguzzi, F. (2017). Detecting commitment abandonment by monitoring sub-optimal steps during plan execution. In Proceedings of the 16th conference on autonomous agents and multiagent systems (pp. 1685–1687).

  25. Poupart, P., Malhotra, A., Pei, P., Kim, K. E., Goh, B., & Bowling, M. (2015). Approximate linear programming for constrained partially observable Markov decision processes. In Twenty-ninth AAAI conference on artificial intelligence.

  26. Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.

  27. Raiffa, H. (1982). The art and science of negotiation. Cambridge: Harvard University Press.

  28. Sandholm, T., & Lesser, V. R. (2001). Leveled commitment contracts and strategic breach. Games and Economic Behavior, 35, 212–270.

  29. Santana, P., Thiébaux, S., & Williams, B. (2016). RAO*: An algorithm for chance-constrained POMDPs. In Thirtieth AAAI conference on artificial intelligence.

  30. Shiryaev, A. N. (1963). On optimum methods in quickest detection problems. Theory of Probability & Its Applications, 8(1), 22–46.

  31. Singh, M. P. (1999). An ontology for commitments in multiagent systems. Artificial Intelligence in the Law, 7(1), 97–113.

  32. Singh, M. P. (2012). Commitments in multiagent systems: Some history, some confusions, some controversies, some prospects. In The goals of cognition. Essays in honor of Cristiano Castelfranchi (pp. 601–626). London.

  33. Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5), 1071–1088.

  34. Smith, T., & Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 520–527).

  35. Sultan, K., Bentahar, J., & El-Menshawy, M. (2014). Model checking probabilistic social commitments for intelligent agent communication. Applied Soft Computing, 22, 397–409.

  36. Telang, P. R., Meneguzzi, F., & Singh, M. P. (2013). Hierarchical planning about goals and commitments. In Proceedings of the 2013 international conference on autonomous agents and multiagent systems (pp. 877–884).

  37. Vokrínek, J., Komenda, A., & Pechoucek, M. (2009). Decommitting in multi-agent execution in non-deterministic environment: Experimental approach. In 8th international joint conference on autonomous agents and multiagent systems (pp. 977–984).

  38. Winikoff, M. (2006). Implementing flexible and robust agent interactions using distributed commitment machines. Multiagent and Grid Systems, 2(4), 365–381.

  39. Witwicki, S. J., & Durfee, E. H. (2009). Commitment-based service coordination. International Journal of Agent-Oriented Software Engineering, 3(1), 59–87.

  40. Xing, J., & Singh, M. P. (2001). Formalization of commitment-based agent interaction. In Proceedings of the 2001 ACM symposium on applied computing (pp. 115–120).

  41. Xuan, P., & Lesser, V. R. (2000). Incorporating uncertainty in agent commitments. In Intelligent agents VI. Agent theories, architectures, and languages (pp. 57–70). Springer.

  42. Zhang, Q., Durfee, E. H., & Singh, S. (2018). Challenges in the trustworthy pursuit of maintenance commitments under uncertainty. In Proceedings of the 20th international trust workshop co-located with AAMAS/IJCAI/ECAI/ICML 2018 (pp. 75–86).

  43. Zhang, Q., Durfee, E. H., Singh, S., Chen, A., & Witwicki, S. J. (2016). Commitment semantics for sequential decision making under reward uncertainty. In Proceedings of the twenty-fifth international joint conference on artificial intelligence (pp. 3315–3323).

  44. Zhang, Q., Singh, S., & Durfee, E. (2017). Minimizing maximum regret in commitment constrained sequential decision making. In Twenty-seventh international conference on automated planning and scheduling (pp. 348–356).

Download references

Funding

Funding was provided by Air Force Office of Scientific Research (Grant No. FA9550-15-1-0039). We thank the anonymous reviewers for their thoughtful comments.

Author information

Corresponding author

Correspondence to Qi Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Here we present all the technical proofs of the theorems in this article.

Proof of Theorem 1

Note that the belief is a sufficient statistic: given history \(h_t\) at time step t and the corresponding belief \(b_t\) consistent with \(h_t\), one does not need any other information in \(h_t\) besides \(b_t\) to predict the future state transitions and rewards after time step t. Therefore, solving problem (4) is equivalent to solving a constrained MDP, where the MDP is the belief MDP defined as the tuple \(\langle {\mathcal {B}}, {\mathcal {A}}, b_0, {\tilde{P}}, {\tilde{R}} \rangle\) with a finite state space of beliefs, and the constraint comes from the semantics of commitment c. Our CCFL method can be viewed as a standard linear programming approach to solving a finite-state constrained MDP. \(\square\)
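
To make the reduction concrete, the following is a minimal sketch (not the authors' CCFL implementation) of solving a finite-horizon constrained MDP over an enumerated, finite belief space by linear programming over occupancy measures. The transition tensor `P`, reward matrix `R`, initial distribution `b0`, commitment set `PHI`, threshold `rho`, and horizon `T` are all assumed inputs, and `scipy` is used purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_constrained_belief_mdp(P, R, b0, PHI, rho, T):
    """Sketch: finite-horizon constrained MDP over an enumerated belief space,
    solved as an LP over occupancy measures x_t(s, a).

    P[s, a, s']  transition probabilities between (belief) states
    R[s, a]      expected immediate reward
    b0[s]        initial distribution
    PHI          indices of terminal states satisfying the commitment
    rho          required probability that the terminal state is in PHI
    T            horizon (actions taken at t = 0, ..., T-1)
    """
    S, A = R.shape
    n = T * S * A                                   # one variable per (t, s, a)
    idx = lambda t, s, a: (t * S + s) * A + a

    # Objective: maximize total expected reward, i.e. minimize its negation.
    c = np.zeros(n)
    for t in range(T):
        for s in range(S):
            for a in range(A):
                c[idx(t, s, a)] = -R[s, a]

    # Flow conservation: occupancy at t+1 equals the flow out of occupancy at t.
    A_eq, b_eq = [], []
    for s in range(S):                              # t = 0 matches b0
        row = np.zeros(n)
        for a in range(A):
            row[idx(0, s, a)] = 1.0
        A_eq.append(row); b_eq.append(b0[s])
    for t in range(T - 1):
        for s2 in range(S):
            row = np.zeros(n)
            for a in range(A):
                row[idx(t + 1, s2, a)] = 1.0
            for s in range(S):
                for a in range(A):
                    row[idx(t, s, a)] -= P[s, a, s2]
            A_eq.append(row); b_eq.append(0.0)

    # Commitment constraint Pr(S_T in PHI) >= rho, written as <= for linprog.
    row = np.zeros(n)
    for s in range(S):
        for a in range(A):
            row[idx(T - 1, s, a)] = -P[s, a, PHI].sum()
    A_ub, b_ub = [row], [-rho]

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n, method="highs")
    x = res.x.reshape(T, S, A)
    # Read a (time-dependent, stochastic) policy off the occupancy measure.
    policy = x / np.maximum(x.sum(axis=2, keepdims=True), 1e-12)
    return policy, -res.fun
```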

Proof of Theorem 2

It is sufficient to show (1) any policy in \({\varPi }_c \cap {\varPi }_L\) can be derived from a feasible solution to the program in Fig. 5, and (2) any feasible solution to the program derives a policy in \({\varPi }_c \cap {\varPi }_L\).

To show (1), given any policy \(\pi \in {\varPi }_c\cap {\varPi }_L\), we construct vectors \(m^\pi\) and \(n^\pi\) from \(\pi\): let \(n^\pi\) be its belief-action occupancy measure for beliefs in \({\mathcal {B}}_{\le L}^{b_0}\), and \(m^\pi\) be its state-action occupancy measure for states from time step L on:

$$\begin{aligned} \forall b\in {\mathcal {B}}_{\le L}^{b_0}, a\quad n^\pi (b,a) = \Pr (B_t=b,A_t=a|B_0=b_0;\pi ) \end{aligned}$$

where t is the time of belief b, and

$$\begin{aligned} \forall s, a \quad m^\pi _{b_L,k}(s,a) ={\left\{ \begin{array}{ll} \Pr (S_t=s,A_t=a, B_L=b_L,k|B_0=b_0;\pi ) &{} t \ge L\\ 0 &{} t < L \end{array}\right. } \end{aligned}$$

where t is the time of state s. Then, with \(m^\pi\) treated as x and \(n^\pi\) treated as y, \(m^\pi\) and \(n^\pi\) satisfy the constraints of the program in Fig. 5, and the L-updates policy \(\pi\) can be derived via Eq. (11).

To show (2), given a feasible solution (x, y) to the program, let \(\pi\) be the policy derived via Eq. (11). Then \(\pi\) is in \({\varPi }_L\) by definition. Further, we have \(m^\pi _{b_L,k}(s,a)=x_{b_L,k}(s,a)\) and \(n^\pi (b,a)=y(b,a)\), where \(m^\pi\) and \(n^\pi\) are defined as above. Therefore \(\pi\) is also in \({\varPi }_c\), because x satisfies the commitment constraints (12i), (12h). \(\square\)
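
Eq. (11) itself is not reproduced in this excerpt, but the standard way to read a stochastic policy off a belief-action occupancy measure is by normalization, \(\pi (a|b) = y(b,a) / \sum _{a'} y(b,a')\); the short sketch below assumes that form (beliefs with zero occupancy are unreachable under the solution, so any fallback distribution is consistent with the constraints).

```python
import numpy as np

def policy_from_occupancy(y):
    """Derive pi(a | b) from a belief-action occupancy measure y[b, a] by
    normalization (the assumed reading of Eq. (11), not a quote of it).
    Unreachable beliefs (zero total occupancy) get a uniform fallback."""
    totals = y.sum(axis=1, keepdims=True)
    uniform = np.full_like(y, 1.0 / y.shape[1])
    return np.where(totals > 0, y / np.maximum(totals, 1e-12), uniform)
```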

Proof of Theorem 3

By Theorem 2, CCL with boundary L finds the optimal policy in \({\varPi }_c\cap {\varPi }_L\). Therefore, it is sufficient to show

$$\begin{aligned} \forall L>0, {\varPi }_0 \subseteq {\varPi }_L. \end{aligned}$$

This holds because, given any Markov policy \(\pi _0\in {\varPi }_0\), we can define an L-updates policy \(\pi _L\in {\varPi }_L\) that is equivalent to \(\pi _0\):

$$\begin{aligned} \pi _L (a|h_t) = {\left\{ \begin{array}{ll} \pi _L (a|b_t)=\pi _0 (a|s_t) &{} t < L\\ \pi _L (a|s_t, b_L)=\pi _0 (a|s_t)&{} t \ge L \end{array}\right. }. \end{aligned}$$

Thus, we know that \(\pi _0\in {\varPi }_L\). \(\square\)
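
As a trivial illustration of this construction (illustrative interfaces only, not the paper's code), a Markov policy can be wrapped as an L-updates policy that simply ignores the belief argument in both regimes:

```python
def lift_markov_policy(pi0):
    """Wrap a Markov policy pi0(s_t) -> action distribution as an L-updates
    policy, mirroring the construction in the proof of Theorem 3: before step L
    the belief is ignored, and from step L on the frozen belief b_L is ignored,
    so the lifted policy behaves exactly like pi0."""
    def pi_L(s_t, belief=None, b_L=None):
        return pi0(s_t)
    return pi_L
```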

Proof of Theorem 4

It is sufficient to show that the statement holds when \(L'=L+1\). We next show that when \(P_k=P_{k'} ~\forall k, k'\), given any policy \(\pi _L \in {\varPi }_{L}\), there exists an \((L+1)\)-updates policy \(\pi _{L+1}\) that mimics \(\pi _L\), and therefore \(V^{\pi _L^*}_{\mu _0}(s_0) \le V^{\pi _{L+1}^*}_{\mu _0}(s_0)\).

For the first L actions, an \((L+1)\)-updates policy can map the current belief to the same action distribution as \(\pi _{L}\), and the action that \(\pi _{L}\) takes at time step L can also be reproduced by an \((L+1)\)-updates policy, which gives

$$\begin{aligned} \pi _{L+1} (a|h_t) = {\left\{ \begin{array}{ll} \pi _{L+1} (a|b_t) =\pi _{L} (a|b_t) &{} t < L\\ \pi _{L+1} (a|b_L) = \pi _{L} (a|s_L, b_L) &{} t=L \end{array}\right. }. \end{aligned}$$

Under any L-updates policy \(\pi _L\), and conditioned on being in belief \(b_{L+1}\) at time step \(L+1\), the agent thereafter selects actions according to \(\pi _L(\cdot |s_t,b_L)\) with the probability that it was in belief \(b_L\) at time step L, namely \(\Pr (b_L|b_{L+1};\pi _L)\). If the transition dynamics does not vary across MDPs in the environment, it is well known [26] that a Markov policy \(\pi _{b_{L+1}}(\cdot |s_t), t\ge L+1\), is sufficient to recover the state occupancy measure of \(\pi _L\) starting at belief \(b_{L+1}\). Then \(\pi _{L+1}\) can also recover \(\pi _{L}\) for \(t\ge L+1\) by choosing \(\pi _{b_{L+1}}\) so that

$$\begin{aligned} \pi _{L+1} (a|h_t) =\pi _{L+1} (a|s_t, b_{L+1}) = \pi _{b_{L+1}}(a|s_t) \qquad \text {for } t\ge L+1. \end{aligned}$$

This concludes the proof. \(\square\)

Proof of Theorem 5

In the proof of Theorem 4, we showed that for any L-updates policy \(\pi _L\) there exists an \((L+1)\)-updates policy that mimics \(\pi _L\) up to time step \(L+1\). Provided that \(P_k=P_{k'} ~\forall k, k'\), one can find a Markov policy that mimics \(\pi _L\) starting at any belief at time step \(L+1\). When \(P_k=P_{k'} ~\forall k, k'\) does not hold, however, this Markov policy in general does not exist, and therefore no \((L+1)\)-updates policy is able to mimic \(\pi _L\). Guided by this observation, we next give an example that serves as a formal constructive proof.

Consider the example shown in Fig. 13. The environment has 10 locations \(\{0,1,\ldots ,9\}\), action space \(\{up, down\}\), time horizon \(T=4\), and \(K=2\) possible MDPs. The agent starts in location 0 at time step \(t=0\) with prior probability 0.8 for MDP \(k=1\) and prior probability 0.2 for MDP \(k=2\). In MDP \(k=1\), no matter which action the agent takes, it transits to location 1 or 2 uniformly at random at time step \(t=1\), and then to location 3 with probability one at time step \(t=2\). Starting from location 3, on taking action up (down) the agent transits to the upper (lower) location to the right. The transition dynamics of MDP \(k=2\) is the same as MDP \(k=1\) until the agent reaches location 3, and thereafter the transitions are flipped: starting from location 3, on taking action up (down) the agent transits to the lower (upper) location to the right. In both MDPs, the agent receives a large negative reward (\(-\infty\)) in locations 7 and 8. In MDP \(k=1\), the agent receives \(+1\) reward if it reaches location 6. There is no reward elsewhere. The agent commits to reaching location 9 with probability 0.5. Consider the following \((L=)1\)-updates policy: if the agent was in location 1 at time step \(t=1\), always choose action up; if the agent was in location 2 at time step \(t=1\), always choose action down. Under this \((L=)1\)-updates policy, the probability of reaching the commitment location 9 is 0.5 and the expected reward is \(0.8\times 0.5\times 1=0.4\). Now consider \((L=)2\)-updates policies. Because the agent is in location 3 with probability one at time step \(t=2\), an \((L=)2\)-updates policy amounts to a Markov policy for time steps \(t\ge 2\). Furthermore, the agent should minimize the probability of reaching locations 7 and 8, which yield large negative reward. One can verify that the only Markov policy for time steps \(t\ge 2\) that avoids reaching locations 7 and 8 while respecting the commitment semantics is to always choose action down, whose expected reward is 0, smaller than that of the \((L=)1\)-updates policy. \(\square\)

Fig. 13: Example as a proof of Theorem 5

Proof of Theorem 6

We need to show \(\pi _{IL}\) satisfies Eq. (3), i.e.,

$$\begin{aligned} \mathop {\mathrm{Pr}}\limits _{k\sim \mu _0} ( S_T \in {\varPhi }| S_0=s_0,k;\pi _{IL}) \ge \rho . \end{aligned}$$

Let \(\pi _{L}\) be the CCL L-updates policy derived from the program in Fig. 5. The above inequality holds because:

$$\begin{aligned}&\mathop {\mathrm{Pr}}\limits _{k\sim \mu _0}(S_T\in {\varPhi }| S_0=s_0,k; \pi _{IL} )\\&\quad =\sum \limits _{b_I\in {\mathcal {B}}_{I}^{b_0}}\mathop {\mathrm{Pr}}\limits _{k\sim \mu _0}(B_I=b_I | S_0=s_0, k;\pi _{IL} )\Pr (S_T\in {\varPhi }| B_I=b_I; \pi _{IL} ) \\&\qquad \hbox {(law of total probability)}\\&\quad =\sum \limits _{b_I\in {\mathcal {B}}_{I}^{b_0}}\mathop {\mathrm{Pr}}\limits _{k\sim \mu _0}(B_I=b_I | S_0=s_0, k;\pi _{L} )\Pr (S_T\in {\varPhi }| B_I=b_I; \pi _{IL} ) \\&\qquad (\pi _L \, \hbox {and} \, \pi _{IL} \, \hbox {are identical in the first} \, I \, \hbox {steps)} \\&\quad \ge \sum \limits _{b_I\in {\mathcal {B}}_{I}^{b_0}}\mathop {\mathrm{Pr}}\limits _{k\sim \mu _0}(B_I=b_I | S_0=s_0, k;\pi _{L})\Pr (S_T\in {\varPhi }| B_I=b_I; \pi _{L} ) \\&\quad =\mathop {\mathrm{Pr}}\limits _{k\sim \mu _0}(S_T\in {\varPhi }| S_0=s_0, k;\pi _L ) \qquad \hbox {(law of total probability)} \\&\quad \ge \rho \qquad {(\pi _L\in {\varPi }_c)} \end{aligned}$$

The first inequality holds because CCIL iteratively applies L-step lookahead with the commitment threshold set to the commitment probability achieved by the policy of the previous iteration; that is, when replanning at belief \(b_I\), the new policy is constrained to reach \({\varPhi }\) with probability at least \(\Pr (S_T\in {\varPhi }| B_I=b_I; \pi _{L})\). This concludes the proof. \(\square\)
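
The schematic below sketches the loop structure this argument relies on (a sketch of the proof's logic only, not the paper's CCIL pseudocode; all three callables and their signatures are assumed interfaces): at each update point, the lookahead problem is re-solved with its commitment threshold set to the probability the previous policy already guarantees from the newly reached belief, which is what preserves the end-to-end guarantee from \(b_0\).

```python
def ccil_schematic(plan_with_lookahead, execute_and_update_belief,
                   commitment_prob_from, b0, rho, I, T):
    """Schematic of iterative lookahead replanning consistent with Theorem 6.

    plan_with_lookahead(b, threshold): returns a lookahead policy reaching
        the commitment states with probability >= threshold from belief b.
    execute_and_update_belief(policy, b, steps): runs the policy for `steps`
        steps from belief b and returns the updated belief.
    commitment_prob_from(policy, b): Pr(S_T in Phi | belief b; policy).
    """
    belief, threshold, policy = b0, rho, None
    for t in range(0, T, I):
        policy = plan_with_lookahead(belief, threshold)
        belief = execute_and_update_belief(policy, belief, steps=I)
        # Replan against the probability the current policy already guarantees
        # from the reached belief; the inequality chain above shows this keeps
        # the overall probability from b0 at or above rho.
        threshold = commitment_prob_from(policy, belief)
    return policy
```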

Cite this article

Zhang, Q., Durfee, E.H. & Singh, S. Semantics and algorithms for trustworthy commitment achievement under model uncertainty. Auton Agent Multi-Agent Syst 34, 19 (2020). https://doi.org/10.1007/s10458-020-09443-0
