Placing Approach-Avoidance Conflict Within the Framework of Multi-objective Reinforcement Learning

  • Original Article
  • Published:
Bulletin of Mathematical Biology

Abstract

Many psychiatric disorders are marked by impaired decision-making during an approach-avoidance conflict. Current experiments elicit approach-avoidance conflicts in bandit tasks by pairing an individual’s actions with consequences that are simultaneously desirable (reward) and undesirable (harm). We frame approach-avoidance conflict tasks as a multi-objective multi-armed bandit. By defining a general decision-maker as a limiting sequence of actions, we disentangle the decision process from learning. Each decision-maker can then be identified with a multi-dimensional point representing its long-term average expected outcomes, while different decision-making models can be associated with the geometry of their ‘feasible region’, the set of all possible long-term performances on a fixed task. We introduce three example decision-makers based on popular reinforcement learning models and characterize their feasible regions, including whether they can be Pareto optimal. From this perspective, we find that existing tasks are unable to distinguish between the three examples of decision-makers. We show how to design new tasks whose geometric structure can be used to better distinguish between decision-makers. These findings are expected to guide the design of approach-avoidance conflict tasks and the modeling of resulting decision-making behavior.

Data Availability

Data sharing not applicable to this article as no datasets were generated or analysed during the current study. Code is publicly available at https://github.com/eza0107/Multi-Objective-RL-and-Human-Decision-Making.

References

  • Aupperle RL, Paulus M (2010) Neural systems underlying approach and avoidance in anxiety disorders. Dialogues Clin Neurosci 12(4):517–531

  • Aupperle RL, Sullivan S, Melrose AJ, Paulus MP, Stein MB (2011) A reverse translational approach to quantify approach-avoidance conflict in humans. Behav Brain Res 225(2):455–463. https://doi.org/10.1016/j.bbr.2011.08.003

  • Bach DR, Guitart-Masip M, Packard PA, Miró J, Falip M, Fuentemilla L, Dolan RJ (2014) Human hippocampus arbitrates approach-avoidance conflict. Curr Biol 24(5):541–547

  • Bechara A, Damasio AR, Damasio H, Anderson SW (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1–3):7–15. https://doi.org/10.1016/0010-0277(94)90018-3

  • Castelletti A, Corani G, Rizzolli A, Soncinie-Sessa R, Weber E (2002) Reinforcement learning in the operational management of a water system. In: IFAC workshop on modeling and control in environmental issues, pp 325–330. Citeseer

  • Cochran AL, Cisler JM (2019) A flexible and generalizable model of online latent-state learning. PLoS Comput Biol 15(9):e1007331

  • Drugan MM, Nowe A (2013) Designing multi-objective multi-armed bandits algorithms: a study. In: The 2013 international joint conference on neural networks (IJCNN). IEEE, pp 1–8

  • Enkhtaivan E, Nishimura J, Ly C, Cochran AL (2021) A competition of critics in human decision-making. Comput Psychiatry 5(1)

  • Gaskett C (2003) Reinforcement learning under circumstances beyond its control. In: Proceedings of the international conference on computational intelligence for modelling control and automation

  • Gershman SJ, Blei DM, Niv Y (2010) Context, learning, and extinction. Psychol Rev 117(1):197

  • Hayes SC, Strosahl KD, Wilson KG (2011) Acceptance and commitment therapy: the process and practice of mindful change. Guilford Press, New York

  • Haynos AF, Widge AS, Anderson LM, Redish AD (2022) Beyond description and deficits: How computational psychiatry can enhance an understanding of decision-making in anorexia nervosa. Curr Psychiatry Rep 1–11

  • Johnston WA, Dark VJ (1986) Selective attention. Annu Rev Psychol

  • Kirlic N, Young J, Aupperle RL (2017) Animal to human translational paradigms relevant for approach avoidance conflict decision making. Behav Res Ther 96:14–29

  • Kwak J-y, Varakantham P, Maheswaran R, Tambe M, Hayes T, Wood W, Becerik-Gerber B (2012) Towards robust multi-objective optimization under model uncertainty for energy conservation. In: AAMAS workshop on agent technologies for energy systems (ATES)

  • Lejuez CW, Read JP, Kahler CW, Richards JB, Ramsey SE, Stuart GL, Strong DR, Brown RA (2002) Evaluation of a behavioral measure of risk taking: the balloon analogue risk task (bart). J Exp Psychol Appl 8(2):75

  • Letkiewicz AM, Kottler HC, Shankman SA, Cochran AL (2023) Quantifying aberrant approach-avoidance conflict in psychopathology: a review of computational approaches. Neurosci Biobehav Rev 2023:105103

  • Lewin K (2013) A dynamic theory of personality-selected papers. Read Books Ltd, New York

  • Loijen A, Vrijsen JN, Egger JI, Becker ES, Rinck M (2020) Biased approach-avoidance tendencies in psychopathology: a systematic review of their assessment and modification. Clin Psychol Rev 77:101825

  • McDermott TJ, Berg H, Touthang J, Akeman E, Cannon MJ, Santiago J, Cosgrove KT, Clausen AN, Kirlic N, Smith R et al (2022) Striatal reactivity during emotion and reward relates to approach-avoidance conflict behaviour and is altered in adults with anxiety or depression. J Psychiatry Neurosci 47(5):E311–E322

  • Moffaert KV, Drugan MM, Nowe A (2013) Scalarized multi-objective reinforcement learning: novel design techniques. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL). https://doi.org/10.1109/adprl.2013.6615007

  • Nishimura J, Cochran AL (2020) Rescorla-Wagner models with sparse dynamic attention. Bull Math Biol 82(6):1–37

  • Niv Y, Edlund JA, Dayan P, O’Doherty JP (2012) Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. J Neurosci 32(2):551–562

  • Pittig A, Brand M, Pawlikowski M, Alpers GW (2014) The cost of fear: avoidant decision making in a spider gambling task. J Anxiety Disord 28(3):326–334

  • Redish AD, Jensen S, Johnson A, Kurth-Nelson Z (2007) Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol Rev 114(3):784

  • Reverdy PB, Srivastava V, Leonard NE (2014) Modeling human decision making in generalized gaussian multiarmed bandits. Proc IEEE 102(4):544–571

  • Rolle CE, Pedersen ML, Johnson N, Amemori K-I, Ironside M, Graybiel AM, Pizzagalli DA, Etkin A (2022) The role of the dorsal-lateral prefrontal cortex in reward sensitivity during approach-avoidance conflict. Cereb Cortex 32(6):1269–1285

  • Ross MC, Lenow JK, Kilts CD, Cisler JM (2018) Altered neural encoding of prediction errors in assault-related posttraumatic stress disorder. J Psychiatr Res 103:83–90

  • Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275(5306):1593–1599

  • Shelton CR (2001) Importance sampling for reinforcement learning with multiple objectives. PhD thesis, Massachusetts Institute of Technology

  • Smith R, Kirlic N, Stewart JL, Touthang J, Kuplicki R, Khalsa SS, Feinstein J, Paulus MP, Aupperle RL (2021) Greater decision uncertainty characterizes a transdiagnostic patient sample during approach-avoidance conflict: a computational modelling approach. J Psychiatry Neurosci 46(1):E74–E87

  • Smith R, Kirlic N, Stewart JL, Touthang J, Kuplicki R, McDermott TJ, Taylor S, Khalsa SS, Paulus MP, Aupperle RL (2021) Long-term stability of computational parameters during approach-avoidance conflict in a transdiagnostic psychiatric patient sample. Sci Rep 11(1):1–13

  • Smith R, Lavalley CA, Taylor S, Stewart JL, Khalsa SS, Berg H, Ironside M, Paulus MP, Aupperle RL (2023) Elevated decision uncertainty and reduced avoidance drives in depression, anxiety and substance use disorders during approach-avoidance conflict: a replication study. J Psychiatry Neurosci 48(3):E217–E231

  • Sripada C, Weigard A (2021) Impaired evidence accumulation as a transdiagnostic vulnerability factor in psychopathology. Front Psychiatry 12:627179

  • Steyvers M, Lee MD, Wagenmakers E-J (2009) A Bayesian analysis of human decision-making on bandit problems. J Math Psychol 53(3):168–179

  • Stolz O (1885) Vorlesungen über allgemeine Arithmetik: nach den Neueren Ansichten, vol 1. BG Teubner, Berlin

  • Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge

  • Talmi D, Dayan P, Kiebel SJ, Frith CD, Dolan RJ (2009) How humans integrate the prospects of pain and reward during choice. J Neurosci 29(46):14617–14626

  • Treisman AM (1969) Strategies and models of selective attention. Psychol Rev 76(3):282

  • Van Moffaert K, Nowé A (2014) Multi-objective reinforcement learning using sets of pareto dominating policies. J Mach Learn Res 15(1):3483–3512

  • Weaver SS, Kroska EB, Ross MC, Sartin-Tarm A, Sellnow KA, Schaumberg K, Kiehl KA, Koenigs M, Cisler JM (2020) Sacrificing reward to avoid threat: characterizing ptsd in the context of a trauma-related approach-avoidance conflict task. J Abnorm Psychol 129(5):457–468. https://doi.org/10.1037/abn0000528

  • Zitzler E, Knowles J, Thiele L (2008) Quality assessment of pareto set approximations. Multiobjective Optim 2008:373–404

  • Zorowitz S, Momennejad I, Daw ND (2020) Anxiety, avoidance, and sequential evaluation. Comput Psychiatry 4

  • Zorowitz S, Rockhill AP, Ellard KK, Link KE, Herrington T, Pizzagalli DA, Widge AS, Deckersbach T, Dougherty DD (2019) The neural basis of approach-avoidance conflict: a model based analysis. Eneuro 6(4)

Author information

Corresponding author

Correspondence to Amy Cochran.

Ethics declarations

Conflict of interest

Authors EE and JN declare they have no financial interests. Author ALC has received financial support from the American Psychiatric Association to serve as Statistical Editor for the American Journal of Psychiatry.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Long-Term Limit

We begin by proving the claim made in Definition 1. That is, for a decision-maker D associated with action sequence \(\left( A_t\right) _{t=1}^\infty = \left( A_1,A_2,A_3, \dots \right) ,\) one has:

$$\begin{aligned} \rho (D){:=}\lim \limits _{T\rightarrow \infty }\frac{1}{T}\sum _{t=1}^T{\mathbb {E}}[{\varvec{R}}_t] = \sum _{i=1}^n \varvec{\mu }_i\pi _\infty (i)\in {\mathbb {R}}^m \end{aligned}$$
(A1)

Proof

Let \(\pi _t\) and \(\pi _\infty \) be the probability distributions of \(A_t\) and \(A_\infty \) as defined in Definition 1. Then,

$$\begin{aligned} {\mathbb {E}}[{\varvec{R}}_t] = \sum \limits _{i=1}^{n}{\mathbb {E}}[{\varvec{R}}_t\vert A_t = i]\,{\mathbb {P}}(A_t = i) = \sum \limits _{i=1}^{n}\varvec{\mu }_i\pi _t(i) \end{aligned}$$

Note that n is fixed and finite and that \(\pi _t(i)\longrightarrow \pi _\infty (i)\) for each \(i\in {\mathcal {A}},\) as \(t\longrightarrow \infty .\) Therefore,

$$\begin{aligned} \lim \limits _{t\rightarrow \infty }{\mathbb {E}}[{\varvec{R}}_t] = \lim \limits _{t\rightarrow \infty }\left( \sum \limits _{i=1}^{n}\varvec{\mu }_i\pi _t(i)\right) = \sum \limits _{i=1}^{n}\varvec{\mu }_i\pi _\infty (i)\in {\mathbb {R}}^m. \end{aligned}$$

Now the conclusion directly follows from the Stolz-Cesaro theorem of Stolz (1885):

$$\begin{aligned} \rho (D){:=}\lim \limits _{T\rightarrow \infty }\frac{1}{T}\sum _{t=1}^T{\mathbb {E}}[{\varvec{R}}_t] = \lim \limits _{t\rightarrow \infty } {\mathbb {E}}[\varvec{R}_t] = \sum _{i=1}^n \varvec{\mu }_i\pi _\infty (i)\in {\mathbb {R}}^m \end{aligned}$$

\(\square \)
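
To make the limit concrete, the following minimal sketch (in Python, with a hypothetical task and a hypothetical decision-maker whose action probabilities \(\pi _t\) converge to \(\pi _\infty \)) checks numerically that the running average of outcomes approaches \(\sum _i \varvec{\mu }_i\pi _\infty (i)\); all numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: n = 3 actions, m = 2 outcome dimensions (e.g. reward, harm).
# Rows are the mean outcome vectors mu_i.
mu = np.array([[1.0, 0.0],
               [0.5, -0.5],
               [0.0, -1.0]])

# Hypothetical decision-maker: action probabilities pi_t that converge to pi_inf.
pi_inf = np.array([0.2, 0.5, 0.3])

def pi_t(t):
    # transient term decays like 1/t, so pi_t -> pi_inf
    p = pi_inf + np.array([0.5, -0.3, -0.2]) / (t + 1)
    return p / p.sum()

T = 50_000
avg = np.zeros(2)
for t in range(1, T + 1):
    a = rng.choice(3, p=pi_t(t))
    r = mu[a] + rng.normal(scale=0.1, size=2)  # noisy outcome with mean mu_a
    avg += (r - avg) / t                       # running average over trials

print("empirical long-run average:", avg)          # close to [0.45, -0.55]
print("predicted rho(D):          ", pi_inf @ mu)  # [0.45, -0.55]
```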

Appendix B: Proof of Proposition 1

Here, we give the proof of Proposition 1, which says that

$$\begin{aligned} {\mathcal {F}}({\mathcal {D}}) = \{c_1\varvec{\mu }_1 +c_2\varvec{\mu }_2+ \cdots + c_n\varvec{\mu }_n\, \vert \, c_i\in [0,1]\,,c_1+c_2+\cdots +c_n = 1 \}. \end{aligned}$$

Proof

The proof in Appendix A tells us that for a given decision-maker D, \(\rho (D)\in \{c_1\varvec{\mu }_1 +c_2\varvec{\mu }_2+ \cdots + c_n\varvec{\mu }_n\, \vert \, c_i\in [0,1]\,\,,c_1+c_2+\cdots +c_n = 1 \}\) by simply choosing \(c_i = \pi _\infty (i).\) Therefore,

$$\begin{aligned} {\mathcal {F}}({\mathcal {D}}) \subseteq \{c_1\varvec{\mu }_1 +c_2\varvec{\mu }_2+ \cdots + c_n\varvec{\mu }_n\, \vert \, c_i\in [0,1]\,,c_1+c_2+\cdots +c_n = 1 \} \end{aligned}$$

For the other direction, any given \(c = (c_1,c_2,\dots , c_n)^T\in {\mathbb {R}}^n\) with \(c_1+c_2+\cdots +c_n = 1, c_i\ge 0\) defines a probability distribution over the action set \({\mathcal {A}}.\) If we let D be a decision-maker who chooses action i with probability \(c_i\) on every trial, then \(\pi _\infty (i) = c_i\) and \(\rho (D) = c_1\varvec{\mu }_1 +c_2\varvec{\mu }_2+ \cdots + c_n\varvec{\mu }_n\), which gives:

$$\begin{aligned} \{c_1\varvec{\mu }_1 +c_2\varvec{\mu }_2+ \cdots + c_n\varvec{\mu }_n\, \vert \, c_i\in [0,1]\,,c_1+c_2+\cdots +c_n = 1 \} \subseteq {\mathcal {F}}({\mathcal {D}}). \end{aligned}$$

\(\square \)

Appendix C: Proofs of Propositions 2–4

Before we move on to the proofs of the main propositions below, let us briefly elaborate on the Pareto front \({\mathcal {P}}\). As we defined in the main text, \({\mathcal {P}}\) is precisely the Pareto optimal part of the boundary of the convex polytope \({\mathcal {F}}({\mathcal {D}})\). In other words, \({\mathcal {P}}\) is the finite union of the Pareto optimal faces of \({\mathcal {F}}({\mathcal {D}})\), each of which is a convex polytope of dimension at most \(m-1\). If F is one such face, generated by s vertices, then by renumbering vertices we can represent

$$\begin{aligned} F = \{c_1\varvec{\mu }_1 + c_2\varvec{\mu }_2 + \cdots +c_s\varvec{\mu }_s,\, \vert \, c_1+c_2+\cdots +c_s = 1, c_i\ge 0 \} \end{aligned}$$

with \(s\le m.\) Importantly, \(\varvec{\mu }_1,\dots , \varvec{\mu }_s\) will be linearly independent by our assumed non-degeneracy of \(\Omega \) (Assumption 1). Furthermore, let us note that no point \(\varvec{\mu }\) on \({\mathcal {P}}\) is dominated by any convex combination of the vertices in \(\Omega \). That is, if we let \(\prec \) denote Pareto dominance, then

$$\begin{aligned} \forall \varvec{\mu }\in {\mathcal {P}},\, \forall \varvec{\alpha } = (\alpha _1,\alpha _2,\dots , \alpha _n)^T:\,\, \alpha _i\ge 0\,, \sum \limits _{i=1}^{n}\alpha _i = 1\implies \varvec{\mu } \nprec \sum \limits _{j=1}^{n}\alpha _j\varvec{\mu }_j \end{aligned}$$
(C2)

But if we restrict our attention to F only, something even stronger is true:

$$\begin{aligned} \forall \varvec{\mu }, \varvec{\mu }'\in F \text { with } \varvec{\mu }\ne \varvec{\mu }'\implies \varvec{\mu }\not \prec \varvec{\mu }' \end{aligned}$$
(C3)
$$\begin{aligned} \varvec{\mu }\not \succ \varvec{\mu }' \end{aligned}$$
(C4)

Now we prove an auxiliary lemma that is a strengthening of the above property for a given face \(F\subseteq {\mathcal {P}}.\)

Lemma 1

Let F be a face on \({\mathcal {P}}\), generated by convex combinations of linearly independent, Pareto optimal vertices \(\varvec{\mu }_1,\dots , \varvec{\mu }_s.\) Then for any \(i\in \{1,2,\dots , s\}\), one has:

$$\begin{aligned} \forall \varvec{\alpha } = (\alpha _1,\alpha _2,\dots , \alpha _s)^T:\,\, \, \sum \limits _{j=1}^{s}\alpha _j = 1\implies&\varvec{\mu }_i\not \prec \sum \limits _{j=1}^{s}\alpha _j\varvec{\mu }_j \\&\varvec{\mu }_i\not \succ \sum \limits _{j=1}^{s}\alpha _j\varvec{\mu }_j \end{aligned}$$

Proof

Note that the strengthening refers to the removal of the requirement that the \(\alpha _j\)’s must be non-negative. This means that none of the \(\varvec{\mu }_i\) is dominated by an affine combination of \(\varvec{\mu }_1,\dots , \varvec{\mu }_s.\) Suppose otherwise and assume without loss of generality that \(\varvec{\mu }_1\) is strictly dominated by an affine but not convex combination of the \(\varvec{\mu }_1,\dots , \varvec{\mu }_s:\)

$$\begin{aligned} \varvec{\mu }_1 < \sum \limits _{j=1}^{s}\alpha _j\varvec{\mu }_j \end{aligned}$$

Since at least one of the \(\alpha _j\) must be negative and another of the \(\alpha _j\) must be positive, we can alternatively write the above as:

$$\begin{aligned} \varvec{\mu }_{1} + \sum \limits _{j\in I^-}\alpha _j^-\varvec{\mu }_{j}<\sum \limits _{j\in I^+}\alpha _j^+\varvec{\mu }_{j} \end{aligned}$$

with

$$\begin{aligned} \{\alpha _1, \alpha _2,\dots , \alpha _s\} = \{\alpha ^+_j\,\vert \, \alpha _j^+\ge 0,\, j\in I^+ \}\cup \{-\alpha ^-_j\,\vert \,\alpha _j^-\ge 0,\, j\in I^-\},\, \end{aligned}$$

and \( I^+\cup I^- = \{1,2,\dots , s\},\,\,\, I^+\cap I^- = \emptyset .\) Notice that:

$$\begin{aligned} \sum \limits _{j\in I^+}\alpha _j^+ = 1 + \sum \limits _{j\in I^-}\alpha _j^-{:=}A > 1. \end{aligned}$$

Therefore, we obtain:

$$\begin{aligned} \frac{\varvec{\mu }_{1}}{A} + \sum \limits _{j\in I^-}\frac{\alpha _j^-}{A}\varvec{\mu }_{j}<\sum \limits _{j\in I^+}\frac{\alpha _j^+}{A}\varvec{\mu }_{j} \end{aligned}$$

However, this leads to a contradiction to (C2) since the convex combination of \(\varvec{\mu }_{1}\) and \(\{\varvec{\mu }_{j}: j\in I^-\}\), a Pareto optimal point on the face F, cannot be dominated by a convex combination of \(\{\varvec{\mu }_{j}: j\in I^+\}.\) The inequality in the other direction follows the exact same reasoning. \(\square \)

We need another auxiliary result, namely the famous Farkas’ lemma which we state here without proof.

Lemma 2

(Farkas’ lemma) Let \(\varvec{A}\in {\mathbb {R}}^{m\times n}\) and \(\varvec{b}\in {\mathbb {R}}^m\). Then exactly one of the following two assertions is true:

  1. There exists an \(\varvec{x}\in {\mathbb {R}}^n\) such that \(\varvec{A}\varvec{x} = \varvec{b}\) and \(\varvec{x}\ge 0.\)

  2. There exists a \(\varvec{y}\in {\mathbb {R}}^m\) such that \(\varvec{A}^T\varvec{y}\ge 0\) and \(\varvec{b}^T\varvec{y}<0.\)

Now we are ready to state and prove our final auxiliary lemma:

Lemma 3

Let F be a Pareto optimal face with vertices \(\varvec{\mu }_1,\dots , \varvec{\mu }_s.\) There exists a strictly positive \(\varvec{y} \in {\mathbb {R}}^m\) such that

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 = \varvec{y}^T \varvec{\mu }_2 = \cdots = \varvec{y}^T \varvec{\mu }_s \end{aligned}$$

and for \(j=s+1,\ldots ,n\),

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 > \varvec{y}^T \varvec{\mu }_j. \end{aligned}$$

Proof

From Proposition 1, the feasible region \(\mathcal{F}(\mathcal{D})\) is a compact, convex polytope. As such, it can be expressed as the intersection of a finite number of half-spaces (Weyl-Minkowski theorem):

$$\begin{aligned} \mathcal{F}(\mathcal{D}) = \cap _{i=1}^k \{ \varvec{\mu } \in {\mathbb {R}}^m: \varvec{x}_i^T \varvec{\mu } \le b_i \}. \end{aligned}$$

Consider the indices of those half-spaces whose boundary contains \(\varvec{\mu }_{1},\ldots ,\varvec{\mu }_{s}\):

$$\begin{aligned} \mathcal{I} = \{ i: \varvec{x}_i^T \varvec{\mu }_j = b_i, j=1,\ldots , s\} \end{aligned}$$

This set, in fact, characterizes the face F:

$$\begin{aligned} F = \mathcal{F}(\mathcal{D}) \cap \left( \cap _{i \in \mathcal{I}} \{ \varvec{\mu } \in {\mathbb {R}}^m: \varvec{x}_i^T \varvec{\mu } = b_i \}\right) . \end{aligned}$$

Use \(\mathcal{I}\) to define the closed, non-empty, convex set:

$$\begin{aligned} C:= \cap _{i \in \mathcal{I}}\{ \varvec{\mu } \in {\mathbb {R}}^m: \varvec{x}_i^T \varvec{\mu } \le 0 \}. \end{aligned}$$

Convexity follows since C is the intersection of a finite set of half-spaces. Critically, C does not contain any non-negative vectors, except for \(\varvec{0}\); if it did, we could move \(\varvec{\mu }_1\) in the direction of this non-negative, non-zero \(\varvec{\nu } \in C\) to get a point \(\varvec{\mu }_1 + \delta \varvec{\nu } \in \mathcal{F}(\mathcal{D})\) for small \(\delta >0\) (the constraints with indices in \(\mathcal{I}\) remain satisfied because \(\varvec{x}_i^T\varvec{\nu }\le 0\), while the remaining constraints are slack at \(\varvec{\mu }_1\)) that Pareto dominates \(\varvec{\mu }_1\), contradicting \(\varvec{\mu }_1\) being a vertex of a Pareto optimal face. Consequently, we can find a hyperplane, with normal \(\varvec{z}\), to separate C from the (compact, non-empty, convex) set of convex combinations of standard basis vectors \(\varvec{e}_i\) in \({\mathbb {R}}^m\):

$$\begin{aligned} \varvec{z}^T \varvec{\mu }&\le 0 \qquad \forall \varvec{\mu } \in C \\ \varvec{z}^T \varvec{e}_i&> 0 \qquad i=1,\ldots ,m. \end{aligned}$$

The last inequality says each entry in \(\varvec{z}\) is strictly positive, and hence, \(\varvec{z}\) is strictly positive.

We will use the fact that the vector \(\varvec{z}\) can be expressed as a conic combination of the vectors \(\varvec{x}_i\) for \(i\in \mathcal{I}\):

$$\begin{aligned} \varvec{z} = \sum _{i \in \mathcal{I}} \alpha _i \varvec{x}_i \qquad \alpha _i \ge 0. \end{aligned}$$

This follows from Farkas’ lemma, since otherwise we could find \(\varvec{\beta } \in {\mathbb {R}}^m\) such that

$$\begin{aligned} \varvec{x}_i^T \varvec{\beta }&\le 0 \qquad i \in \mathcal{I}\\ \varvec{z}^T \varvec{\beta }&> 0. \end{aligned}$$

The first inequality says that \(\varvec{\beta }\) lives in C, but then the second inequality contradicts the condition that \(\varvec{z}^T \varvec{\mu } \le 0\) for all \(\varvec{\mu } \in C\).

We are now ready to construct \(\varvec{y}\). Let

$$\begin{aligned} \varvec{y}:= \varvec{z} + \epsilon \sum _{i \in \mathcal{I}} \varvec{x}_i \end{aligned}$$

for \(\epsilon >0\) sufficiently small so that \(\varvec{y}\) is strictly positive. Such an \(\epsilon \) exists, since \(\varvec{z}\) is strictly positive. Further, our characterization of \(\varvec{z}\) above allows us to express \(\varvec{y}\) in the form \(\sum _{i \in \mathcal{I}} \gamma _i \varvec{x}_i\) for some strictly positive \(\gamma _i\). Therefore,

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 = \cdots = \varvec{y}^T \varvec{\mu }_s, \end{aligned}$$

since for each \(j = 1,\ldots ,s\),

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_j = \sum _{i \in \mathcal{I}} \gamma _i \varvec{x}_i^T \varvec{\mu }_j = \sum _{i \in \mathcal{I}} \gamma _i b_i. \end{aligned}$$

It is also the case that, for each \(j = s+1,\ldots ,n\),

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 = \sum _{i \in \mathcal{I}} \gamma _i b_i > \varvec{y}^T \varvec{\mu }_j \end{aligned}$$

since

$$\begin{aligned} \varvec{x}^T_i \varvec{\mu }_1 = b_i > \varvec{x}^T_i \varvec{\mu }_j \end{aligned}$$

for some \(i \in \mathcal{I}\). Otherwise, \(\varvec{\mu }_j\) would also be on F, which contradicts the initial assumption that \(\varvec{\mu }_{1},\dots ,\varvec{\mu }_{s}\) are all the vertices on F. We can conclude that \(\varvec{y}\) has the desired properties. \(\square \)

Now we are ready to prove the propositions:

Proof of Proposition 2

Fix the parameters \(\varepsilon \) and \(\varvec{w}.\) Recall that the epsilon-greedy decision-maker samples an action uniformly at random from \({\mathcal {A}}\) with probability \(\varepsilon \) and, with probability \(1-\varepsilon \), chooses an action that maximizes \(\varvec{w}^T\varvec{\mu }_i\) over \(i\in {\mathcal {A}}.\) Therefore, we can write the probability of action i being chosen as:

$$\begin{aligned} p_i(\varepsilon ,\varvec{w}) = {\left\{ \begin{array}{ll} \dfrac{\varepsilon }{n},&{} i\notin M(\varvec{w}) \\ \dfrac{\varepsilon }{n} +\dfrac{1-\varepsilon }{\vert M(\varvec{w}) \vert }, &{} i\in M(\varvec{w}) \end{array}\right. } \end{aligned}$$
(C5)

where

$$\begin{aligned} M(\varvec{w}) = \left\{ k\in {\mathcal {A}}\,\vert \,k = \arg \max \limits _{i\in {\mathcal {A}}}\varvec{w}^T\varvec{\mu }_{i} \right\} . \end{aligned}$$

Our epsilon-greedy decision-maker with the given parameters is characterized by:

$$\begin{aligned} \rho (\varepsilon , \varvec{w})&= \sum _{i=1}^n p_i(\varepsilon , \varvec{w})\varvec{\mu }_i = \sum \limits _{i\in M(\varvec{w}) }\left( \dfrac{\varepsilon }{n} +\dfrac{1-\varepsilon }{\vert M(\varvec{w})\vert }\right) \varvec{\mu }_i + \sum \limits _{i\notin M(\varvec{w})} \dfrac{\varepsilon }{n}\varvec{\mu }_i \\&=\dfrac{\varepsilon }{n}\left( \sum \limits _{i=1}^{n}\varvec{\mu }_i\right) + \dfrac{(1-\varepsilon )}{\vert M(\varvec{w})\vert }\sum \limits _{i\in M(\varvec{w})}\varvec{\mu }_i = \varepsilon \bar{\varvec{\mu }} + (1-\varepsilon )\overline{M(\varvec{w})} \end{aligned}$$

which traces out the line segment connecting \(\bar{\varvec{\mu }} = \dfrac{1}{n}\left( \sum \limits _{i=1}^{n}\varvec{\mu }_{i}\right) \), the centroid of the vertices of \(\Omega \), and \(\overline{M(\varvec{w})}\), the centroid of the vertices with indices in \(M(\varvec{w})\), as \(\varepsilon \) ranges over [0, 1].

First note that \(M(\varvec{w})\) must be the indices of vertices of some Pareto optimal face F. To see this, suppose \(\varvec{\mu }_{1}, \varvec{\mu }_{2},\dots \varvec{\mu }_{s}\) are the vertices with indices in \(M(\varvec{w})\), but not the vertices of any Pareto optimal face. Then by nature of not being the vertices of a Pareto optimal face, there must exist a convex combination \(\varvec{\mu }\) of these vertices that is strictly dominated by a point \(\varvec{\mu }''\) on the Pareto front:

$$\begin{aligned} \varvec{\mu } < \varvec{\mu }''. \end{aligned}$$

Together with \(\varvec{w} \ge 0\), \(\varvec{w} \ne 0\), \(\varvec{w}^T\varvec{\mu }_{1} = \cdots = \varvec{w}^T \varvec{\mu }_{s}\), and \(\varvec{\mu }\) being a convex combination of \(\varvec{\mu }_{1},\ldots ,\varvec{\mu }_{s}\), this implies that

$$\begin{aligned} \varvec{w}^T \varvec{\mu } = \varvec{w}^T \varvec{\mu }_{1} < \varvec{w}^T \varvec{\mu }''. \end{aligned}$$

Meanwhile, \(\varvec{\mu }''\) would be itself a convex combination of vertices \(\varvec{\mu }_{n_1},\ldots ,\varvec{\mu }_{n_k}\) on the Pareto front:

$$\begin{aligned} \sum _{i=1}^{k} \alpha _i \varvec{\mu }_{n_i} =\varvec{\mu }'', \end{aligned}$$

and so,

$$\begin{aligned} \varvec{w}^T \varvec{\mu }_{1} < \varvec{w}^T \varvec{\mu }'' = \sum _{i=1}^{k} \alpha _i \varvec{w}^T \varvec{\mu }_{n_i} \implies \sum _{i=1}^{k} \alpha _i (\varvec{w}^T\varvec{\mu }_{n_i}- \varvec{w}^T\varvec{\mu }_{1}) > 0. \end{aligned}$$

Since \(\alpha _i\ge 0,\) it follows that for at least one index i, one has \(\varvec{w}^T\varvec{\mu }_{n_i} > \varvec{w}^T\varvec{\mu }_{1},\) which contradicts \(1 \in M(\varvec{w})\). What this shows is that \(M(\varvec{w})\) must be the indices of vertices of some Pareto optimal face F.

Now let F be a Pareto optimal face of \({\mathcal {P}}\) with vertices \(\varvec{\mu }_{1},\dots ,\varvec{\mu }_{s}\). Our goal is to construct \(\varvec{w}\) so that the set \(M(\varvec{w})\) is exactly the indices \(\{1,\dots ,s\}\). From Lemma 3, we can find strictly positive \(\varvec{y}\) such that

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 = \cdots = \varvec{y}^T \varvec{\mu }_s \end{aligned}$$

and for each \(j = s+1,\ldots ,n\),

$$\begin{aligned} \varvec{y}^T \varvec{\mu }_1 > \varvec{y}^T \varvec{\mu }_j. \end{aligned}$$

Rescaling \(\varvec{y}\) by a positive number does not change the equalities and inequalities above. Therefore, letting \(\varvec{w} = \varvec{y} / \Vert \varvec{y} \Vert _1\) gives us \(M(\varvec{w}) = \{1,\ldots ,s\}\), as desired.

We now have that there is some \(\varvec{w} \ge 0\) with \(\Vert \varvec{w} \Vert _1 = 1\) such that \(M(\varvec{w}) = \{1,\ldots ,s\}\) if and only if \(\varvec{\mu }_1,\ldots , \varvec{\mu }_s\) are vertices of a Pareto optimal face. This completes the proof, since now we know that the feasible region of epsilon-greedy decision makers is characterized by line segments between the centroid of the vertices of the feasible region and the centroid of the vertices of each Pareto optimal face. Note also that any Pareto optimal vertex is itself a Pareto optimal face due to Assumption 1. \(\square \)
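
The characterization in Proposition 2 is easy to explore numerically. The sketch below (Python; the task and the parameter grid are hypothetical) computes \(\rho (\varepsilon , \varvec{w}) = \varepsilon \bar{\varvec{\mu }} + (1-\varepsilon )\overline{M(\varvec{w})}\) directly and sweeps \(\varepsilon \) and \(\varvec{w}\) to trace out the line segments that make up the feasible region of epsilon-greedy decision-makers.

```python
import numpy as np

# Hypothetical two-objective AAC task (reward, harm); rows are the mean outcomes mu_i.
mu = np.array([[0.0, 0.0],
               [1.0, -1.0],
               [0.6, -0.3],
               [0.2, -0.1]])
mu_bar = mu.mean(axis=0)  # centroid of all vertices

def rho_eps_greedy(eps, w, tol=1e-9):
    """Long-run average outcome of an epsilon-greedy decision-maker (Proposition 2)."""
    scores = mu @ w
    M = np.flatnonzero(scores >= scores.max() - tol)  # maximizing actions M(w)
    M_bar = mu[M].mean(axis=0)                        # centroid of the maximizing face
    return eps * mu_bar + (1 - eps) * M_bar

# Sweep epsilon and w = (a, 1 - a) to trace the segments of the feasible region.
points = np.array([rho_eps_greedy(eps, np.array([a, 1.0 - a]))
                   for eps in np.linspace(0.0, 1.0, 21)
                   for a in np.linspace(0.0, 1.0, 101)])
print(points.shape)                               # (2121, 2) candidate mean outcomes
print(points.min(axis=0), points.max(axis=0))
```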

Proof of Proposition 3

Consider the space

$$\begin{aligned} \mathcal{U}:= \{ (\tau ,\varvec{u}) \in {\mathbb {R}}^m: (\tau ,\varvec{u}) \ge 0, \Vert \varvec{u} \Vert _1 \le 1 \}. \end{aligned}$$

Parameters \(\tau ,\varvec{w}\) for a softmax decision-maker can then be specified by selecting any \((\tau ,\varvec{u})\) from \(\mathcal{U}\) and then letting \(\varvec{w}\) have \(\varvec{u}\) as its first \(m-1\) entries and \(1-\Vert \varvec{u} \Vert _1\) as its remaining entry. The proof will be divided into two parts. In the first part, we show that if \(\Omega \) is not degenerate and has a non-empty interior, then \(\rho (\tau , \varvec{w})\) maps an interior point of \(\mathcal U\) to an interior point of \(\Omega \). This will mean that any boundary points of the feasible region for softmax decision-makers will be generated from either the boundary of \(\mathcal U\) (i.e. when one of the \(w_j\) is zero, as stated in the proposition) or from taking the limit as \(\tau \rightarrow \infty \). In the second part, we will prove that any point on the Pareto front can be approximated by a sequence of interior points of \(\mathcal{U}\) in the limit \(\tau \rightarrow \infty .\)

For the first part, we assume that \(\Omega \) has a non-empty interior. Fix \((\tau , \varvec{u})\), an interior point of \(\mathcal U\), define \(\varvec{w}\) accordingly, and do a linear change of variables \(\tau \varvec{w} = \varvec{x}\) (only for this part of the proof):

$$\begin{aligned} P_{\text {softmax}}(i,\tau , \varvec{w}){:=}p_i(\varvec{x}) = \dfrac{e^{\varvec{x}^T\varvec{\mu }_i}}{\sum \limits _{k=1}^n e^{\varvec{x}^T\varvec{\mu }_k}} = \dfrac{f_i(\varvec{x})}{G(\varvec{x})} \end{aligned}$$
(C6)

where \(G(\varvec{x}) = \sum \limits _{k=1}^n f_k(\varvec{x}).\) Note that this change of variables, from the interior of \(\mathcal U\) to \((0,\infty )^m\), is a local homeomorphism, since it has a non-vanishing Jacobian:

$$\begin{aligned} \det \bigg \vert \dfrac{\partial (\tau ,\varvec{u})}{\partial \varvec{x}}\bigg \vert = \dfrac{1}{\tau ^{m-1}(1-\Vert \varvec{u} \Vert _1)} > 0. \end{aligned}$$

With this setup, consider the mapping from \(\varvec{x}\) to the long-term average expected rewards for the decision-maker with parameter \(\varvec{x}\):

$$\begin{aligned} \varvec{F} = \left( F_1, F_2,\dots ,F_m \right) ^T:{\mathbb {R}}^m\longrightarrow {\mathbb {R}}^m,\quad \text {where}\quad F_j(\varvec{x}) = \sum \limits _{k=1}^np_k(\varvec{x})\mu _k^{(j)} \end{aligned}$$

Now for the actual proof, we first show that the Jacobian of \(\varvec{F}\) is positive semi-definite for \(\varvec{x}>0\). In fact,

$$\begin{aligned} \dfrac{\partial p_i}{\partial x_j} = \dfrac{1}{G}\dfrac{\partial f_i}{\partial x_j} - \dfrac{f_i}{G}\sum _{k=1}^n\dfrac{1}{G}\dfrac{\partial f_k}{\partial x_j} = \dfrac{f_i}{G}\left( \mu _{i}^{(j)} - \sum \limits _{k=1}^{n}\dfrac{f_k}{G}\mu _{k}^{(j)}\right) = p_i\left( \mu _{i}^{(j)} - \sum \limits _{k=1}^{n}p_k\mu _{k}^{(j)}\right) \end{aligned}$$

and therefore:

$$\begin{aligned} \dfrac{\partial F_i}{\partial x_j} = \sum \limits _{k=1}^{n} \dfrac{\partial p_k}{\partial x_j}\mu _{k}^{(i)} = \sum \limits _{k=1}^{n}p_k\mu _{k}^{(i)}\mu _{k}^{(j)} - \sum \limits _{k=1}^{n}p_k\mu _{k}^{(i)}\sum \limits _{k=1}^{n}p_k\mu _{k}^{(j)}. \end{aligned}$$
(C7)

For given \(\varvec{x} > 0,\) \(p_1(\varvec{x}), \dots , p_n(\varvec{x})\) defines a discrete probability distribution on the vertices of \(\Omega .\) Namely, there exists a random vector \(\varvec{X} = (X_1,X_2,\dots , X_m)^T\) which takes the value \(\varvec{\mu }_i\) with probability \(p_i(\varvec{x})\) for \(i = 1,2,\dots , n.\) That is:

$$\begin{aligned} {\mathbb {P}}\left( \varvec{X} = \begin{bmatrix}\mu _{i}^{(1)} \\ \mu _{i}^{(2)} \\ \vdots \\ \mu _{i}^{(m)} \end{bmatrix}\right) = p_i. \end{aligned}$$

The component \(X_i\) then takes the value \(\mu _{k}^{(i)}\) with probability \(p_k\) and therefore:

$$\begin{aligned} {\mathbb {E}}(X_i) = \sum \limits _{k=1}^{n}p_k\mu _{k}^{(i)} \end{aligned}$$

Furthermore,

$$\begin{aligned} {\mathbb {E}}(X_iX_j)&= \sum \limits _{k=1}^{n}\sum \limits _{s=1}^{n}{\mathbb {P}}\left( X_i = \mu _{s}^{(i)}\cap X_j=\mu _{k}^{(j)}\right) \mu _{s}^{(i)}\mu _{k}^{(j)} \\&= \sum \limits _{k=1}^{n}\sum \limits _{s=1}^{n}\delta _{sk}p_k \mu _{s}^{(i)}\mu _{k}^{(j)} = \sum \limits _{k=1}^{n}p_k\mu _{k}^{(i)}\mu _{k}^{(j)} \end{aligned}$$

Putting everything together, we obtain:

$$\begin{aligned} \dfrac{\partial F_i}{\partial x_j} = {\mathbb {E}}(X_iX_j) - {\mathbb {E}}(X_i){\mathbb {E}}(X_j) = \text {Cov}(X_i, X_j) \end{aligned}$$
(C8)

This means that the Jacobian matrix is:

$$\begin{aligned} J&= \dfrac{\partial (F_1,F_2,\dots ,F_m)}{\partial (x_1,x_2,\dots x_m)} \nonumber \\&= \begin{bmatrix} \text {Cov}(X_1, X_1) &{} \text {Cov}(X_1, X_2) &{} \ldots &{} \text {Cov}(X_1, X_m) \\ \text {Cov}(X_2, X_1) &{} \text {Cov}(X_2, X_2) &{} \ldots &{} \text {Cov}(X_2, X_m) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \text {Cov}(X_m, X_1) &{} \text {Cov}(X_m, X_2) &{} \ldots &{} \text {Cov}(X_m, X_m) \\ \end{bmatrix} \nonumber \\&= \text {Cov}(\varvec{X}, \varvec{X}) = \text {Var}(\varvec{X}) \end{aligned}$$
(C9)

It’s well-known that the covariance matrix is positive semi-definite but we claim that this is positive definite. If it were not positive-definite, then there exists a non-zero vector \(\varvec{a}\in {\mathbb {R}}^m\) such that \(\text {Var}(\varvec{X})\varvec{a} = \varvec{0}.\) Therefore,

$$\begin{aligned} 0 = \varvec{a}^T\text {Var}(\varvec{X})\varvec{a} = \sum _{i,j=1}^ma_i\text {Cov}(X_i,X_j)a_j = \text {Var}(\varvec{a}^T\varvec{X}) \end{aligned}$$

This means with probability 1, the vector \(\varvec{X}\) lives on some hyperplane \(\varvec{a}^T\varvec{X} = b\), which would be impossible unless \(\Omega \) has zero volume in \({\mathbb {R}}^m,\) contradicting our assumption that \(\Omega \) has non-zero volume. To see this, a simple computation of the variance yields:

$$\begin{aligned} 0 = \text {Var}(\varvec{a}^T\varvec{X}) = {\mathbb {E}}\left[ (\varvec{a}^T\varvec{X})^2\right] - {\mathbb {E}}[\varvec{a}^T\varvec{X}]^2 = \sum \limits _{1\le i<j\le n}p_ip_j(\varvec{a}^T\varvec{\mu }_{i} - \varvec{a}^T\varvec{\mu }_{j})^2. \end{aligned}$$

Since \(p_i >0\), this means that:

$$\begin{aligned} \varvec{a}^T\varvec{\mu }_{1} = \varvec{a}^T\varvec{\mu }_{2} = \cdots =\varvec{a}^T\varvec{\mu }_{n} = b\in {\mathbb {R}}. \end{aligned}$$

Therefore, the Jacobian is non-vanishing on the interior \((0,\infty )^m\) of the domain, and our map \(\varvec{F}\) is locally a homeomorphism there. In addition, we have already shown that changing variables from \((\tau ,\varvec{u})\) in the interior of \(\mathcal U\) to \((0,\infty )^m\) is a local homeomorphism. Therefore, the composition of these two local homeomorphisms is a local homeomorphism. We can conclude that \(\rho (\tau ,\varvec{w})\) maps interior points of \(\mathcal U\) to interior points of \(\Omega \).

For the second part of the proof, we pick a point \(\varvec{\mu }\) on the Pareto front. It will belong to a Pareto optimal face, F, which is generated by a finite number of vertices \(\varvec{\mu }_1, \dots , \varvec{\mu }_s\in F\subseteq {\mathcal {P}}.\) That is, we have:

$$\begin{aligned} \varvec{\mu } = \lambda _1 \varvec{\mu }_1 + \cdots + \lambda _s \varvec{\mu }_s,\,\, \lambda _i\ge 0,\,\, \sum \limits _{i=1}^s\lambda _i = 1. \end{aligned}$$

We can assume that the vectors \(\varvec{\mu }_1, \dots , \varvec{\mu }_s\in F\) are linearly independent (Assumption 1) and are the maximal set of vertices that generates F. Recall that:

$$\begin{aligned} P(i,\tau , \varvec{w}) = p_i(\tau , \varvec{w}) = \dfrac{e^{\tau \varvec{w}^T\varvec{\mu }_i}}{\sum \limits _{j=1}^{n}e^{\tau \varvec{w}^T\varvec{\mu }_j}} =\dfrac{e^{\tau \varvec{w}^T(\varvec{\mu }_i-\varvec{\mu }_1)}}{\sum \limits _{j=1}^{n}e^{\tau \varvec{w}^T(\varvec{\mu }_j-\varvec{\mu }_1)}} \end{aligned}$$

We will construct a sequence of \((\tau ,\varvec{w})\) along which \(p_i(\tau ,\varvec{w})\) converges to \(\lambda _i\) for \(i=1,\ldots ,s\) and to zero for \(i=s+1,\ldots ,n\). The corresponding sequence of points in \(\Omega \) will have

$$\begin{aligned} \sum _{i=1}^n p_i(\tau ,\varvec{w}) \varvec{\mu }_{i} \rightarrow \sum _i \lambda _i \varvec{\mu }_{i} = \varvec{\mu }, \end{aligned}$$

which would complete the second part of our proof.

Without loss of generality, assume \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _k > 0\) and \(\lambda _i=0\) for \(k < i \le s\). Consider \(\alpha > 0\). If \(s=1\), let \(\varvec{z}\) be any non-negative, non-zero vector. If \(s>1\), let us find non-negative, non-zero \(\varvec{z}:= \varvec{z}(\alpha ) \in {\mathbb {R}}^m\) that solves

$$\begin{aligned} \varvec{A} \varvec{z} = \varvec{b} \end{aligned}$$

where

$$\begin{aligned} \varvec{A}&= \begin{bmatrix} \varvec{\mu }_1^T - \varvec{\mu }_2^T \\ \vdots \\ \varvec{\mu }_1^T - \varvec{\mu }_s^T \end{bmatrix}\\ \varvec{b}&:= \begin{bmatrix} \ln \frac{\lambda _1}{\lambda _2} \\ \vdots \\ \ln \frac{\lambda _1}{\lambda _k} \\ \alpha \\ \vdots \\ \alpha \end{bmatrix}. \end{aligned}$$

To show that such a solution \(\varvec{z}\) exists, we will prove that the second assertion in Farkas’ lemma is violated. Let us pick some \(\varvec{y}\in {\mathbb {R}}^{s-1}\) such that \(\varvec{b}^T\varvec{y} < 0.\) This implies \(\varvec{y}\ne \varvec{0}\) and if we let \(\varvec{y}^T = (y_2,y_3,\dots y_s),\) then:

$$\begin{aligned} \varvec{A}^T\varvec{y}&= y_2(\varvec{\mu }_1 - \varvec{\mu }_2) + \cdots +y_s(\varvec{\mu }_1 - \varvec{\mu }_s) \\&= \varvec{\mu }_1(y_2+\cdots + y_s) - \sum _{i=2}^s y_i\varvec{\mu }_i \\&=\left( \sum _{i=2}^s y_i\right) \left( \varvec{\mu }_1 - \sum _{i=2}^s\beta _i\varvec{\mu }_i\right) . \end{aligned}$$

If \(y_2+y_3+\cdots +y_s\ne 0\), then let

$$\begin{aligned} \beta _i = \dfrac{y_i}{y_2+y_3+\cdots +y_s} \text { with } \sum \limits _{i=2}^s\beta _i = 1. \end{aligned}$$

We know from Lemma 1 that neither of \(\varvec{\mu }_{1}\) and \(\sum \limits _{i=2}^s\beta _i\varvec{\mu }_i\) can Pareto dominate the other. This means that no matter the sign of \(y_2+y_3+\cdots +y_s\ne 0,\) \(\varvec{A}^T\varvec{y}\) will have a negative component.

If \(y_2+y_3+\cdots +y_s = 0,\) suppose that:

$$\begin{aligned} \varvec{A}^T\varvec{y} =-\sum \limits _{i=2}^sy_i\varvec{\mu }_{i}\ge 0 \end{aligned}$$

This implies \( \textstyle \sum \limits _{i=1}^sy_i\varvec{\mu }_{i}\le \varvec{\mu }_{1}\) with \(y_1=1\), which makes \(y_1+y_2+\cdots +y_s = 1.\) Here, we cannot have equality since \(\varvec{\mu }_{1},\dots ,\varvec{\mu }_{s}\) are linearly independent, so we get \( \sum \limits _{i=1}^sy_i\varvec{\mu }_{i} \prec \varvec{\mu }_{1}\). But this contradicts Lemma 1, and we can conclude that a non-negative solution \(\varvec{z}\) does exist for the equation \(\varvec{A} \varvec{z} = \varvec{b}\).

Take the solution \(\varvec{z}\) from above as well as the strictly positive \(\varvec{y}\) invoked from Lemma 3 and define

$$\begin{aligned} \tau&:= \Vert \varvec{z} + \alpha ^2 \varvec{y} \Vert _1 \\ \varvec{w}&:= \frac{\varvec{z} + \alpha ^2 \varvec{y}}{\tau }. \end{aligned}$$

Because of the way that \(\varvec{z}\) and \(\varvec{y}\) were constructed, we have that

$$\begin{aligned} \tau \varvec{w}^T (\varvec{\mu }_i - \varvec{\mu }_1)&= \ln \frac{\lambda _i}{\lambda _1}{} & {} i=1,\ldots ,k \\ \tau \varvec{w}^T (\varvec{\mu }_i - \varvec{\mu }_1)&= -\alpha{} & {} i=k+1,\ldots ,s \\ \tau \varvec{w}^T (\varvec{\mu }_i - \varvec{\mu }_1)&= \varvec{z}^T (\varvec{\mu }_i - \varvec{\mu }_1) - \alpha ^2 \varvec{y}^T (\varvec{\mu }_1 - \varvec{\mu }_i){} & {} i=s+1,\ldots ,n. \end{aligned}$$

Note that for \(i=s+1,\ldots ,n\),

$$\begin{aligned} \varvec{y}^T (\varvec{\mu }_1 - \varvec{\mu }_i) > 0 \end{aligned}$$

and \(\varvec{z}\) scales at most linearly with \(\alpha \) so that \(\alpha ^2 \varvec{y}^T (\varvec{\mu }_1 - \varvec{\mu }_i)\) becomes the dominant term in \(\varvec{z}^T (\varvec{\mu }_1 - \varvec{\mu }_i) + \alpha ^2 \varvec{y}^T (\varvec{\mu }_1 - \varvec{\mu }_i) \) for large enough \(\alpha \). To see this linear scaling, note that \(\varvec{A}\) is of full row rank and our equation is consistent. Using the Moore-Penrose pseudo-inverse \(\varvec{A}^+\) of the matrix A, all solutions to our equation can be expressed as:

$$\begin{aligned} \varvec{A}^{+}\varvec{b} + (\varvec{I}_m -\varvec{A}^+\varvec{A} )\varvec{v}, \end{aligned}$$

where \(\varvec{v}\) is an arbitrary vector in \({\mathbb {R}}^m\). Observe that \(\varvec{A}^+\varvec{b}\) scales linearly with \(\alpha \) and that, since \(\varvec{I}_m -\varvec{A}^+\varvec{A}\) is fixed, we can choose \(\varvec{v}\) to scale linearly with \(\alpha \) such that \(\varvec{A}^{+}\varvec{b} + (\varvec{I}_m -\varvec{A}^+\varvec{A} )\varvec{v}\) is non-negative, non-zero and scales linearly with \(\alpha \).

Since \(\alpha >0\) was arbitrary and \(\sum _{i=1}^k \lambda _i = 1\), we can let \(\alpha \rightarrow \infty \), resulting in

$$\begin{aligned} e^{\tau \varvec{w}^T (\varvec{\mu }_i - \varvec{\mu }_1)}&\rightarrow \frac{\lambda _i}{\lambda _1}{} & {} i=1,\ldots ,k \\ e^{\tau \varvec{w}^T (\varvec{\mu }_i - \varvec{\mu }_1)}&\rightarrow 0{} & {} i=k+1,\ldots , n. \end{aligned}$$

and therefore

$$\begin{aligned} p_i(\tau , \varvec{w})&\rightarrow \lambda _i{} & {} i=1,\ldots ,k \\ p_i(\tau , \varvec{w})&\rightarrow 0{} & {} i=k+1,\ldots ,n \end{aligned}$$

This was the desired result. \(\square \)
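
As a complement to the proof, the short sketch below (Python, with a hypothetical two-objective task) evaluates the softmax decision-maker \(p_i(\tau ,\varvec{w}) \propto e^{\tau \varvec{w}^T\varvec{\mu }_i}\) and shows how its mean outcome moves from near the centroid of the vertices toward the Pareto front as \(\tau \) grows, consistent with Proposition 3.

```python
import numpy as np

# Hypothetical mean outcomes mu_i (rows) for a two-objective task.
mu = np.array([[0.0, 0.0],
               [1.0, -1.0],
               [0.6, -0.3],
               [0.2, -0.1]])

def rho_softmax(tau, w):
    """Long-run average outcome of a scalarized softmax decision-maker."""
    logits = tau * (mu @ w)
    logits -= logits.max()                      # numerical stabilization
    p = np.exp(logits) / np.exp(logits).sum()   # p_i(tau, w)
    return p @ mu

w = np.array([0.5, 0.5])
for tau in (0.1, 1.0, 10.0, 100.0):
    # small tau: near the centroid of the vertices; large tau: near a Pareto optimal vertex
    print(tau, rho_softmax(tau, w))
```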

Proof of Proposition 4

Recall Assumption 1 requires that only one vertex of \(\mathcal{F}(\mathcal D)\) maximizes an individual dimension. Therefore, let \(l_j = \arg \max \limits _{1\le k\le n}\mu ^{(j)}_k\) so that \(\mu ^{(j)}_{l_j} = \max \limits _{1\le k\le n}\mu ^{(j)}_k\), for \(j=1,2,\dots m.\) Notice that as \(\tau \rightarrow \infty \):

$$\begin{aligned} p_{i}^{(j)}(\tau ) = \dfrac{e^{\tau \mu _{i}^{(j)}}}{\sum \limits _{k=1}^{n}e^{\tau \mu _{k}^{(j)}}} = \dfrac{1}{\sum \limits _{k=1}^{n}e^{\tau (\mu _{k}^{(j)}-\mu _{i}^{(j)})}}\longrightarrow 0 \end{aligned}$$

unless \(i = l_j,\) in which case we obviously have \(p_{i}^{(j)}(\tau )\longrightarrow 1.\) This means that:

$$\begin{aligned} \lim \limits _{\tau \rightarrow \infty }P(i,\tau ,\varvec{w}) = \sum \limits _{j=1}^{m}w_j\lim \limits _{\tau \rightarrow \infty }p_{i}^{(j)}(\tau ) = \sum \limits _{j=1}^{m}w_j \delta _{il_j} \end{aligned}$$
(C10)

Consequently:

$$\begin{aligned} \lim \limits _{\tau \rightarrow \infty }\sum \limits _{i=1}^{n}P(i,\tau ,\varvec{w})\mu _{i}^{(k)} = \sum \limits _{i=1}^{n}\sum \limits _{j=1}^{m}w_j\delta _{il_j}\mu _{i}^{(k)} = \sum \limits _{j=1}^{m}w_j\sum \limits _{i=1}^{n}\delta _{il_j}\mu _{i}^{(k)} = \sum \limits _{j=1}^{m}w_j\mu _{l_j}^{(k)} \end{aligned}$$

which finally implies:

$$\begin{aligned} \lim \limits _{\tau \rightarrow \infty }\rho (\tau , \varvec{w}) = w_1\varvec{\mu }_{l_1} + w_2\varvec{\mu }_{l_2}+\cdots +w_m\varvec{\mu }_{l_m}. \end{aligned}$$

What we have concluded here is that in the limit \(\tau \rightarrow \infty \), \(\rho (\tau , \varvec{w})\) is a convex combination of the points \(\varvec{\mu }_{l_j}\) for \(j=1,2,\dots m.\) But if we recall the fact that \(\varvec{\mu }_{l_j}\) is exactly the point that maximizes objective j, then it follows that the image of \(\rho (\tau , \varvec{w})\) in the limit \(\tau \rightarrow \infty \), as \(\varvec{w}\) varies, is precisely the hyperplane \(\Delta .\)

Furthermore, the selective attention decision-maker with parameters \(\tau , \varvec{w}\) takes a particularly nice form:

$$\begin{aligned} \rho (\tau , \varvec{w}) = \Omega \varvec{P}^T(\tau )\varvec{w}\in {\mathbb {R}}^m \end{aligned}$$

where

$$\begin{aligned} \varvec{P}(\tau ) = \begin{bmatrix} p_{1}^{(1)}(\tau ) &{} p_{2}^{(1)}(\tau ) &{}\dots &{} p_{n}^{(1)}(\tau ) \\ p_{1}^{(2)}(\tau ) &{} p_{2}^{(2)}(\tau ) &{}\dots &{} p_{n}^{(2)}(\tau ) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ p_{1}^{(m)}(\tau ) &{} p_{2}^{(m)}(\tau ) &{}\dots &{} p_{n}^{(m)}(\tau ) \\ \end{bmatrix}\in (0,1)^{m\times n}, \text { with } p_{i}^{(j)}(\tau ) = \dfrac{e^{\tau \mu _{i}^{(j)}}}{\sum \limits _{k=1}^{n}e^{\tau \mu _{k}^{(j)}}} \end{aligned}$$

This means that, for a given \(\tau \), \(\rho (\tau , \varvec{w})\) is linear in \(\varvec{w}\) and therefore takes the boundary set where \(w_j = 0\) for some \(j\in \{1,2,\dots , m\}\) to the boundary of its image. In other words, for each \(\tau \) the image is a subset of a hyperplane in \({\mathbb {R}}^m\) bounded by the \(w_j= 0\) curves. By taking the union of such regions over \(\tau \), we recover the feasible region of the selective attention decision-makers. \(\square \)
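
The matrix form \(\rho (\tau , \varvec{w}) = \Omega \varvec{P}^T(\tau )\varvec{w}\) is straightforward to evaluate. Below is a minimal sketch (Python; the task \(\Omega \) is hypothetical) that builds \(\varvec{P}(\tau )\) and confirms that, as \(\tau \rightarrow \infty \), \(\rho (\tau , \varvec{w})\) approaches the convex combination \(\sum _j w_j\varvec{\mu }_{l_j}\) of the dimension-maximizing vertices.

```python
import numpy as np

# Hypothetical Omega: columns are the mean outcome vectors mu_i (m = 2 objectives, n = 4 actions).
Omega = np.array([[0.0, 1.0, 0.6, 0.2],      # objective 1 for each action
                  [0.0, -1.0, -0.3, -0.1]])  # objective 2 for each action

def rho_selective_attention(tau, w):
    """rho(tau, w) = Omega P(tau)^T w, as in the proof of Proposition 4."""
    logits = tau * Omega
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize each row
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)  # row j is the softmax over actions for objective j
    return Omega @ (P.T @ w)

w = np.array([0.5, 0.5])
for tau in (1.0, 10.0, 100.0):
    print(tau, rho_selective_attention(tau, w))
# For large tau this approaches 0.5 * mu_{l_1} + 0.5 * mu_{l_2} = 0.5*[1, -1] + 0.5*[0, 0].
```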

Appendix D: Choosing a Scale for Outcomes

When we analyze AAC tasks, we must assign a numerical value to each outcome. How to do this is not always obvious, especially for harmful outcomes. For example, in the spider gambling task (Pittig et al. 2014), one draw of deck C might give \(\$25\) as a reward and a picture of a spider as harm. The natural question is then: what numerical value should we assign to the picture of a spider?

It turns out that this choice of numerical value does not matter for some of the families of decision-makers. Specifically, we say a family is scale-invariant if, whenever we re-scale the outcome dimensions so that the reward structure \(\Omega \) becomes \(A\Omega \) for some diagonal matrix A with positive diagonal entries, the family’s new feasible region is simply the original feasible region transformed by A. Since the feasible region for the set \(\mathcal{D}\) of all decision-makers is just the convex hull of the \(\varvec{\mu }_i\), the set of all decision-makers \(\mathcal{D}\) is clearly scale-invariant.

Both the family of softmax decision-makers and the family of epsilon-greedy decision-makers are also scale-invariant. To see this for the former family, suppose that \(\alpha _i\) is the ith diagonal entry of A. Then when moving from \(\Omega \) to \(A \Omega \), we can adjust the \((\tau , {\varvec{w}})\) parameters by introducing \({\varvec{w}}' = (w_1',w_2',\dots , w_m')^T\) and

$$\begin{aligned} {\left\{ \begin{array}{ll} \tau ' =\tau \sum \limits _{j=1}^m \dfrac{w_j}{\alpha _j} \\ w_k' = \dfrac{\dfrac{w_k}{\alpha _k}}{\sum \limits _{j=1}^m \dfrac{w_j}{\alpha _j}} \end{array}\right. } \end{aligned}$$

It can be observed that:

$$\begin{aligned} \tau '{\varvec{w}}'^T A\varvec{\mu }_i = \tau {\varvec{w}}^T\varvec{\mu }_i \quad \text {and}\quad \Vert {\varvec{w}}'\Vert _1 = 1 \end{aligned}$$

which ensures the probability of choosing action i stays the same after scaling. This shows that for every scalarized softmax decision-maker D under \(\Omega \) whose long-term average expected rewards are \(\rho (D)\), there exists a scalarized softmax decision-maker \(D'\) under \(A\Omega \) whose long-term average expected rewards are \(\rho (D')=A \rho (D)\). Since A is invertible, we can conclude that the family of scalarized softmax decision-makers is scale-invariant. Furthermore, the same re-scaling of \(\varvec{w}\) to \(\varvec{w}'\) also shows that the epsilon-greedy decision-maker is scale-invariant:

$$\begin{aligned} \varvec{w}'^T A\varvec{\mu }_i = \sum \limits _{j=1}^m w_j'\alpha _j\mu _{i}^{(j)} = \dfrac{\sum \limits _{j=1}^m w_j\mu _{i}^{(j)}}{\sum \limits _{j=1}^m\dfrac{w_j}{\alpha _j}} =\left( \sum \limits _{j=1}^m\dfrac{w_j}{\alpha _j}\right) ^{-1} \varvec{w}^T \varvec{\mu }_i. \end{aligned}$$

This means that the actions that maximize the left-hand side are exactly those that maximize \(\varvec{w}^T \varvec{\mu }_i\), so the probability distribution of the epsilon-greedy decision-maker remains unaffected.
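
The reparameterization above can also be verified numerically. The following minimal sketch (Python; the task and the diagonal scaling are hypothetical) checks that the softmax choice probabilities under \(\Omega \) with \((\tau ,\varvec{w})\) match those under \(A\Omega \) with \((\tau ',\varvec{w}')\).

```python
import numpy as np

mu = np.array([[0.0, 0.0], [1.0, -1.0], [0.6, -0.3]])  # hypothetical mean outcomes mu_i (rows)
alpha = np.array([2.0, 0.5])                           # diagonal entries of A
A_mu = mu * alpha                                      # rows of the rescaled structure A * Omega

def softmax_probs(tau, w, outcomes):
    logits = tau * (outcomes @ w)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

tau, w = 3.0, np.array([0.7, 0.3])

# Reparameterization from Appendix D.
s = np.sum(w / alpha)
tau_prime = tau * s
w_prime = (w / alpha) / s   # note ||w_prime||_1 = 1

print(softmax_probs(tau, w, mu))                # probabilities under Omega
print(softmax_probs(tau_prime, w_prime, A_mu))  # identical probabilities under A * Omega
```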

Interestingly, it is not clear whether the family of selective attention decision-makers is scale-invariant. Thus, the choice of scale for outcome dimensions may affect the geometry of the feasible region, beyond a linear transformation, for the family of selective attention decision-makers.

Appendix E: Feasible Regions for \(m = 3\) Objectives

In our study of AAC, we mainly focused on the scenario with two outcome dimensions (\(m=2\)). This emphasis is due to the existing literature in AAC, which centers around conflicts between two objectives, like minimizing harm while maximizing rewards. Nevertheless, the insights we gained from characterizing the feasible regions (Propositions 1–4) extend to larger dimensions. In essence, the qualitative differences are limited when we move beyond two dimensions. For instance, the feasible region for a softmax decision-maker remains enclosed by the curves \(w_j = 0\) for \(j=1,2,3\), as well as the Pareto front, even in larger dimensions. This can be visualized in Fig. 10, which outlines the feasible region for AAC tasks with 4 actions and 3 dimensions. The mean outcomes for the 4 actions are, in the first case, (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1), and, in the second case, (0.7, 0.7, 0.7), (1, 0, 0), (0, 1, 0), and (0, 0, 1). The feasible region is outlined by plotting mean outcomes for softmax decision-makers with a range of values for \(\tau \) and \({\textbf{w}}\).

Fig. 10

(Color figure online) Scatter plot of three-dimensional mean outcomes for softmax decision-makers with varying parameters \(\tau \) and \({\textbf{w}}\). Each panel represents a different set of mean outcomes for the four actions. Red dots denote possible mean outcomes associated with four actions. Warmer colors represent larger distance to the origin

Similarly, Fig. 11 outlines the feasible region for the selective attention decision-maker for the same three-dimensional AAC tasks described above. As is the case for the two-dimensional figures, the feasible region for the selective attention decision-maker is characterized by the hyperplane \(\Delta \), connecting the actions that maximize mean outcomes along one of the dimensions, and the curves \(w_j = 0, j=1,2,3.\)

Fig. 11

(Color figure online) Scatter plot of three-dimensional mean outcomes for selective attention decision-makers with varying parameters \(\tau \) and \({\textbf{w}}\). Each panel represents a different set of mean outcomes for the four actions. Red dots denote possible mean outcomes associated with four actions. Warmer colors represent larger distance to the origin

Appendix F: Learning Multi-objective Preferences

In our study, we primarily focused on the process of decision-making rather than learning. While learning is the process that estimates multidimensional rewards for actions, decision-making navigates the trade-offs between objectives.

One possible interaction between decision-making and learning arises when the agent must learn its own preferences for how to make multi-objective trade-offs. Indeed, one’s relative preference for different objectives must arise from somewhere, and at least some portion of those preferences is likely learned. One benefit of learning a preference over multiple objectives is that it might generalize across tasks if those tasks involve similar objectives (e.g. approach/avoidance, money/pleasure/autonomy).

To explore learning multi-objective preferences, suppose the different objectives discussed in this paper are merely precursors to some more ultimate reward, and suppose the agent observes that ultimate reward but does not know ahead of time how that ultimate reward depends on the objectives.

For the scalarization-based methods, if the different objectives are treated as cues for the ultimate reward, and the different cues have association strengths \(\textbf{w}\), then this setting is a very familiar single-objective learning scenario, amenable to many approaches, including temporal difference learning with an assumed linear model (Sutton and Barto 2018). If a softmax or an epsilon-greedy decision-maker begins to use these learned associations \(\textbf{w}\), then it will presumably obtain a better ultimate reward, having learned the relative benefits of the different objectives.
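
As one concrete illustration of this single-objective reduction, the sketch below (Python; the linear "ultimate reward" and all numbers are hypothetical) learns the association strengths \(\textbf{w}\) with a simple delta-rule update, which is what temporal difference learning with a linear model reduces to in a one-step bandit setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each trial yields a multi-objective outcome vector (the "cues")
# and an ultimate scalar reward that is an unknown linear function of that vector.
w_true = np.array([1.0, -2.0])  # true contribution of each objective to the ultimate reward
w_hat = np.zeros(2)             # learned association strengths
lr = 0.05                       # learning rate

for t in range(5000):
    x = rng.uniform(-1.0, 1.0, size=2)      # observed multi-objective outcome
    r = w_true @ x + rng.normal(scale=0.1)  # ultimate reward
    delta = r - w_hat @ x                   # prediction error
    w_hat += lr * delta * x                 # delta-rule / one-step TD update

print("learned association strengths:", w_hat)  # approaches w_true
```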

In the special case of selective attention, which does not scalarize the objectives, it might make sense to use temporal difference learning that adheres to the same focus on a single objective at a time. For instance, the recently introduced Sparse Attention Rescorla-Wagner for Inference (SAR-WI) model (Nishimura and Cochran 2020) is a modified Rescorla-Wagner model designed specifically for settings where only a subset of all possible cues is attended to at any one time, much as selective attention only attends to a single objective at a time. In order to ensure proper convergence, SAR-WI requires simultaneously learning the mean of each cue and the reward and then centering the observations about those means.

Whether a scenario has truly different objectives, or whether objectives are merely cues hinting at a more ultimate objective is a modeling question whose contours might be best understood through this sort of hierarchical lens.

Appendix G: Other Scalarization Techniques

It is important to recognize that there are other ways to scalarize multidimensional outcomes than the linear scalarization used for the softmax and epsilon-greedy decision-makers. Chebyshev scalarization, for instance, is a popular technique in the multi-objective optimization literature (cf. Van Moffaert and Nowé 2014; Moffaert et al. 2013). This scalarization approach uses a nonlinear function to turn the multidimensional \(Q_t\) into a scalar, viz:

$$\begin{aligned} SQ_t(i) = \max _{j=1,2,\ldots m}w_j \left| Q_t^{(j)}(i) - z_j \right| \end{aligned}$$
(G11)

where \(z_j\) is the jth component of what is known as a utopia point in the multi-objective optimization literature. The utopia point is often found by optimizing each dimension independently, and so it is realistically unattainable when there is conflict. Thus, one can expect the utopia point to drive the behavior of the decision-maker heavily, and this additional consideration makes it harder to characterize the resulting feasible regions precisely. In addition, the scalarization function in (G11) is non-differentiable, which poses another challenge to qualitatively analyzing the feasible region of the Chebyshev-scalarized decision-maker.
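
For concreteness, here is a minimal sketch (Python) of the Chebyshev scalarization in (G11), applied to the mean outcomes of the hypothetical task used in Fig. 12; how the scalarized values are turned into choice probabilities is an assumption here (a softmax favoring actions closer to the utopia point), not a prescription from the main text.

```python
import numpy as np

# Mean outcomes for the four actions of the AAC task used in Fig. 12 (rows are actions).
Q = np.array([[0.0, -1.0],
              [-0.1, 0.4],
              [-0.4, 0.9],
              [-1.0, 1.0]])
w = np.array([0.5, 0.5])
z = Q.max(axis=0)  # utopia point: best achievable value on each dimension separately

# Chebyshev scalarization (G11): weighted distance to the utopia point.
SQ = np.max(w * np.abs(Q - z), axis=1)

# Assumed choice rule: softmax that favors smaller Chebyshev distance.
tau = 5.0
p = np.exp(-tau * SQ)
p /= p.sum()
print("scalarized values SQ(i):     ", SQ)
print("assumed choice probabilities:", p)
```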

Fig. 12

(Color figure online) Feasible region of Chebyshev-scalarized decision-maker with the utopia point that maximizes each objective dimension simultaneously

In Fig. 12, we show the feasible region of the Chebyshev decision-maker when the utopia point optimizes each objective dimension independently. The underlying AAC task has four actions with mean outcomes of \((0,-1)\), \((-0.1,0.4)\), \((-0.4,0.9)\), and \((-1,1)\). Each Chebyshev decision-maker is drawn towards the utopia point. As a result, the feasible region of the Chebyshev decision-maker overlaps with the feasible region of the softmax decision-maker.

On the other hand, we also produced figures to see what happens with other choices for the utopia point. In Fig. 13, the feasible regions of the Chebyshev decision-maker are shown when the utopia point is set to each of the possible vertices in turn. One can immediately see that the Chebyshev decision-maker is drawn towards the utopia point.

Fig. 13

(Color figure online) Feasible regions of Chebyshev-scalarized decision-makers with Utopia points that idealize one given action only

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Enkhtaivan, E., Nishimura, J. & Cochran, A. Placing Approach-Avoidance Conflict Within the Framework of Multi-objective Reinforcement Learning. Bull Math Biol 85, 116 (2023). https://doi.org/10.1007/s11538-023-01216-6
