1 Introduction

Optimal control problems under the discounted optimality criterion have been widely studied in recent decades, both from the theoretical point of view, by progressively providing more general and flexible models, and from the point of view of applications, by modeling a large number of problems arising in areas such as mathematical economics, population growth, finance, biology, and the extraction of natural resources, among others. The appeal of the discounted criterion lies in its interpretation as the fair present value of future payoff flows, a common practice in real-world valuation.

Most of the papers in the literature on discounted Markov decision processes (MDPs) consider a constant discount factor. From the technical point of view, this assumption significantly simplifies the solution of the optimal control problem. In many situations, however, it does not fit real applications: the discount factor may vary with time, it may be affected by the current state of the system, or it may even be affected by the controller's actions. This shows the need for new models that can handle state-action dependent discount factors. In this line of research, we can cite the references [27,28,29, 34, 38]; most of these references, however, assume that the discount factor is uniformly bounded above by some constant \(\alpha _0\) strictly less than one.

In this sense, an important contribution of this paper is to consider a state-action dependent discount factor that is allowed to take any value in the interval [0, 1]; in particular, a discount factor equal to one is permitted. From a modeling point of view, this allows us to deal with a wide variety of situations. As an illustration, a discount factor equal to zero can be interpreted as “killing” the process, as in optimal stopping problems [29, 31], while a discount factor equal to one can be interpreted, in the context of a hybrid control model, as the possibility for the controller to take several simultaneous actions [30].

On the other hand, independently of the type of discounting (constant or not) discussed above, there exist models whose nature imposes restrictions on some of their elements. The type of restriction we are interested in requires that certain additional discounted costs in the model do not exceed given values. This type of cost constraint is very useful in several areas and has generated mathematical challenges different from those arising in the unconstrained case; see, e.g., [2, 3, 8, 17, 20,21,22,23,24, 28, 33, 35].

This paper considers a discrete-time infinite-horizon Markov decision process with a varying discount factor. The controller wants to minimize a total expected discounted cost, subject to constraints on similar total expected discounted payoffs. The approach we follow is to rewrite this original constrained problem as a linear programming problem defined on a space of occupation measures. To study this linear programming problem, we introduce a suitable topology (namely, the weak-strong topology) and we study topological properties such as compactness of the set of occupation measures and semicontinuity of the functionals to be minimized. In addition, it is also useful to define a topology on the family of stationary Markov policies, which is achieved by considering the Young topology on a family of stochastic kernels. Once the linear programming problem has been solved, an optimal control policy can be derived by using a standard result on the decomposition of product measures.

From the previous paragraphs, we conclude that the main contribution of this paper lies in filling a gap between previous results on discounted constrained problems [2, 3, 8, 17, 21,22,23,24, 28, 33, 35] and the recent results in [29,30,31], which study unconstrained control problems with payoffs containing state-action dependent discount factors that have the flexibility to take the extreme values zero or one. To the best of our knowledge, models like the one in this paper have not been previously analyzed.

The rest of the paper is organized as follows: the remainder of this introductory section is concerned with some notation and terminology used in the manuscript. In Sect. 2 we describe the basic elements of the Markov decision model under study, the control policies, and the constrained optimization problem (CP) we want to solve. The section ends with two important facts: (i) our main result, which establishes the existence of optimal control policies for (CP), and (ii) the arguments ensuring the existence of control policies for which the objective function is finite. Section 3 is devoted to characterizing (CP) in terms of a linear programming problem (LP) defined on a space of occupation measures; the rest of that section focuses on preliminary results aimed at proving topological properties of the aforementioned space of occupation measures. Section 4 shows that (LP) is solvable by establishing compactness of the space of occupation measures; this, in turn, proves that the original (CP) has a solution as well. Section 5 gives several complementary results: a discussion of the hypotheses of the model, in which we propose an alternative set of mild conditions implying those originally stated; sufficient conditions yielding the existence of “simpler” optimal policies (namely, deterministic or chattering policies); and some applications, where we introduce a class of queueing systems with failures that can be modeled with our results. The paper ends with a section containing our conclusions.

Notation and terminology. For a measurable space \(({\textbf{X}},\varvec{\mathfrak {X}})\), we introduce the following notation: \({\mathcal {M}}({\textbf{X}})\) stands for the set of nonnegative measures, \({\mathcal {M}}^f({\textbf{X}})\) is the set of finite nonnegative measures, and \({\mathcal {P}}({\textbf{X}})\) denotes the family of probability measures. For any \(x\in {\textbf{X}}\) we will denote by \(\delta _x\in {\mathcal {P}}({\textbf{X}})\) the Dirac probability measure concentrated at x, i.e., \(\delta _x(B)={\textbf{I}}_B(x)\) for \(B\in \varvec{\mathfrak {X}}\), where \({\textbf{I}}_B(\cdot )\) is the indicator function of the set B. Given a measurable function \(f:{\textbf{X}}\rightarrow {\mathbb {R}}\) and \(\mu \in {\mathcal {M}}({\textbf{X}})\) we will use the notation \(\langle f,\mu \rangle =\int _{\textbf{X}}fd\mu \) provided that the integral is well defined.

If \(({\textbf{Y}},\varvec{\mathfrak {Y}})\) is another measurable space then we will consider the product space \({\textbf{X}}\times {\textbf{Y}}\) endowed with the product \(\sigma \)-algebra \(\varvec{\mathfrak {X}}\otimes \varvec{\mathfrak {Y}}\). For a measure \(\mu \in {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) we will denote by \(\mu ^{\textbf{X}}\in {\mathcal {M}}({\textbf{X}})\) and \(\mu ^{{\textbf{Y}}}\in {\mathcal {M}}({\textbf{Y}})\) the corresponding marginal measures, i.e.,

$$\begin{aligned} \mu ^{\textbf{X}}(B)=\mu (B\times {\textbf{Y}})\quad \hbox {and}\quad \mu ^{{\textbf{Y}}}(C)=\mu ({\textbf{X}}\times C)\quad \hbox {for}\quad B\in \varvec{\mathfrak {X}}\quad \hbox {and}\quad C\in \varvec{\mathfrak {Y}}. \end{aligned}$$

Similarly, if \(\varvec{\mathcal {O}}\subseteq {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) then we will write

$$\begin{aligned} \varvec{\mathcal {O}}^{\textbf{X}}=\{\mu ^{\textbf{X}}:\mu \in \varvec{\mathcal {O}}\}\subseteq {\mathcal {M}}({\textbf{X}})\quad \hbox {and}\quad \varvec{\mathcal {O}}^{{\textbf{Y}}}=\{\mu ^{{\textbf{Y}}}:\mu \in \varvec{\mathcal {O}}\}\subseteq {\mathcal {M}}({\textbf{Y}}). \end{aligned}$$

A substochastic kernel \(R(\cdot |\cdot )\) on \({\textbf{Y}}\) given \({\textbf{X}}\) is a function \(R:\varvec{\mathfrak {Y}}\times {\textbf{X}}\rightarrow [0,1]\) such that \(R(\cdot |x)\) is in \({\mathcal {M}}^f({\textbf{Y}})\) for each \(x\in {\textbf{X}}\) (with, necessarily, \(R({\textbf{Y}}|x)\le 1\)) and such that the mapping \(x\mapsto R(B|x)\) is measurable on \({\textbf{X}}\) for each \(B\in \varvec{\mathfrak {Y}}\). We say that R is a stochastic kernel when \(R({\textbf{Y}}|x)=1\) for all \(x\in {\textbf{X}}\). Given a measure \(\mu \in {\mathcal {M}}({\textbf{X}})\) we define \(\mu R\in {\mathcal {M}}({\textbf{Y}})\) by means of

$$\begin{aligned} \mu R\, (C)=\int _{\textbf{X}}R(C|x)\mu (dx) \end{aligned}$$
(1)

and \(\mu \otimes R\in {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) as

$$\begin{aligned} (\mu \otimes R)\,(B\times C)= \int _B R(C|x)\mu (dx) \end{aligned}$$

for \(B\in \varvec{\mathfrak {X}}\) and \(C\in \varvec{\mathfrak {Y}}\), with the obvious relation \((\mu \otimes R)^{\textbf{Y}}=\mu R\).

Given a probability measure \(\lambda \in {\mathcal {P}}({\textbf{X}})\), we denote by \(L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) and \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) the spaces of measurable functions \(f:{\textbf{X}}\rightarrow {\mathbb {R}}\) which are \(\lambda \)-integrable and \(\lambda \)-essentially bounded, respectively. As usual, we will identify functions which are \(\lambda \)-a.s. equal. Suppose that the functions \(\{f_n\}_{n\in {\mathbb {N}}}\) and f are in \(L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\). We will write

$$\begin{aligned} f_n{\mathop {\longrightarrow }\limits ^{L^1}}f\quad \hbox {when} \quad \int _{\textbf{X}}|f_n-f|d\lambda \rightarrow 0. \end{aligned}$$

On \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) we will consider the weak\(^*\) convergence defined as follows: given functions \(\{f_n\}_{n\in {\mathbb {N}}}\) and f in \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\), we will write \(f_n \ {{\mathop {\rightharpoonup }\limits ^{*}}\ }f\) when

$$\begin{aligned} \int _{\textbf{X}}f_nh\,d\lambda \rightarrow \int _{\textbf{X}}fh\,d\lambda \quad \hbox {for every}\quad h\in L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda ). \end{aligned}$$

A metric space \({\textbf{S}}\) will be always endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{S}})\). We say that \(f:{\textbf{S}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is lower semicontinuous (abbreviated l.s.c.) when the lower sections \(\{s\in {\textbf{S}}:f(s)\le z\}\) are closed for every \(z\in {\mathbb {R}}\). A function f is upper semicontinuous (u.s.c.) when \(-f\) is l.s.c.

For the measurable space \(({\textbf{X}},\varvec{\mathfrak {X}})\) and the metric space \({\textbf{S}}\), we say that \(f:{\textbf{X}}\times {\textbf{S}}\rightarrow {\mathbb {R}}\) is a Carathéodory function if \(f(\cdot ,s)\) is measurable on \({\textbf{X}}\) for each \(s\in {\textbf{S}}\) and \(f(x,\cdot )\) is continuous on \({\textbf{S}}\) for each \(x\in {\textbf{X}}\). We note that if \({\textbf{S}}\) is separable then any Carathéodory function is jointly measurable on \(\varvec{\mathfrak {X}}\otimes {\mathcal {B}}({\textbf{S}})\). We will denote by \(Car({\textbf{X}}\times {\textbf{S}})\) the set of all Carathéodory functions and by \(Car_b({\textbf{X}}\times {\textbf{S}})\) the family of bounded Carathéodory functions. If \(\lambda \in {\mathcal {P}}({\textbf{X}})\), we denote by \(Car_1({\textbf{X}}\times {\textbf{S}})\) the set of Carathéodory functions for which there exists \(G\in L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) such that \(|f(x,s)|\le G(x)\) for every \((x,s)\in {\textbf{X}}\times {\textbf{S}}\). In the notation \(Car_1({\textbf{X}}\times {\textbf{S}})\), since there will be no risk of confusion, we will not make \(\lambda \) explicit.

We recall that a Borel space is a measurable subset of a Polish space (which is, in turn, a complete and separable metric space). Finally, we shall use the convention that \(\sum _{k}^{k-1}\bullet =0\) and \(\prod _{k}^{k-1}\bullet =1\) for any \(k\ge 0\), and will use the notation \({\mathbb {N}}_0:={\mathbb {N}}\cup \{0\}\).

2 The Control Problem

In this section we introduce the basic elements of the Markov decision process under study, such as the state and action spaces, the transition law of the random process, the cost per stage, and the discount factor, among other elements. Next, we define the set of control policies used in the statement of the constrained optimization problem we want to solve. Using these definitions, we introduce our main set of assumptions in order to (i) state our main result concerning the existence of optimal control policies for our constrained problem, and (ii) show that our control problem is well posed by proving the existence of at least one control policy for which both the total expected discount factor and the total expected discounted cost are finite under our stated assumptions.

Elements of the control model. We will consider a constrained Markov control model given by the following elements.

  • The state space \({\textbf{X}}\) is a Borel space endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\).

  • The action space \({\textbf{A}}\) is a Borel space endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{A}})\). For each \(x\in {\textbf{X}}\), let \({\textbf{A}}(x)\) be a nonempty measurable subset of \({\textbf{A}}\), which stands for the actions available at state x. We also define

    $$\begin{aligned} {\mathbb {K}}=\{(x,a)\in {\textbf{X}}\times {\textbf{A}}: a\in {\textbf{A}}(x)\} \end{aligned}$$

    and assume that it is a measurable subset of \({\textbf{X}}\times {\textbf{A}}\) for the \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\otimes {\mathcal {B}}({\textbf{A}})\).

  • The transitions of the system are given by a stochastic kernel Q on \({\textbf{X}}\) given \({\textbf{X}}\times {\textbf{A}}\).

  • We define the following functions: the discount factor \(\alpha :{\textbf{X}}\times {\textbf{A}}\rightarrow [0,1]\), the cost-per-stage \(c:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\), and the constraint functions \(d_i:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}\) for \(i=1,\ldots ,q\), which are bounded below. All these functions are assumed to be measurable. The constraint constants are denoted by \(k_i\in {\mathbb {R}}\) for \(i=1,\ldots ,q\). Finally, we will write

    $$\begin{aligned} {\textbf{d}}=(d_{1},\ldots ,d_{q}):{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^{q} \quad \hbox {and}\quad {\textbf{k}}=(k_{1},\ldots ,k_{q}) \in {\mathbb {R}}^q. \end{aligned}$$
    (2)

Control policies. We define the following sets: \({\textbf{H}}_{0}={\textbf{X}}\) and, for any \(t\in {\mathbb {N}}\), we let \({\textbf{H}}_{t}=({\textbf{X}}\times {\textbf{A}})^{t}\times {\textbf{X}}\) endowed with the corresponding product \(\sigma \)-algebra. A typical element \(h_t\) of \({\textbf{H}}_t\) will be denoted by

$$\begin{aligned} h_{t}=(x_{0},a_{0},\ldots ,x_{t-1},a_{t-1},x_{t}). \end{aligned}$$

A randomized history-dependent control policy (or a control policy, for short) is a sequence \(\pi _t\), for \(t\in {\mathbb {N}}_0\), of stochastic kernels on \({\textbf{A}}\) given \({\textbf{H}}_t\) which satisfy, in addition, that

$$\begin{aligned} \pi _t({\textbf{A}}(x_t)|h_t)=1\quad \hbox {for every}\quad h_t\in {\textbf{H}}_t. \end{aligned}$$
(3)

We denote by \(\mathbf {\Pi }\) the set of all control policies.

Let \(\mathbf {\Phi }\) be the set of all stochastic kernels \(\varphi \) on \({\textbf{A}}\) given \({\textbf{X}}\) that satisfy \(\varphi ({\textbf{A}}(x)|x)=1\) for every \(x\in {\textbf{X}}\). We shall identify \(\varphi \in \mathbf {\Phi }\) with the control policy given by

$$\begin{aligned} \pi _t(\cdot |h_t)=\varphi (\cdot |x_t)\quad \hbox {for any}\quad t\in {\mathbb {N}}_0\quad \hbox {and}\quad h_t\in {\textbf{H}}_t, \end{aligned}$$

so that we have \(\mathbf {\Phi }\subseteq \mathbf {\Pi }\). We will say that \(\mathbf {\Phi }\) is the family of randomized stationary Markovian policies (in short, stationary policies). Let \({\textbf{F}}\) be the family of measurable functions \(f:{\textbf{X}}\rightarrow {\textbf{A}}\) such that \(f(x)\in {\textbf{A}}(x)\) for every \(x\in {\textbf{X}}\). Each \(f\in {\textbf{F}}\) is identified with the policy \(\pi _t(\cdot |h_t)=\delta _{f(x_t)}(\cdot )\), and we will refer to \({\textbf{F}}\) as the set of deterministic stationary Markov policies, with \({\textbf{F}}\subseteq \mathbf {\Phi }\).
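In the finite case, the inclusion \({\textbf{F}}\subseteq \mathbf {\Phi }\) amounts to representing a selector f by a kernel with one-hot rows. The following is a minimal sketch under this finiteness assumption; the sizes and the selector are hypothetical.

```python
# A minimal finite-space sketch of the inclusion F ⊆ Φ: a deterministic
# selector f is identified with the kernel phi(a|x) = delta_{f(x)}(a).
# The sizes and the selector below are hypothetical.
import numpy as np

n_states, n_actions = 4, 3
f = np.array([2, 0, 1, 1])                 # f(x), one available action per state
phi = np.eye(n_actions)[f]                 # one-hot rows: phi[x] = delta_{f(x)}

assert phi.shape == (n_states, n_actions)
assert np.allclose(phi.sum(axis=1), 1.0)   # each row is a probability on A
```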

The dynamic system. The space of all sample paths \(\mathbf {\Omega }=({\textbf{X}}\times {\textbf{A}})^{{\mathbb {N}}_0}\) is endowed with its product \(\sigma \)-algebra \(\varvec{\mathcal {F}}\). An element of \(\mathbf {\Omega }\) is denoted by \(\omega =(x_0,a_0,x_1,a_1,\ldots )\). We define the projection functions \(X_t,A_t,H_t\) on \(\mathbf {\Omega }\) and taking values in \({\textbf{X}},{\textbf{A}},{\textbf{H}}_t\) as the functions which associate to each \(\omega \in \mathbf {\Omega }\) the corresponding component \(x_t,a_t,h_t\), respectively.

By the Ionescu–Tulcea theorem (see, e.g., [25, Section 2]), given any initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\) and an arbitrary control policy \(\pi \in \mathbf {\Pi }\), there exists a unique probability measure \({\textbf{P}}^\pi _\nu \) on \((\mathbf {\Omega },\varvec{\mathcal {F}})\) that satisfies:

  1. (i).

    \({\textbf{P}}^\pi _\nu \{X_0\in B\}=\nu (B)\);

  2. (ii).

    \({\textbf{P}}^\pi _\nu \{A_t\in C\mid H_t\}=\pi _t(C| H_t)\);

  3. (iii).

    \({\textbf{P}}^\pi _\nu \{X_{t+1}\in B\mid H_t,A_t\}=Q(B|X_t,A_t)\)

for any \(B\in {\mathcal {B}}({\textbf{X}})\), \(C\in {\mathcal {B}}({\textbf{A}})\) and \(t\in {\mathbb {N}}_0\). The probability measure \({\textbf{P}}_{\nu }^\pi \) is referred to as a strategic probability measure. Note that, by construction of the process, we have \({\textbf{P}}^\pi _\nu ({\mathbb {K}}^{{\mathbb {N}}_0})=1\). The expectation operator of \({\textbf{P}}_{\nu }^{\pi }\) is denoted by \({\textbf{E}}_{\nu }^{\pi }\). If the initial distribution is given by the Dirac distribution at some \(x\in {\textbf{X}}\) then we will simply write \({\textbf{P}}^\pi _x\) and \({\textbf{E}}^\pi _x\).
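The construction (i)–(iii) translates directly into a simulation recipe. Below is a minimal sketch for a finite model; all model data are hypothetical toy inputs, and for simplicity the policy is a stationary kernel rather than a general history-dependent one.

```python
# A simulation sketch of the construction (i)-(iii), assuming finite state and
# action spaces and, for simplicity, a stationary kernel phi in place of a
# general history-dependent policy; all model data below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))   # Q(y|x,a): rows sum to 1
phi = rng.dirichlet(np.ones(nA), size=nS)       # phi(a|x)
nu = np.array([0.5, 0.3, 0.2])                  # initial distribution

def sample_path(T):
    """Sample (x_0, a_0, ..., x_T, a_T) under the strategic measure P^phi_nu."""
    path = []
    x = rng.choice(nS, p=nu)                    # (i)   X_0 ~ nu
    for _ in range(T + 1):
        a = rng.choice(nA, p=phi[x])            # (ii)  A_t ~ phi(.|x_t)
        path.extend([x, a])
        x = rng.choice(nS, p=Q[x, a])           # (iii) X_{t+1} ~ Q(.|x_t, a_t)
    return path

print(sample_path(5))
```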

The optimization problem. Given the initial probability measure \(\nu \in {\mathcal {P}}({\textbf{X}})\) and a control policy \(\pi \in \mathbf {\Pi }\), the objective function to be minimized by the controller is

$$\begin{aligned} V_{0}(\pi ,\nu )={\textbf{E}}_{\nu }^{\pi }\left[ \sum _{t=0}^{\infty }c(X_{t},A_{t})\prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] . \end{aligned}$$
(4)

Since the cost-per-stage function c is nonnegative, it follows that \(V_0(\pi ,\nu )\) is well defined, although it might be infinite. The total expected discounted value of the restrictions is defined in a similar way:

$$\begin{aligned} V_{i}(\pi ,\nu )={\textbf{E}}_{\nu }^{\pi }\left[ \sum _{t=0}^{\infty }d_i(X_{t},A_{t})\prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] \quad \hbox {for}\quad 1\le i\le q. \end{aligned}$$
(5)

Our assumptions below will ensure that these expectations are well-defined.

The constrained optimization problem (CP) for the controller is to minimize \(V_0(\pi ,\nu )\) over the set of all policies \(\pi \in \mathbf {\Pi }\) satisfying the restrictions: \(V_i(\pi ,\nu )\le k_i\) for each \(1\le i\le q\). This problem will be written

$$\begin{aligned} \hbox {(CP):}\quad \min \ V_0(\pi ,\nu ) \quad \hbox {subject to}\quad \pi \in \mathbf {\Pi }\quad \hbox {and}\quad V_i(\pi ,\nu )\le k_i\quad \hbox {for}\quad 1\le i\le q. \end{aligned}$$
(6)
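For intuition, here is a minimal Monte Carlo sketch of the objective (4) in the finite case; each \(V_i\) in (5) is estimated the same way with \(d_i\) in place of c. All model data below are hypothetical toy inputs.

```python
# A Monte Carlo sketch of V_0(phi, nu) in (4) with a state-action dependent
# discount factor; all inputs (Q, c, alpha, phi, nu) are hypothetical toy data.
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))   # Q(y|x,a)
c = rng.uniform(0.5, 2.0, size=(nS, nA))        # cost-per-stage c(x,a)
alpha = rng.uniform(0.2, 0.9, size=(nS, nA))    # discount factor alpha(x,a)
phi = rng.dirichlet(np.ones(nA), size=nS)       # stationary policy phi(a|x)
nu = np.full(nS, 1.0 / nS)                      # initial distribution

def estimate_V0(n_paths=5000, horizon=200):
    """Estimate E_nu[sum_t c(X_t,A_t) * prod_{j<t} alpha(X_j,A_j)]."""
    total = 0.0
    for _ in range(n_paths):
        x = rng.choice(nS, p=nu)
        disc, value = 1.0, 0.0
        for _ in range(horizon):
            a = rng.choice(nA, p=phi[x])
            value += disc * c[x, a]             # c(X_t,A_t) * prod_{j<t} alpha
            disc *= alpha[x, a]
            if disc < 1e-12:                    # truncate a negligible tail
                break
            x = rng.choice(nS, p=Q[x, a])
        total += value
    return total / n_paths

print("Monte Carlo estimate of V_0:", estimate_V0())
```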

Markov chains induced by stationary policies. Given an initial state \(x_0\in {\textbf{X}}\) and a stationary policy \(\varphi \in \mathbf {\Phi }\), it is clear that the stochastic process \(\{X_t\}_{t\ge 0}\) taking values in \({\textbf{X}}\) is a time-homogeneous Markov chain under the probability measure \({\textbf{P}}^\varphi _{x_0}\). Namely, its transition probabilities are given by the stochastic kernel on \({\textbf{X}}\) defined as

$$\begin{aligned} Q_\varphi (B|x)= \int _{\textbf{A}}Q(B|x,a)\varphi (da|x)\quad \hbox {for}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad x\in {\textbf{X}}. \end{aligned}$$

In addition, we also have that given an initial state-action pair \((x_0,a_0)\in {\textbf{X}}\times {\textbf{A}}\) and a stationary policy \(\varphi \in \mathbf {\Phi }\), the process \(\{(X_t,A_t)\}_{t\ge 0}\) is also a time-homogeneous Markov chain taking values in \({\textbf{X}}\times {\textbf{A}}\) with transition probability given by

$$\begin{aligned} \overline{Q}_\varphi (B\times C\mid x,a)= \int _B\varphi (C|y) Q(dy|x,a)\quad \hbox {for}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad C\in {\mathcal {B}}({\textbf{A}}). \end{aligned}$$

The probability measure on \(\mathbf {\Omega }\) induced by this Markov chain is denoted by \({\textbf{P}}^\varphi _{x_0,a_0}\) and its corresponding expectation operator is \({\textbf{E}}^\varphi _{x_0,a_0}\). This Markov chain will be referred to as the state-action Markov chain. It is related to the strategic probability measure by the obvious relation

$$\begin{aligned} {\textbf{P}}^\varphi _{x_0,a_0}\{\cdot \}={\textbf{P}}^\varphi _{x_0}\{\cdot \mid A_0=a_0\}. \end{aligned}$$

To conclude this section, we introduce the notations \(Q_\varphi ^t\) and \(\overline{Q}_\varphi ^t\), which stand for the t-step transition probabilities of the previously defined Markov chains. Namely, we let

$$\begin{aligned} Q^0_\varphi (B|x)=\delta _x(B) \quad \hbox {and}\quad Q^{t+1}_\varphi (B|x)=\int _{{\textbf{X}}} Q_\varphi (B|y)Q^{t}_\varphi (dy|x) \end{aligned}$$

for every \(t\in {\mathbb {N}}_0\), \(x\in {\textbf{X}}\), and \(B\in {\mathcal {B}}({\textbf{X}})\). Similarly, we write \(\overline{Q}_\varphi ^0(\cdot |x,a)=\delta _{(x,a)}(\cdot )\) and

$$\begin{aligned} \overline{Q}^{t+1}_\varphi (B\times C|x,a)=\int _{{\textbf{X}}\times {\textbf{A}}} \overline{Q}_\varphi (B\times C|z,b) \overline{Q}^t_{\varphi }(dz,db|x,a) \end{aligned}$$

for all \(t\in {\mathbb {N}}_0\), \((x,a)\in {\textbf{X}}\times {\textbf{A}}\), \(B\in {\mathcal {B}}({\textbf{X}})\), and \(C\in {\mathcal {B}}({\textbf{A}})\). Note that \(Q_\varphi =Q^1_\varphi \) and \(\overline{Q}_\varphi =\overline{Q}^1_\varphi \).
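In the finite case, the kernels \(Q_\varphi \) and \(\overline{Q}_\varphi \) are simply stochastic matrices and the t-step kernels are matrix powers. A minimal sketch with hypothetical toy data:

```python
# Building Q_phi, Qbar_phi and their t-step iterates for a finite model;
# Q and phi below are hypothetical toy inputs.
import numpy as np

rng = np.random.default_rng(2)
nS, nA = 4, 3
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))           # Q(y|x,a)
phi = rng.dirichlet(np.ones(nA), size=nS)               # phi(a|x)

# Q_phi(y|x) = int_A Q(y|x,a) phi(da|x): a stochastic matrix on the states.
Q_phi = np.einsum('xay,xa->xy', Q, phi)

# Qbar_phi((y,b)|(x,a)) = Q(y|x,a) phi(b|y) on state-action pairs.
Qbar_phi = np.einsum('xay,yb->xayb', Q, phi).reshape(nS * nA, nS * nA)

# t-step transitions are matrix powers, e.g. t = 3:
Q_phi_3 = np.linalg.matrix_power(Q_phi, 3)

assert np.allclose(Q_phi.sum(axis=1), 1.0)              # rows are probabilities
assert np.allclose(Qbar_phi.sum(axis=1), 1.0)
assert np.allclose(Q_phi_3.sum(axis=1), 1.0)            # preserved under iteration
```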

Assumptions. Let us now state the conditions we impose on the control model. Given \(0<\epsilon <1\), we define the following lower and upper sections of the discount factor function:

$$\begin{aligned} L_\epsilon =\{(x,a)\in {\textbf{X}}\times {\textbf{A}}:\alpha (x,a)< 1-\epsilon \} \end{aligned}$$

and

$$\begin{aligned} U_\epsilon =\{(x,a)\in {\textbf{X}}\times {\textbf{A}}:\alpha (x,a)\ge 1-\epsilon \}, \end{aligned}$$

which are both measurable subsets of \({\textbf{X}}\times {\textbf{A}}\).

Assumption 2.1

There exist constants \(0<\epsilon ,\beta <1\), \(k_0>0\), and an integer \(t_0\in {\mathbb {N}}\) such that:

  1. (a).

    For every \((x,a)\in U_\epsilon \) we have \(c(x,a)\ge k_0\).

  2. (b).

    For every \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), we have \({\textbf{P}}^\varphi _{x,a}\{(X_{t_0},A_{t_0})\in L_\epsilon \}\ge \beta \).

  3. (c).

    The optimization problem (CP) is feasible, i.e., there exists \(\pi _0\in \mathbf {\Pi }\) with \(V_0(\pi _0,\nu )<\infty \) and \(V_i(\pi _0,\nu )\le k_i\) for every \(1\le i\le q\).

We make some comments on this assumption. Observe that there is no loss of generality in supposing that the same \(\epsilon \) is valid in Assumptions 2.1(a) and (b). This is because, as \(\epsilon \) decreases to zero, the sets \(L_{\epsilon }\) increase while the sets \(U_{\epsilon }\) decrease. The condition in (a) fails when there exists a sequence \(\{(x_n,a_n)\}\) in \({\textbf{X}}\times {\textbf{A}}\) with \(\alpha (x_n,a_n)\uparrow 1\) and \(c(x_n,a_n)\downarrow 0\). The interpretation of (a) is that discount factors close to 1 are somehow penalized, so that it cannot be optimal for the controller to remain “very long” in \(U_\epsilon \).

The condition in (b) states that, starting from any initial state-action pair in \(U_\epsilon \) and using any stationary policy, there is a probability of at least \(\beta \) of reaching \(L_\epsilon \) after \(t_0\) transitions.

Note that condition (c) merely excludes trivial cases. Indeed, if every policy satisfying the constraints is such that \(V_0(\pi ,\nu )=\infty \), then the optimization problem is trivial.

We also need to make the following technical hypotheses on the control model.

Assumption 2.2

  1. (a)

    The action space \({\textbf{A}}\) is a compact metric space, the action sets \({\textbf{A}}(x)\) are nonempty and compact for every \(x\in {\textbf{X}}\), and the correspondence \(x\mapsto {\textbf{A}}(x)\) from \({\textbf{X}}\) to \({\textbf{A}}\) is weakly measurable.

  2. (b)

    There exist a measurable function \(q:{\textbf{X}}\times {\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\) and \(\lambda \in {\mathcal {P}}({\textbf{X}})\) such that

    $$\begin{aligned} Q(B|x,a)=\int _B q(y|x,a)\lambda (dy)\quad \hbox {for every}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad (x,a)\in {\textbf{X}}\times {\textbf{A}}. \end{aligned}$$

    For each fixed \(x,y\in {\textbf{X}}\), the mapping \(a\mapsto q(y|x,a)\) is continuous on \({\textbf{A}}\). There exists a nonnegative measurable function \((x,y)\mapsto q(y|x)\) on \({\textbf{X}}^2\) and a constant \({\textbf{q}}>0\) with

    $$\begin{aligned} 0\le q(y|x,a)\le q(y|x)\quad \hbox {and}\quad \int _{\textbf{X}}q(y|x)\lambda (dy)\le {\textbf{q}} \end{aligned}$$

    for every \(x,y\in {\textbf{X}}\) and \(a\in {\textbf{A}}\).

  3. (c)

    The cost function \(c:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\) is l.s.c. on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\).

    The constraint functions \(d_i:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}\) are bounded below and l.s.c. on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\), for every \(i=1,\ldots ,q\).

    The discount factor function \(\alpha :{\textbf{X}}\times {\textbf{A}}\rightarrow [0,1]\) is in \(Car_b({\textbf{X}}\times {\textbf{A}})\).

  4. (d)

    The initial distribution \(\nu \) satisfies \(\nu \ll \lambda \).

Using [1, Theorem 18.6], it follows from Assumption 2.2(a) that \({\mathbb {K}}\) is a measurable subset of \({\textbf{X}}\times {\textbf{A}}\). Moreover, the Kuratowski–Ryll-Nardzewski selection theorem [1, Theorem 18.13] implies the existence of measurable selectors for the correspondence \(x\mapsto {\textbf{A}}(x)\). In particular, the sets \({\textbf{F}}\), \(\mathbf {\Phi }\), and \(\mathbf {\Pi }\) are nonempty. As a consequence of Assumption 2.2(b) and dominated convergence, for each \(x\in {\textbf{X}}\) and any sequence \(a_n\rightarrow a\) in \({\textbf{A}}\) we have \(q(\cdot |x,a_n){\mathop {\longrightarrow }\limits ^{L^1}} q(\cdot |x,a)\).

We note that our hypotheses in Assumptions 2.1 and 2.2 are quite standard, except for Assumption 2.1(b). At the end of the paper—see Proposition 5.1 below—we will provide sufficient conditions for Assumption 2.1(b) which are more easily verifiable in some instances or special cases.

Main result. Under the previous conditions, we can solve the control problem (CP). More specifically, we have the following:

Theorem 2.3

If Assumptions 2.1 and 2.2 hold then there exists a randomized stationary Markovian policy which is optimal for the constrained control problem (CP).

The proof of this result will be given in Sect. 4.

Finiteness of the total expected discount factor. Now we prove some results that will be useful in what follows. In particular, we will explore conditions under which the total expected discount factor over the whole time horizon, i.e.,

$$\begin{aligned} {\textbf{E}}^\pi _\nu \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] \end{aligned}$$
(7)

is finite. Note that in a control model with a discount factor bounded away from one, that is, \(\alpha (x,a)\le \alpha _0<1\), such a result is trivial because we have

$$\begin{aligned} \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j) \le \frac{1}{1-\alpha _0}. \end{aligned}$$

In our case, however, the discount factor can take values arbitrarily close to one, or even equal to one, and the finiteness of (7) is far from trivial. In fact, it is a key result because, as we shall see in later sections, it is closely related to the finiteness of the occupation measure of a policy.

Proposition 2.1

Suppose that Assumption 2.1 is satisfied.

  1. (i)

    If \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite then

    $$\begin{aligned} {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] <\infty . \end{aligned}$$
  2. (ii)

    For every \(\varphi \in \mathbf {\Phi }\) and \(x\in {\textbf{X}}\) we have

    $$\begin{aligned} {\textbf{E}}_x^\varphi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] < \frac{1+t_0/\beta }{\epsilon }. \end{aligned}$$

Proof

We define the random times \(S_r\) and \(T_r\) on \(\mathbf {\Omega }\) taking values in \({\mathbb {N}}_0\cup \{+\infty \}\) as follows. Let \(S_0=0\) and for every \(r\in {\mathbb {N}}\) define

$$\begin{aligned} T_r=\min \{k\ge S_{r-1}:(x_k,a_k)\in U_{\epsilon }\}\quad \text{ and }\quad S_r=\min \{k>T_{r}:(x_k,a_k)\in L_{\epsilon }\}. \end{aligned}$$

Consequently, we have either:

  • \((x_0,a_0)\in L_{\epsilon }\) and then \(0=S_0<T_1<S_1<T_2<\cdots \), or

  • \((x_0,a_0)\in U_{\epsilon }\) and then \( 0=S_0=T_1<S_1<T_2<\cdots \).

Observe that when some \(T_r\) or \(S_r\) is infinite, then all the subsequent times \(\{S_k\}_{k\ge r}\) and \(\{T_k\}_{k\ge r+1}\) are infinite as well. Hence, for every sample path \(\omega =(x_0,a_0,x_1,a_1,\ldots )\) the time horizon is partitioned into disjoint intervals (some of them could be empty)

$$\begin{aligned} \{0,1,2,\ldots \}= [S_0,T_1)\cup [T_1,S_1)\cup [S_1,T_2)\cup \cdots \end{aligned}$$

so that

$$\begin{aligned} \begin{array}{ll} \alpha (x_k,a_k)< 1-\epsilon &{}\hbox {when}\quad S_{r-1}\le k< T_r,\quad \hbox {and} \\ \alpha (x_k,a_k)\ge 1-\epsilon &{}\hbox {when}\quad T_r\le k<S_r. \end{array} \end{aligned}$$
(8)
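As a computational aside, the random times \(T_r\) and \(S_r\) can be read off any recorded sample path. The following sketch does this from the boolean trace of the event \(\{(x_k,a_k)\in U_\epsilon \}\); the trace shown is a hypothetical example.

```python
# Computing the excursion times T_r, S_r of the proof from a finite sample
# path; in_U[k] is True when (x_k, a_k) lies in U_eps.  Hypothetical trace.
def excursion_times(in_U):
    """T_r = first entry into U_eps at or after S_{r-1};
    S_r = first return to L_eps strictly after T_r."""
    T, S = [], []
    s_prev, n = 0, len(in_U)
    while True:
        t = next((k for k in range(s_prev, n) if in_U[k]), None)      # T_r
        if t is None:
            break
        T.append(t)
        s = next((k for k in range(t + 1, n) if not in_U[k]), None)   # S_r
        if s is None:
            break
        S.append(s)
        s_prev = s
    return T, S

# Pattern L L U U U L U L  ->  T = [2, 6], S = [5, 7]
print(excursion_times([False, False, True, True, True, False, True, False]))
```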

Given some \(\omega \in \mathbf {\Omega }\) and \(t\in {\mathbb {N}}\), we are going to derive bounds on \(\prod _{j=0}^{t-1}\alpha (x_j,a_j)\). Indeed, recalling (8), we will use the bounds

$$\begin{aligned} \begin{array}{ll} \alpha (x_j,a_j)< 1-\epsilon &{} \hbox {when}\quad S_{r-1}\le j<T_r\quad \hbox {for some}\quad r\in {\mathbb {N}},\\ \alpha (x_j,a_j)\le 1 &{} \hbox {when}\quad T_r\le j<S_r\quad \hbox {for some}\quad r\in {\mathbb {N}}. \end{array} \end{aligned}$$

Therefore, if t is such that \(S_{r-1}<t\le T_r\) for some \(r\in {\mathbb {N}}\) then

$$\begin{aligned} \prod _{j=0}^{t-1}\alpha (x_j,a_j)\le (1-\epsilon )^{(T_1-S_0)+\cdots +(T_{r-1}-S_{r-2})+t-S_{r-1}}, \end{aligned}$$

while if \(T_r<t\le S_r\) for some \(r\in {\mathbb {N}}\) then

$$\begin{aligned} \prod _{j=0}^{t-1}\alpha (x_j,a_j)\le (1-\epsilon )^{(T_1-S_0)+\cdots +(T_{r}-S_{r-1})}. \end{aligned}$$

Once these bounds have been established, we observe that

$$\begin{aligned} \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (x_j,a_j)= & {} 1+\sum _{r=1}^\infty \sum _{S_{r-1}<t\le T_r}\prod _{j=0}^{t-1}\alpha (x_j,a_j) \nonumber \\{} & {} +\sum _{r=1}^\infty \sum _{T_{r}<t\le S_r}\prod _{j=0}^{t-1}\alpha (x_j,a_j). \end{aligned}$$
(9)

Combining this equation with the previous bounds we observe that

$$\begin{aligned}&\sum _{r=1}^\infty \sum _{S_{r-1}<t\le T_r}\prod _{j=0}^{t-1}\alpha (x_j,a_j) \\&\quad<\sum _{r=1}^\infty \sum _{S_{r-1}<t\le T_r} (1-\epsilon )^{(T_1-S_0)+\cdots +(T_{r-1}-S_{r-2})+t-S_{r-1}}\\&\quad \le \sum _{k=1}^\infty (1-\epsilon )^k=\frac{1-\epsilon }{\epsilon } \end{aligned}$$

because, as t and r vary, the exponent of \(1-\epsilon \) starts at 1 and then increases by one unit at each step: if some \(S_r\) is infinite then there will be a finite number of terms and, otherwise, the sum will run across the whole geometric series. In either case, we obtain the desired inequality. This gives a bound on the first term of the right-hand side of (9). For the second term in (9) we have

$$\begin{aligned} \sum _{r=1}^\infty \sum _{T_{r}<t\le S_r}\prod _{j=0}^{t-1}\alpha (x_j,a_j)\le & {} \sum _{r=1}^\infty \sum _{T_r<t\le S_r} (1-\epsilon )^{(T_1-S_0)+\cdots +(T_{r}-S_{r-1})}\\\le & {} \sum _{r=1}^\infty (S_r-T_r) (1-\epsilon )^{(T_1-S_0)+\cdots +(T_{r}-S_{r-1})}\\\le & {} \sum _{r=1}^\infty (S_r-T_r)(1-\epsilon )^{r-1} \end{aligned}$$

because each term in the exponent of \(1-\epsilon \) is greater than or equal to 1 except perhaps \(T_1-S_0\) which may vanish. Summarizing, we have so far established the following general inequalities: for any \(\omega \in \mathbf {\Omega }\)

$$\begin{aligned} \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (x_j,a_j)< & {} \frac{1}{\epsilon } +\sum _{r=1}^\infty \sum _{T_{r}<t\le S_r}\prod _{j=0}^{t-1}\alpha (x_j,a_j)\end{aligned}$$
(10)
$$\begin{aligned}\le & {} \frac{1}{\epsilon }+\sum _{r=1}^\infty (S_r-T_r)(1-\epsilon )^{r-1}. \end{aligned}$$
(11)

Proof of (i). Let \(\pi \in \mathbf {\Pi }\) be such that \(V_0(\pi ,\nu )\) is finite. By (10) and recalling Assumption 2.1(a) we have

$$\begin{aligned}&{\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] \\&\quad \le \frac{1}{\epsilon }+ \frac{1}{k_0}{\textbf{E}}_\nu ^\pi \left[ \sum _{r=1}^\infty \sum _{T_{r}<t\le S_r}c(X_{t-1},A_{t-1})\prod _{j=0}^{t-2}\alpha (X_j,A_j)\right] \end{aligned}$$

since \(c(x_{t-1},a_{t-1})\ge k_0\) when \(T_r<t\le S_r\) because, for such t, we have \(\alpha (x_{t-1},a_{t-1})\ge 1-\epsilon \). Therefore, the cost function c being nonnegative,

$$\begin{aligned} {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] <\frac{1}{\epsilon }+\frac{V_0(\pi ,\nu )}{k_0}, \end{aligned}$$

which concludes the proof of statement (i).

Proof of (ii). Let \(\varphi \in \mathbf {\Phi }\) be an arbitrary stationary policy. Suppose that we are given an initial state-action pair \((x_0,a_0)\in U_{\epsilon }\) and let the integer \(t_0\) be as in Assumption 2.1. We have

$$\begin{aligned} {\textbf{P}}^\varphi _{x_0,a_0}\{ S_1>t_0\}\le & {} {\textbf{P}}^\varphi _{x_0,a_0}\{(X_{t_0},A_{t_0})\in U_{\epsilon }\} \nonumber \\= & {} 1-{\textbf{P}}^\varphi _{x_0,a_0}\{(X_{t_0},A_{t_0})\in L_{\epsilon }\}\le 1-\beta \end{aligned}$$
(12)

for every \((x_0,a_0)\in U_{\epsilon }\). On the other hand, for any integer \(k\ge 1\)

$$\begin{aligned} {\textbf{P}}^\varphi _{x_0,a_0}\{S_1>kt_0\}= {\textbf{P}}^\varphi _{x_0,a_0}\Big \{\bigcap _{j=1}^k\{S_1>jt_0\}\Big \}. \end{aligned}$$
(13)

Given any j we can write, for \((x_0,a_0),\ldots ,(x_{(j-1)t_0},a_{(j-1)t_0})\in U_{\epsilon }\) and using the Markov property for the stationary policy \(\varphi \in \mathbf {\Phi }\),

$$\begin{aligned}&{\textbf{P}}^\varphi _{x_0,a_0}\{ S_1>jt_0 \mid X_0=x_0, A_0=a_0,\ldots \\&\qquad \ldots , X_{(j-1)t_0}=x_{(j-1)t_0},A_{(j-1)t_0}=a_{(j-1)t_0}\}\\&\quad = {\textbf{P}}^\varphi _{x_{(j-1)t_0},a_{(j-1)t_0}}\{S_1>t_0\}\le 1-\beta , \end{aligned}$$

where we make use of (12) for the initial state-action pair \((x_{(j-1)t_0},a_{(j-1)t_0})\in U_{\epsilon }\). Therefore, for all \(j=1,\ldots ,k\),

$$\begin{aligned}&{\textbf{P}}^\varphi _{x_0,a_0} \{ S_1>jt_0\mid S_1>t_0,\ldots , S_1>(j-1)t_0\}\\&\quad = {\textbf{E}}^\varphi _{x_0,a_0}\left[ {\textbf{P}}^\varphi _{x_0,a_0} \{S_1>jt_0\mid X_0,A_0,\ldots ,X_{(j-1)t_0},A_{(j-1)t_0}\} \right. \\&\qquad \left. \mid S_1>t_0,\ldots , S_1>(j-1)t_0\right] \le 1-\beta . \end{aligned}$$

Consequently, recalling (13) and using the chain rule of conditional probability,

$$\begin{aligned} {\textbf{P}}^\varphi _{x_0,a_0}\{C_1\cap C_2\cap \cdots \cap C_k\}= & {} {\textbf{P}}^\varphi _{x_0,a_0}\{C_k\mid C_1\cap \cdots \cap C_{k-1}\}\times \cdots \\{} & {} \times {\textbf{P}}^\varphi _{x_0,a_0}\{C_2|C_1\}{\textbf{P}}^\varphi _{x_0,a_0}\{C_1\}, \end{aligned}$$

with \(C_j=\{S_1>jt_0\}\), we obtain

$$\begin{aligned} {\textbf{P}}^\varphi _{x_0,a_0}\{S_1>kt_0\}= & {} \prod _{j=1}^k {\textbf{P}}^\varphi _{x_0,a_0} \{ S_1>jt_0\mid S_1>t_0,\ldots , S_1>(j-1)t_0\}\\\le & {} (1-\beta )^k. \end{aligned}$$

In conclusion,

$$\begin{aligned} {\textbf{E}}^\varphi _{x_0,a_0}[S_1] = \sum _{r=0}^\infty {\textbf{P}}^\varphi _{x_0,a_0}\{ S_1>r\} \le t_0\cdot \sum _{k=0}^\infty {\textbf{P}}^\varphi _{x_0,a_0}\{ S_1>kt_0\} \le t_0/\beta . \end{aligned}$$
(14)

We proceed with the proof and, at this point, we recall the inequality (11). We know that the expected time the process spends in \(U_\epsilon \) when starting from \((x_0,a_0)\in U_\epsilon \) is bounded by \(t_0/\beta \) uniformly in \((x_0,a_0)\)—recall (14)—and, therefore, using the Markov property and the fact that \(\varphi \) is stationary, it follows that \({\textbf{E}}^\varphi _{x}[S_r-T_r]\le t_0/\beta \) for any \(r\in {\mathbb {N}}\) and \(x\in {\textbf{X}}\). Consequently, taking \({\textbf{E}}^\varphi _x\)-expectation in (11) and using this bound, we derive that

$$\begin{aligned} {\textbf{E}}^\varphi _{x}\left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] <\frac{1+t_0/\beta }{\epsilon }, \end{aligned}$$

which concludes the proof. \(\square \)
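To make the bound in part (ii) concrete, the following sketch checks it numerically on a hypothetical two-state, single-action chain (so the only stationary policy is trivial): with \(\epsilon =0.3\), the state with \(\alpha =1\) lies in \(U_\epsilon \), the state with \(\alpha =0.5\) lies in \(L_\epsilon \), and Assumption 2.1(b) holds with \(t_0=1\) and \(\beta =0.4\). Since all discount factors here depend only on the state, the expected series solves a finite linear system.

```python
# A numeric sanity check of Proposition 2.1(ii) on a hypothetical two-state,
# single-action chain: v(x) = E_x[sum_t prod_{j<t} alpha(X_j)] solves
# (I - diag(alpha) P) v = 1, and v must stay below (1 + t0/beta)/eps.
import numpy as np

alpha = np.array([0.5, 1.0])          # alpha(x); x = 1 is in U_eps, x = 0 in L_eps
P = np.array([[0.7, 0.3],             # P(y|x)
              [0.4, 0.6]])            # from x = 1, reach L_eps w.p. beta = 0.4
eps, beta, t0 = 0.3, 0.4, 1

v = np.linalg.solve(np.eye(2) - alpha[:, None] * P, np.ones(2))
bound = (1 + t0 / beta) / eps
print(v, bound)                       # v = [2.75, 5.25], bound ~ 11.67
assert (v < bound).all()
```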

3 Occupation Measures and Young Measures

In this section, we introduce the set of occupation measures and establish (i) sufficient conditions that guarantee the finiteness of such occupation measures, and (ii) a useful characterization of this set of measures by means of linear equations. This will allow us to establish the equivalence between the original constrained control problem (CP) defined in (6) and a linear programming problem on a space of measures defined in (20) below. We also introduce Young measures and the Young topology as a useful tool to define a topology in the space \(\mathbf {\Phi }\) of stationary policies, for which \(\mathbf {\Phi }\) becomes a compact metric space.

Definition 3.1

Given a policy \(\pi \in \mathbf {\Pi }\) and the initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\), we define its occupation measure \(\mu _{\pi }\in {\mathcal {M}}({\textbf{X}}\times {\textbf{A}})\) as follows: for any measurable set \(\Gamma \subseteq {\textbf{X}}\times {\textbf{A}}\), let

$$\begin{aligned} \mu _\pi (\Gamma )={\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty {\textbf{I}}_\Gamma (X_t,A_t)\prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] . \end{aligned}$$
(15)

By construction of the controlled process, the occupation measure is supported on \({\mathbb {K}}\), meaning that \(\mu _\pi ( {\mathbb {K}}^c)=0\). Since the initial distribution \(\nu \) will remain fixed, we will not make it explicit in the notation \(\mu _\pi \). The total mass of the measure \(\mu _\pi \) is given by (cf. (7))

$$\begin{aligned} \mu _\pi ({\mathbb {K}})= {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \end{aligned}$$
(16)

and so the occupation measure \(\mu _\pi \) need not be a finite measure. Under Assumption 2.1 and as a consequence of Proposition 2.1, however, if \(\pi \in \mathbf {\Pi }\) is either stationary or such that \(V_0(\pi ,\nu )\) is finite, then \(\mu _\pi \) is necessarily a finite measure.
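For a finite model and a stationary policy, the occupation measure can be computed exactly from the linear system \(\mu ^{\textbf{X}}=\nu +\mu \,\alpha Q\) established in Theorem 3.1 below. A minimal sketch with hypothetical toy data (here \(\alpha <1\), so the system is nonsingular):

```python
# Computing the occupation measure (15) of a stationary policy phi for a
# finite model via the linear characterization mu^X = nu + mu alpha Q;
# all inputs are hypothetical toy data with alpha(x,a) < 1.
import numpy as np

rng = np.random.default_rng(3)
nS, nA = 3, 2
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))     # Q(y|x,a)
alpha = rng.uniform(0.1, 0.8, size=(nS, nA))      # alpha(x,a)
phi = rng.dirichlet(np.ones(nA), size=nS)         # phi(a|x)
nu = np.full(nS, 1.0 / nS)

# With mu(x,a) = mu_X(x) phi(a|x), the equation mu^X = nu + mu alpha Q reads
# mu_X = nu + mu_X M, where M(x,y) = sum_a phi(a|x) alpha(x,a) Q(y|x,a).
M = np.einsum('xa,xa,xay->xy', phi, alpha, Q)
mu_X = np.linalg.solve(np.eye(nS) - M.T, nu)      # row-vector equation mu_X(I-M)=nu
mu = mu_X[:, None] * phi                          # mu = mu^X ⊗ phi

print("total mass mu(K):", mu.sum())              # finite, cf. (16)
```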

The following proposition concerns a useful result about the decomposition of product measures; see [7, pp. 139–144] or [13, pp. 88–89] for further details. A finite-dimensional sketch follows the statement.

Proposition 3.1

Let \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\).

  1. (i)

    There exists a \(\mu ^{\textbf{X}}\)-a.s. unique stochastic kernel \(\varphi \) on \({\textbf{A}}\) given \({\textbf{X}}\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \).

  2. (ii)

    If \(\mu \) is supported on \({\mathbb {K}}\) (i.e., \(\mu ({\mathbb {K}}^c)=0\)), then there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \).
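In the finite case the disintegration of Proposition 3.1 is elementary: divide each row of \(\mu \) by its marginal mass, choosing the kernel arbitrarily on \(\mu ^{\textbf{X}}\)-null rows (which is exactly where uniqueness fails). A short sketch with hypothetical data:

```python
# Disintegration mu = mu^X ⊗ phi of a finite measure on a 2x2 product space;
# the matrix mu is hypothetical.  Rows with mu^X(x) = 0 get an arbitrary
# kernel (uniform here), illustrating the mu^X-a.s. uniqueness in (i).
import numpy as np

mu = np.array([[0.2, 0.1],
               [0.0, 0.0]])
mu_X = mu.sum(axis=1)                              # marginal mu^X
phi = np.divide(mu, mu_X[:, None],
                out=np.full_like(mu, 0.5),         # arbitrary choice on null rows
                where=mu_X[:, None] > 0)

assert np.allclose(mu_X[:, None] * phi, mu)        # mu = mu^X ⊗ phi
```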

Our next result provides a useful characterization of the occupation measures introduced in (15) as well as some of their properties. In (17) below, note that \(\alpha Q\equiv \alpha (x,a)Q(B|x,a)\) is a substochastic kernel on \({\textbf{X}}\) given \({\textbf{X}}\times {\textbf{A}}\), and so the notation \(\mu \alpha Q=\mu (\alpha Q)\) comes from (1).

Theorem 3.1

Let Assumption 2.1 hold true and consider an initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\).

  1. (i)

    (Linear equations.) Given any \(\pi \in \mathbf {\Pi }\), its occupation measure \(\mu _\pi \) satisfies the conditions

    $$\begin{aligned} \mu ({\mathbb {K}}^c)=0\quad \hbox {and}\quad \mu ^{\textbf{X}}=\nu +\mu \alpha Q, \end{aligned}$$
    (17)

    written in the variable \(\mu \in {\mathcal {M}}({\textbf{X}}\times {\textbf{A}})\).

  2. (ii)

    (Conditions for finiteness.) The occupation measure of any stationary policy \(\varphi \in \mathbf {\Phi }\) is finite: \(\mu _\varphi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\). If \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite, then its occupation measure is finite: \(\mu _\pi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\).

  3. (iii)

    (Sufficiency of stationary policies.) If a finite measure \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfies (17) then there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu _\varphi \). In particular, if \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite then there exists \(\varphi \in \mathbf {\Phi }\) with \(\mu _\pi =\mu _\varphi \).

  4. (iv)

    (Boundedness.) The set of measures \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfying (17) is bounded.

Proof

(i). The fact that \(\mu _\pi ({\mathbb {K}}^c)=0\) follows by construction of the state-action process. Fix an arbitrary \(B\in {\mathcal {B}}({\textbf{X}})\). By definition, we have

$$\begin{aligned} \mu ^{\textbf{X}}_\pi (B)= & {} {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty {\textbf{I}}_B(X_t)\prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \\= & {} \nu (B) + {\textbf{E}}_\nu ^\pi \left[ {\textbf{E}}_\nu ^\pi \left[ \sum _{t=1}^\infty {\textbf{I}}_B(X_t)\prod _{j=0}^{t-1} \alpha (X_j,A_j)\mid H_{t-1},A_{t-1}\right] \right] \\= & {} \nu (B) + {\textbf{E}}_\nu ^\pi \left[ \sum _{t=1}^\infty Q(B|X_{t-1},A_{t-1})\prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] . \end{aligned}$$

The previous expression becomes (after changing the indices)

$$\begin{aligned} \mu ^{\textbf{X}}_\pi (B)= & {} \nu (B) + {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \alpha (X_t,A_t)Q(B|X_t,A_t) \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \\= & {} \nu (B)+ \int _{{\textbf{X}}\times {\textbf{A}}} \alpha (x,a)Q(B|x,a) \mu _\pi (dx,da), \end{aligned}$$

which is precisely the condition \(\mu _\pi ^{\textbf{X}}=\nu +\mu _\pi \alpha Q\).

(ii). Regarding the result for stationary policies, recalling the expression (15), it follows that

$$\begin{aligned} \mu _\varphi ({\mathbb {K}})=\int _{\textbf{X}}{\textbf{E}}_x^\varphi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \nu (dx). \end{aligned}$$

By Proposition 2.1(ii) we obtain that \(\mu _\varphi ({\mathbb {K}})\) is finite. Concerning a policy \(\pi \in \mathbf {\Pi }\), the fact that \(\mu _\pi \) is finite when \(V_0(\pi ,\nu )\) is finite comes from Proposition 2.1(i).

(iii). Suppose that \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) is a finite measure satisfying the equations \(\mu ({\mathbb {K}}^c)=0\) and \(\mu ^{\textbf{X}}=\nu +\mu \alpha Q\). By the disintegration result in Proposition 3.1(ii), there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \) and so we have

$$\begin{aligned} \mu ^{\textbf{X}}(B)=\nu (B) +\int _{\textbf{X}}\int _{\textbf{A}}\alpha (x,a)Q(B|x,a)\varphi (da|x) \mu ^{\textbf{X}}(dx). \end{aligned}$$

Iterating this expression, we obtain for each \(n\ge 1\)

$$\begin{aligned} \mu ^{\textbf{X}}(B)= & {} {\textbf{E}}_\nu ^\varphi \left[ \sum _{t=0}^{n-1} {\textbf{I}}_B(X_t) \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] \nonumber \\{} & {} +\int _{\textbf{X}}{\textbf{E}}_x^\varphi \left[ {\textbf{I}}_B(X_n) \prod _{j=0}^{n-1} \alpha (X_j,A_j)\right] \mu ^{\textbf{X}}(dx). \end{aligned}$$
(18)

The first expression in the right hand side of this equation converges, by monotone convergence, to \(\mu ^{\textbf{X}}_\varphi (B)\). Regarding the second expression, we have the inequality

$$\begin{aligned} 0\le {\textbf{E}}_x^\varphi \left[ {\textbf{I}}_B(X_n) \prod _{j=0}^{n-1} \alpha (X_j,A_j)\right] \le {\textbf{E}}_x^\varphi \left[ \prod _{j=0}^{n-1} \alpha (X_j,A_j)\right] \quad \hbox {for any}\quad x\in {\textbf{X}}. \end{aligned}$$
(19)

By the result in Proposition 2.1, we have that the following series converges:

$$\begin{aligned} \sum _{n=0}^\infty {\textbf{E}}_x^\varphi \left[ \prod _{j=0}^{n-1} \alpha (X_j,A_j)\right] \le \frac{1+t_0/\beta }{\epsilon }, \end{aligned}$$

so that for any \(x\in {\textbf{X}}\) we have that

$$\begin{aligned} \lim _{n\rightarrow \infty } {\textbf{E}}_x^\varphi \left[ \prod _{j=0}^{n-1} \alpha (X_j,A_j)\right] =0. \end{aligned}$$

We can now use dominated convergence with respect to the finite measure \(\mu ^{\textbf{X}}(dx)\) to conclude from (19) that the rightmost term in (18) converges to 0 as \(n\rightarrow \infty \). Taking the limit as \(n\rightarrow \infty \) in (18), this shows that \(\mu ^{\textbf{X}}\) coincides with the \({\textbf{X}}\)-marginal of the occupation measure of \(\varphi \): \(\mu ^{\textbf{X}}=\mu ^{\textbf{X}}_\varphi \). Since, by disintegration, we have \(\mu =\mu ^{\textbf{X}}\otimes \varphi \), we conclude that \(\mu =\mu _\varphi \).

The last statement in (iii) easily follows from parts (i) and (ii).

(iv). If \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfies (17) then \(\mu =\mu _\varphi \) for some \(\varphi \in \mathbf {\Phi }\) (see item (iii)) and so, recalling (16),

$$\begin{aligned} \mu ({\textbf{X}}\times {\textbf{A}})=\mu ({\mathbb {K}})={\textbf{E}}^\varphi _\nu \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \le \frac{1+t_0/\beta }{\epsilon } \end{aligned}$$

by Proposition 2.1(ii). Hence,

$$\begin{aligned} \sup _{\varphi \in \mathbf {\Phi }} \mu _\varphi ({\textbf{X}}\times {\textbf{A}})<\infty , \end{aligned}$$

which completes the proof. \(\square \)

Linear programming formulation. It is clear from the definition of the occupation measure, and recalling that c is nonnegative, that for any \(\pi \in \mathbf {\Pi }\) we have

$$\begin{aligned} V_0(\pi ,\nu )=\int _{{\textbf{X}}\times {\textbf{A}}}c\,d\mu _\pi =\langle c,\mu _\pi \rangle \in {\mathbb {R}}^+\cup \{+\infty \}. \end{aligned}$$

In addition, if \(\pi \in \mathbf {\Pi }\) is such that \(\mu _\pi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) then the values of the constraints \(V_i(\pi ,\nu )\) for \(1\le i\le q\) are well defined and satisfy

$$\begin{aligned} V_i(\pi ,\nu )=\int _{{\textbf{X}}\times {\textbf{A}}}d_i\,d\mu _\pi =\langle d_i,\mu _\pi \rangle \in {\mathbb {R}}\cup \{+\infty \}. \end{aligned}$$

This is because, the functions \(d_i\) being bounded below and \(\mu _\pi \) being finite, the integral of the negative part of \(d_i\) is necessarily finite.

In the sequel, we will use the following notation. Let

$$\begin{aligned} \varvec{\mathcal {O}}=\{\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}}): \mu ({\mathbb {K}}^c)=0\quad \hbox {and}\quad \mu ^{\textbf{X}}=\nu +\mu \alpha Q\}, \end{aligned}$$

which corresponds to the set of finite occupation measures, and let

$$\begin{aligned} \varvec{\mathcal {O}}_{{d}}=\varvec{\mathcal {O}}\cap \{\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}}): \langle d_i,\mu \rangle \le k_i \,\hbox {for}\,1\le i\le q\} \end{aligned}$$

which stands for the measures in \(\varvec{\mathcal {O}}\) which, in addition, satisfy the inequalities for the constraint functions \({\textbf{d}}\). Observe that all the equations and inequalities defining the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\) are linear.

By Assumption 2.1(c), there is no loss of generality in imposing, in the definition of the optimization problem (CP), that minimization is carried out over all policies \(\pi \in \mathbf {\Pi }\) that satisfy the constraints and are such that \(V_0(\pi ,\nu )\) is finite. In view of Theorem 3.1, we have the following characterization.

Corollary 3.1

Under Assumption 2.1, the constrained problem (CP) is equivalent to the linear programming problem

$$\begin{aligned} (LP):\quad \min \ \langle c,\mu \rangle \quad \hbox {subject to}\quad \mu \in \varvec{\mathcal {O}}_{{d}}. \end{aligned}$$
(20)

Once we have stated (CP) as a linear programming problem, the plan of the proof of the existence of an optimal stationary policy consists in endowing \(\varvec{\mathcal {O}}\) with an adequate topology under which it becomes compact and \(\langle c,\mu \rangle \) is a lower semicontinuous function, whose infimum is therefore attained.
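Before developing the topological machinery, it may help to see the finite-dimensional analogue of this plan. The following sketch solves (20) with an off-the-shelf LP solver and recovers a stationary policy by disintegration, as in Theorem 3.1(iii); all model data are hypothetical toy inputs.

```python
# Solving the finite-dimensional analogue of (LP) in (20) and recovering a
# stationary policy by disintegration; all model data are hypothetical, and
# k1 is chosen large enough that this toy instance is surely feasible
# (total mass <= 1/(1 - 0.8) = 5 and d_1 <= 1).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
nS, nA = 3, 2
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))     # Q(y|x,a)
alpha = rng.uniform(0.1, 0.8, size=(nS, nA))      # alpha(x,a)
c = rng.uniform(0.5, 2.0, size=(nS, nA))          # cost c(x,a)
d1 = rng.uniform(0.0, 1.0, size=(nS, nA))         # one constraint function d_1
k1 = 5.0                                          # constraint level k_1
nu = np.full(nS, 1.0 / nS)

# Variables: mu(x,a) >= 0 flattened in C order (index x*nA + a).
# Equalities: sum_a mu(y,a) - sum_{x,a} mu(x,a) alpha(x,a) Q(y|x,a) = nu(y).
A_eq = np.zeros((nS, nS * nA))
for y in range(nS):
    for x in range(nS):
        for a in range(nA):
            A_eq[y, x * nA + a] = float(y == x) - alpha[x, a] * Q[x, a, y]

res = linprog(c=c.ravel(),
              A_ub=d1.reshape(1, -1), b_ub=[k1],  # <d_1, mu> <= k_1
              A_eq=A_eq, b_eq=nu, bounds=(0, None))
mu = res.x.reshape(nS, nA)
mu_X = mu.sum(axis=1)
phi = mu / np.where(mu_X[:, None] > 0, mu_X[:, None], 1.0)  # disintegration;
# rows with mu_X(x) = 0 (unreached states) may carry any kernel.
print("optimal value <c, mu> =", res.fun)
```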

Topologies on measure spaces. In the sequel, we suppose that Assumptions 2.1 and 2.2 are satisfied. We define the following topologies on the sets \({\mathcal {M}}^f({\textbf{X}})\) and \({\mathcal {M}}^f({\textbf{A}})\) of finite measures on \({\textbf{X}}\) and \({\textbf{A}}\), respectively.

Definition 3.2

  1. (a)

    The strong topology (s-topology) on \({\mathcal {M}}^f({\textbf{X}})\) is the coarsest topology for which the mappings \(\mu \mapsto \mu (D)\) are continuous for every \(D\in {\mathcal {B}}({\textbf{X}})\).

  2. (b)

    The weak topology (in short, the w-topology) on \({\mathcal {M}}^f({\textbf{A}})\) is the coarsest topology for which the mappings \(\mu \mapsto \int _{{\textbf{A}}}fd\mu \) are continuous for every bounded and continuous function \(f:{\textbf{A}}\rightarrow {\mathbb {R}}\).

The next result is a useful characterization of relative compactness in the s-topology. For further details see [37, Lemma 3.5].

Proposition 3.2

For any \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{X}})\), the following statements are equivalent:

  1. (i)

    \(\Gamma \) is relatively compact in the s-topology.

  2. (ii)

    For any sequence \(\{f_n\}\) of bounded measurable functions on \({\textbf{X}}\) such that \(f_n\downarrow 0\) pointwise,

    $$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\mu \in \Gamma }\int _{{\textbf{X}}}f_nd\mu = 0. \end{aligned}$$

Finally, we define a topology on the set \({\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) of finite measures on \({\textbf{X}}\times {\textbf{A}}\) when endowed with the product \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\otimes {\mathcal {B}}({\textbf{A}})\). We recall that \(Car_b({\textbf{X}}\times {\textbf{A}})\) is the set of real-valued bounded Carathéodory functions on \({\textbf{X}}\times {\textbf{A}}\).

Definition 3.3

The weak-strong (ws) topology on \({\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) is the coarsest topology for which the mappings \(\mu \mapsto \int _{{\textbf{X}}\times {\textbf{A}}} fd\mu \) are continuous for every function \(f\in Car_b({\textbf{X}}\times {\textbf{A}})\).

We will write \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) when the sequence \(\{\mu _n\}\) converges to \(\mu \) in the weak-strong topology.

Young measures. Recall that \(\lambda \in {\mathcal {P}}({\textbf{X}})\) has been introduced in Assumption 2.2(b). On the set \(\mathbf {\Phi }\) of all stationary policies we define the following equivalence relation

$$\begin{aligned} \varphi \sim \varphi '\quad \hbox {when}\quad \varphi (\cdot |x)=\varphi '(\cdot |x) \quad \hbox {for }\lambda \hbox {-almost every}\quad x\in {\textbf{X}}. \end{aligned}$$

We denote by \(\varvec{\mathcal {Y}}\) the family of all equivalence classes and we will say that an element in \(\varvec{\mathcal {Y}}\) (i.e., an equivalence class) is a Young measure.

Definition 3.4

The Young topology on \(\varvec{\mathcal {Y}}\) is the coarsest topology for which the mappings

$$\begin{aligned} \varphi \mapsto \int _{{\textbf{X}}}\int _{\textbf{A}}f(x,a)\varphi (da|x)\lambda (dx) \end{aligned}$$

are continuous for every \(f\in Car_1({\textbf{X}}\times {\textbf{A}})\).

Note that the integral in the above definition does not depend on the representative \(\varphi \) chosen within the equivalence class in \(\varvec{\mathcal {Y}}\).

Lemma 3.1

If \(\varphi ,\varphi '\in \mathbf {\Phi }\) are in the same equivalence class of \(\varvec{\mathcal {Y}}\) then the probability measures \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \) coincide.

Proof

We will show, by induction on t, that the joint distribution of \((X_0,A_0,\ldots ,X_t,A_t)\) is the same under both \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \) for any \(t\in {\mathbb {N}}_0\).

For \(t=0\), note that

$$\begin{aligned} {\textbf{P}}^\varphi _\nu \{X_0\in B, A_0\in C\}= & {} \int _B\varphi (C|x) \nu (dx) = \int _B\varphi (C|x)\frac{d\nu }{d\lambda }(x) \lambda (dx)\\= & {} \int _B\varphi '(C|x)\frac{d\nu }{d\lambda }(x) \lambda (dx)= {\textbf{P}}^{\varphi '}_\nu \{X_0\in B, A_0\in C\} \end{aligned}$$

where we use the fact that \(\nu \ll \lambda \) and that the equality \(\varphi (C|x)=\varphi '(C|x)\) holds \(\lambda \)-a.s.

Suppose now that the stated result is true for some \(t\in {\mathbb {N}}_0\). We have

$$\begin{aligned}&{\textbf{P}}^\varphi _\nu \{X_{t+1}\in B, A_{t+1}\in C \mid H_t=h_t,A_t=a_t\} \\&\quad =\int _B \varphi (C|y)q(y|x_t,a_t)\lambda (dy)= \int _B \varphi '(C|y)q(y|x_t,a_t)\lambda (dy)\\&\quad = {\textbf{P}}^{\varphi '}_\nu \{X_{t+1}\in B, A_{t+1}\in C \mid H_t=h_t,A_t=a_t\}. \end{aligned}$$

Since the distribution of \((H_t,A_t)\) is the same under \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \), we conclude the result. \(\square \)

It follows from this lemma that two stationary policies \(\varphi ,\varphi '\) in the same equivalence class of \(\varvec{\mathcal {Y}}\) have the same strategic probability measure. They are thus indistinguishable and yield, in particular, the same expected payoffs and the same occupation measures. In the sequel, we shall therefore identify the family of stationary Markov policies \(\mathbf {\Phi }\) with the family of Young measures \(\varvec{\mathcal {Y}}\), and so we will refer to the Young topology on \(\mathbf {\Phi }\). We will write \(\varphi _n{\mathop {\longrightarrow }\limits ^{y}}\varphi \) when the sequence \(\{\varphi _n\}\) in \(\mathbf {\Phi }\) converges to \(\varphi \in \mathbf {\Phi }\) in the Young topology.

The space \(L^1({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) is separable (this is because \({\mathcal {B}}({\textbf{X}})\) is countably generated), and so by [5, Lemma 1] we have the following.

Proposition 3.3

The set of stationary policies \(\mathbf {\Phi }\) is a compact metric space for the Young topology.

Lemma 3.2

For any \(t\in {\mathbb {N}}_0\) and any functions \(W_0,\ldots ,W_t\) in \(Car_b({\textbf{X}}\times {\textbf{A}})\), we have that

$$\begin{aligned} \varphi \mapsto {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^t W_i(X_i,A_i)\right] ~ \hbox {is a continuous function on}~\mathbf {\Phi }. \end{aligned}$$

Proof

We shall consider the measurable product space \(({\textbf{X}}^{t+1},{\mathcal {B}}({\textbf{X}})\otimes \cdots \otimes {\mathcal {B}}({\textbf{X}}),\overline{\lambda })\) where the probability measure \({\overline{\lambda }}\in {\mathcal {P}}({\textbf{X}}^{t+1})\) is the \((t+1)\)-times product of \(\lambda \):

$$\begin{aligned} \overline{\lambda }(B_0\times \cdots \times B_t)=\prod _{j=0}^{t} \lambda (B_j)\quad \hbox {for}\quad B_0,\ldots ,B_t\in {\mathcal {B}}({\textbf{X}}). \end{aligned}$$

Similarly, considering the compact product Borel space \({\textbf{A}}^{t+1}\) of the action spaces, we shall be dealing with Carathéodory functions on \(({\textbf{X}}\times {\textbf{A}})^{t+1}\) which, for notational consistency, will rather be regarded as functions on \({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1}\).

Observe that the expectation in the statement of the lemma can be written in integral form as

$$\begin{aligned}&{\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^t W_i(X_i,A_i)\right] \nonumber \\&\quad = \int _{\textbf{X}}\int _{\textbf{A}}\cdots \int _{\textbf{X}}\int _{\textbf{A}}\left[ \prod _{i=0}^{t-1} W_{i+1}(x_{i+1},a_{i+1})q(x_{i+1}|x_{i},a_{i})\right] \cdot W_0(x_0,a_0)\cdot \frac{d\nu }{d\lambda }(x_0)\nonumber \\&\quad \cdot \varphi (da_t|x_t)\lambda (dx_t)\cdots \varphi (da_0|x_0)\lambda (dx_0) . \end{aligned}$$
(21)

Under our conditions (recall Assumption 2.2(b)) we have that the function F given by

$$\begin{aligned}&F(x_0,a_0,\ldots ,x_t,a_t)\nonumber \\&\quad = \left[ \prod _{i=0}^{t-1} W_{i+1}(x_{i+1},a_{i+1})q(x_{i+1}|x_{i},a_{i})\right] \cdot W_0(x_0,a_0)\cdot \frac{d\nu }{d\lambda }(x_0) \end{aligned}$$
(22)

is in \( Car({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\). Let us now show that we have \(F\in Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\). Let \({\textbf{w}}_i\) be a bound for \(W_i\), that is, \(|W_i(x,a)|\le {\textbf{w}}_i\) for any \((x,a)\in {\textbf{X}}\times {\textbf{A}}\). We also have, by Assumption 2.2(b), for every \(0\le i\le t-1\), \(x_i,x_{i+1}\in {\textbf{X}}\) and \(a_i\in {\textbf{A}}\)

$$\begin{aligned} q(x_{i+1}|x_i,a_i)\le q(x_{i+1}|x_i). \end{aligned}$$

Hence, we conclude that

$$\begin{aligned} |F(x_0,a_0,\ldots ,x_t,a_t)|\le G(x_0,\ldots ,x_t) \end{aligned}$$

for every \((x_0,a_0,\ldots ,x_t,a_t)\in ({\textbf{X}}\times {\textbf{A}})^{t+1}\), where G is defined on \({\textbf{X}}^{t+1}\) as

$$\begin{aligned} G(x_0,\ldots ,x_t)= \prod _{i=0}^t {\textbf{w}}_i \cdot \prod _{i=0}^{t-1} q(x_{i+1}|x_i)\cdot \frac{d\nu }{d\lambda }(x_0). \end{aligned}$$

It is now easy to see, integrating first with respect to \(\lambda (dx_t)\), then \(\lambda (dx_{t-1})\), etc., that

$$\begin{aligned} \int _{{\textbf{X}}^{t+1}} G(x_0,\ldots ,x_t)\lambda (dx_t)\cdots \lambda (dx_0)=\int _{{\textbf{X}}^{t+1}}Gd\overline{\lambda } \le {\textbf{q}}^{t+1}\prod _{i=0}^t {\textbf{w}}_i \end{aligned}$$

so that \(G\in L^1({\textbf{X}}^{t+1},{\mathcal {B}}({\textbf{X}})^{t+1},\overline{\lambda })\). This completes the proof that the function F is in \(Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\).

On the space \(({\textbf{X}}^{t+1},{\textbf{A}}^{t+1})\) we can consider the family \(\overline{\varvec{\mathcal {Y}}}\) of Young measures consisting of all stochastic kernels on \({\textbf{A}}^{t+1}\) given \({\textbf{X}}^{t+1}\), equipped with the corresponding Young topology (cf. Definition 3.4). Given any \(\varphi \in \mathbf {\Phi }\) we consider the tensor product \(\overline{\varphi }=\bigotimes _{i=0}^t \varphi \in \overline{\varvec{\mathcal {Y}}}\) of \(t+1\) copies of the Young measure \(\varphi \), defined as

$$\begin{aligned} \overline{\varphi }(da_0,\ldots ,da_t|x_0,\ldots ,x_t)=\varphi (da_0|x_0)\times \cdots \times \varphi (da_t|x_t). \end{aligned}$$

By continuity of the tensor product of Young measures (see [18, Theorem 3.90] or [4, Theorem 4.17]) we have that the mapping from \(\mathbf {\Phi }\) to \(\overline{\varvec{\mathcal {Y}}}\) given by \(\varphi \mapsto \overline{\varphi }\) is continuous.

Now we proceed with the proof of the lemma. Suppose that \(\{\varphi _n\}\) is a sequence in \(\mathbf {\Phi }\) with \(\varphi _n{\mathop {\longrightarrow }\limits ^{y}}\varphi \). By continuity of the tensor product we have \(\overline{\varphi }_n{\mathop {\longrightarrow }\limits ^{y}}\overline{\varphi }\) in the Young topology of \(\overline{\varvec{\mathcal {Y}}}\) and so, since F is in \(Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\),

$$\begin{aligned}&\int _{{\textbf{X}}^{t+1}}\int _{{\textbf{A}}^{t+1}} F(x_0,a_0,\ldots ,x_t,a_t)\overline{\varphi }_n(da_0,\ldots ,da_t)\overline{\lambda }(dx_0,\ldots ,dx_t)\\&\quad \rightarrow \int _{{\textbf{X}}^{t+1}}\int _{{\textbf{A}}^{t+1}} F(x_0,a_0,\ldots ,x_t,a_t)\overline{\varphi }(da_0,\ldots ,da_t)\overline{\lambda }(dx_0,\ldots ,dx_t). \end{aligned}$$

But the latter convergence is precisely (recall (21) and (22))

$$\begin{aligned} {\textbf{E}}^{\varphi _n}_\nu \left[ \prod _{i=0}^t W_i(X_i,A_i)\right] \rightarrow {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^t W_i(X_i,A_i)\right] , \end{aligned}$$

which completes the proof. \(\square \)
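To make the integral form (21) concrete, the following sketch (a toy illustration on made-up finite data, not part of the model of this paper) evaluates \({\textbf{E}}^\varphi _\nu [\prod _{i=0}^t W_i(X_i,A_i)]\) in two ways: by the nested-sum analogue of the iterated integrals in (21), computed via a backward recursion, and by direct Monte Carlo simulation of the controlled chain. All names and the randomly generated kernels are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
nX, nA, t_hor = 3, 2, 2                         # toy sizes (hypothetical)
nu = np.array([0.5, 0.3, 0.2])                  # initial distribution
phi = rng.dirichlet(np.ones(nA), nX)            # stationary policy phi(a|x)
Q = rng.dirichlet(np.ones(nX), (nX, nA))        # transition kernel q(y|x,a)
W = rng.uniform(0.5, 1.5, (t_hor + 1, nX, nA))  # bounded functions W_i(x,a)

def expectation_by_sums():
    # backward recursion through the nested integrals of (21)
    dp = np.ones(nX)
    for i in range(t_hor, -1, -1):
        new = np.zeros(nX)
        for x in range(nX):
            for a in range(nA):
                cont = dp @ Q[x, a] if i < t_hor else 1.0
                new[x] += phi[x, a] * W[i, x, a] * cont
        dp = new
    return nu @ dp

def expectation_by_simulation(n=100_000):
    total = 0.0
    for _ in range(n):
        x, prod = rng.choice(nX, p=nu), 1.0
        for i in range(t_hor + 1):
            a = rng.choice(nA, p=phi[x])
            prod *= W[i, x, a]
            x = rng.choice(nX, p=Q[x, a])
        total += prod
    return total / n

print(expectation_by_sums(), expectation_by_simulation())

The two printed values agree up to Monte Carlo error, which is exactly the content of (21) on a finite space.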

4 Proof of Theorem 2.3

In this section, we will show that the linear programming problem (LP) in (20) is solvable by using both (a) topological properties (namely, compactness) of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\), and (b) regularity properties of the mapping \(\mu \mapsto \langle c,\mu \rangle \). Then, from Corollary 3.1, we will derive the existence of a solution to the constrained problem (CP) in (6). In Sect. 5 we will provide mild sufficient conditions for Assumption 2.1(b) to hold.

We start this section by analyzing the compactness of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\). To this end, we need the following results, borrowed from [6], whose proofs can be found in Proposition 2.3 and Theorems 2.5 and 3.1 of that reference.

Lemma 4.1

Let \(({\textbf{U}},{\mathcal {U}})\) be a measurable space and \({\textbf{Z}}\) be a Polish space.

  1. (i)

    Assume that \({\mathcal {U}}\) is countably generated. Every \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\) such that \(\Gamma ^{{\textbf{U}}}\) is relatively compact in the s-topology is metrizable in the ws-topology.

  2. (ii)

    Given \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\), the following statements are equivalent.

    1. (a)

      \(\Gamma ^{{\textbf{U}}}\subseteq {\mathcal {M}}^f({\textbf{U}})\) is relatively compact in the s-topology and \(\Gamma ^{{\textbf{Z}}}\subseteq {\mathcal {M}}^f({\textbf{Z}})\) is tight.

    2. (b)

      \(\Gamma \) is tight in the ws-topology.

    3. (c)

      Every sequence in \(\Gamma \) is tight in the ws-topology.

    4. (d)

      \(\Gamma \) is relatively sequentially compact in the ws-topology.

    5. (e)

      \(\Gamma ^{{\textbf{U}}}\subseteq {\mathcal {M}}^f({\textbf{U}})\) is relatively compact in the s-topology and \(\Gamma ^{{\textbf{Z}}}\subseteq {\mathcal {M}}^f({\textbf{Z}})\) is relatively sequentially compact in the w-topology.

  3. (iii)

    Given \(\{\mu _n\}\) and \(\mu \) in \({\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\), we have that \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) in \({\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\) if and only if

    $$\begin{aligned} \liminf _{n\rightarrow \infty }\int _{{\textbf{U}}\times {\textbf{Z}}}gd\mu _n\ge \int _{{\textbf{U}}\times {\textbf{Z}}}gd\mu \end{aligned}$$

    for every real-valued measurable function g on \({\textbf{U}}\times {\textbf{Z}}\) such that \(z\mapsto g(u,z)\) is l.s.c. on \({\textbf{Z}}\) for each \(u\in {\textbf{U}}\), and such that

    $$\begin{aligned} \lim _{\gamma \rightarrow \infty }\sup _{n\ge 1}\int _{\{g\le -\gamma \}}\max (-g,0)d\mu _n=0 \end{aligned}$$

    (in particular, any bounded below g satisfies the latter condition).

Proposition 4.1

Under Assumptions 2.1 and 2.2, the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\) are compact metric spaces for the ws-topology.

Proof

The proof proceeds in several steps.

  1. (i)

    The set \(\varvec{\mathcal {O}}^{\textbf{X}}\) is relatively compact in the s-topology. To show this result, we will use the characterization given in Proposition 3.2. Namely, let \(\{f_n\}\) be an arbitrary sequence of bounded measurable functions on \({\textbf{X}}\) decreasing to the null function. We want to show that

    $$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\mu \in \varvec{\mathcal {O}}} \int _{\textbf{X}}f_n d\mu ^{\textbf{X}}=0.\end{aligned}$$

    Recalling that any \(\mu \in \varvec{\mathcal {O}}\) is the occupation measure of some stationary policy \(\varphi \in \mathbf {\Phi }\), it suffices to show that

    $$\begin{aligned} \lim _{n\rightarrow \infty } \sup _{\varphi \in \mathbf {\Phi }} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=0}^\infty f_n(X_t)\prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] =0. \end{aligned}$$
    (23)

    Choose now an arbitrary stationary policy \(\varphi \in \mathbf {\Phi }\). Observe that, for any integer k and some constant \({\textbf{f}}>0\) bounding every \(f_n\) (such a bound exists because the sequence \(\{f_n\}\) is decreasing and consists of bounded functions),

    $$\begin{aligned}&{\textbf{E}}^\varphi _\nu \left[ \sum _{t=0}^\infty f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] \nonumber \\&\quad \le \sum _{t=0}^k {\textbf{E}}^\varphi _\nu [f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)] + {\textbf{f}}\cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] . \end{aligned}$$
    (24)

    Regarding the rightmost term in (24),

    $$\begin{aligned}&{\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] = {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\right] \\&\quad = {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i) \cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\mid H_{k+1} \right] \right] . \end{aligned}$$

    Observe now that

    $$\begin{aligned} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\mid H_{k+1}=h_{k+1} \right] = {\textbf{E}}^\varphi _{x_{k+1}}\left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \le \frac{1+t_0/\beta }{\epsilon }, \end{aligned}$$

    by Proposition 2.1(ii), so that

    $$\begin{aligned} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] \le \frac{1+t_0/\beta }{\epsilon }\cdot {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] . \end{aligned}$$

    Since the series

    $$\begin{aligned} \sum _{k=0}^\infty {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] \end{aligned}$$

    converges (Proposition 2.1(i)), it follows that for any \(\varphi \in \mathbf {\Phi }\) we have

    $$\begin{aligned} \lim _{k\rightarrow \infty } {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] =0 \end{aligned}$$

    and the convergence is monotone: since \(0\le \alpha \le 1\), the terms decrease in k. The mapping on \(\mathbf {\Phi }\) given by \(\displaystyle \varphi \mapsto {\textbf{E}}^\varphi _\nu [\prod _{i=0}^{k} \alpha (X_i,A_i)]\) is continuous (Lemma 3.2) and, since \(\mathbf {\Phi }\) is compact, it follows from Dini’s theorem that the convergence is uniform on \(\mathbf {\Phi }\). Therefore,

    $$\begin{aligned} \lim _{k\rightarrow \infty } \sup _{\varphi \in \mathbf {\Phi }} {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] =0 \end{aligned}$$

    and so the rightmost term in (24) converges to 0 uniformly on \(\mathbf {\Phi }\). Consequently, given some \(\varepsilon >0\) we can find \(k_0\) such that, for any \(\varphi \in \mathbf {\Phi }\) we have

    $$\begin{aligned} {\textbf{f}}\cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k_0+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] <\varepsilon /2. \end{aligned}$$
    (25)

    Now we turn our attention to the leftmost term in (24) for \(k_0\). For any fixed \(0\le t\le k_0\), the mapping

    $$\begin{aligned} \varphi \mapsto {\textbf{E}}^\varphi _\nu [f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)] \end{aligned}$$

    is continuous on \(\mathbf {\Phi }\) (Lemma 3.2) and, by dominated convergence, it converges monotonically to zero as \(n\rightarrow \infty \). Using Dini’s theorem again, we obtain that convergence is uniform on \(\mathbf {\Phi }\). Hence, we can find some \(n_0\) such that \(n\ge n_0\) implies that

    $$\begin{aligned} {\textbf{E}}^\varphi _\nu \left[ f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] \le \frac{\varepsilon }{2(k_0+1)}\quad \hbox {for any}\quad \varphi \in \mathbf {\Phi }~\hbox {and}~0\le t\le k_0, \end{aligned}$$

    so that the sum of these terms over \(0\le t\le k_0\) is at most \(\varepsilon /2\).

    Combined with (25), this completes the proof of (23), and this establishes that \(\varvec{\mathcal {O}}^{\textbf{X}}\) is indeed relatively s-compact. Once this is known, we can invoke Lemma 4.1(i) to conclude that \(\varvec{\mathcal {O}}\) is metrizable in the ws-topology.

  2. (ii)

    The set \(\varvec{\mathcal {O}}^{\textbf{A}}\) is relatively w-compact. This is obvious because \({\textbf{A}}\) is a compact metric space and \(\varvec{\mathcal {O}}\) is bounded; recall Theorem 3.1(iv).

From items (i) and (ii), and Lemma 4.1(ii) we derive that \(\varvec{\mathcal {O}}\) is relatively compact for the ws-topology. To prove that \(\varvec{\mathcal {O}}\) is compact in this topology it only remains to show the next result.

  1. (iii)

    \(\varvec{\mathcal {O}}\) is closed. To see this, suppose that \(\{\mu _n\}\) is a sequence in \(\varvec{\mathcal {O}}\) that converges to some \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\), i.e., \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \). We must show that \(\mu \in \varvec{\mathcal {O}}\). First of all, we will show that \(\mu ^{\textbf{X}}=\nu +\mu \alpha Q\). By hypothesis, given any \(B\in {\mathcal {B}}({\textbf{X}})\) and \(n\in {\mathbb {N}}\), we have

    $$\begin{aligned} \mu _n^{\textbf{X}}(B)=\int _{{\textbf{X}}\times {\textbf{A}}} {\textbf{I}}_{B\times {\textbf{A}}}(x,a)\mu _n(dx,da)=\nu (B)+\int _{{\textbf{X}}\times {\textbf{A}}}Q(B|x,a)\alpha (x,a)\mu _n(dx,da). \end{aligned}$$

    Using Assumption 2.2(b)–(c) it is easy to show that the functions \(Q(B|x,a)\) and \(\alpha (x,a)\) are in \(Car_b({\textbf{X}}\times {\textbf{A}})\), and the same holds for \({\textbf{I}}_{B\times {\textbf{A}}}\). Hence, since \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \), we can take the limit as \(n\rightarrow \infty \) to conclude that \(\mu _n^{\textbf{X}}(B)=\int {\textbf{I}}_{B\times {\textbf{A}}}d\mu _n\rightarrow \int {\textbf{I}}_{B\times {\textbf{A}}}d\mu =\mu ^{\textbf{X}}(B)\), and so

    $$\begin{aligned} \mu ^{\textbf{X}}(B)=\nu (B)+ \int _{{\textbf{X}}\times {\textbf{A}}}Q(B|x,a)\alpha (x,a)\mu (dx,da)\quad \hbox {for every}\quad B\in {\mathcal {B}}({\textbf{X}}), \end{aligned}$$

    as we wanted to prove.

    Let us now show that \(\mu ({\mathbb {K}}^c)=0\). Since the action sets \({\textbf{A}}(x)\) are compact, the function \((x,a)\mapsto {\textbf{I}}_{{\mathbb {K}}^c}(x,a)\) is measurable and nonnegative, and it is l.s.c. on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\). Using Lemma 4.1(iii), we have that

    $$\begin{aligned} 0=\liminf _{n\rightarrow \infty } \mu _n({\mathbb {K}}^c)\ge \mu ({\mathbb {K}}^c) \end{aligned}$$

    and the result follows because \(\mu \) is a nonnegative measure. This completes the proof that \(\varvec{\mathcal {O}}\) is closed in the ws-topology.

Combining (i)–(iii) above we conclude that \(\varvec{\mathcal {O}}\) is a compact metric space in the ws-topology. To complete the proof of the proposition, it remains to establish the next statement.

  1. (iv)

    The set \(\varvec{\mathcal {O}}_d\) is compact. If we show that \(\varvec{\mathcal {O}}_d\) is closed then, since it is contained in the compact set \(\varvec{\mathcal {O}}\), it will follow that it is compact as well. To see that \(\varvec{\mathcal {O}}_d\) is closed, suppose that \(\{\mu _n\}\) is a sequence in \(\varvec{\mathcal {O}}_d\) that converges in the ws-topology to some \(\mu \in \varvec{\mathcal {O}}\). Applying Lemma 4.1(iii) again, since the functions \(d_i\) are bounded below and l.s.c. on \({\textbf{A}}\) for fixed \(x\in {\textbf{X}}\), we have that \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) implies

    $$\begin{aligned} \liminf _{n\rightarrow \infty } \langle d_i,\mu _n\rangle \ge \langle d_i,\mu \rangle . \end{aligned}$$

    Therefore, we have \(\langle d_i,\mu \rangle \le k_i\) for each \(1\le i\le q\), and so \(\varvec{\mathcal {O}}_d\) is closed and, hence, compact.

The proof of the proposition is now complete. \(\square \)

We also have the following result, whose proof directly follows from Lemma 4.1(iii).

Proposition 4.2

Under Assumptions 2.1 and 2.2, the function \(\mu \mapsto \langle c,\mu \rangle \) is l.s.c. on \(\varvec{\mathcal {O}}\).

We can now derive the main result of the paper.

Proof of Theorem 2.3

We have that the linear programming problem (LP) in (20) is solvable because \(\varvec{\mathcal {O}}_d\) is compact (Proposition 4.1) and \(\mu \mapsto \langle c,\mu \rangle \) is l.s.c. (Proposition 4.2). If \(\mu ^*\in \varvec{\mathcal {O}}_d\) is an optimal solution of (LP), then \(\varphi ^*\in \mathbf {\Phi }\) such that \(\mu ^*=\mu _{\varphi ^*}\) is, by Corollary 3.1, an optimal control policy for the constrained problem (CP). \(\square \)
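As an illustration of the method just used, the following sketch solves a finite state-action analogue of the linear program (LP): the variables are the values \(\mu (x,a)\) of an occupation measure, the equality constraints are the balance equations \(\mu ^{\textbf{X}}=\nu +\mu \alpha Q\), and a policy is recovered by disintegrating the optimal measure, in the spirit of Corollary 3.1. This is a toy computation on randomly generated data (and it assumes SciPy's linprog is available), not the algorithmic content of the paper.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
nX, nA = 3, 2
nu = np.array([0.6, 0.3, 0.1])                # initial distribution
Q = rng.dirichlet(np.ones(nX), (nX, nA))      # transition kernel Q(y|x,a)
alpha = rng.uniform(0.3, 0.9, (nX, nA))       # state-action discount factors
c = rng.uniform(0.0, 1.0, (nX, nA))           # cost to be minimized
d = rng.uniform(0.0, 1.0, (nX, nA))           # constraint cost

# Constraint level chosen so that (LP) is feasible: use the occupation
# measure of the uniform stationary policy as a reference point.
phi_u = np.full((nX, nA), 1.0 / nA)
M = np.einsum("xa,xa,xay->xy", phi_u, alpha, Q)
mu_x = np.linalg.solve(np.eye(nX) - M.T, nu)  # mu^X of the uniform policy
k = float((d * mu_x[:, None] * phi_u).sum())

# Balance equations: sum_a mu(y,a) - sum_{x,a} alpha(x,a)Q(y|x,a)mu(x,a) = nu(y)
A_eq = np.zeros((nX, nX * nA))
for y in range(nX):
    for x in range(nX):
        for a in range(nA):
            A_eq[y, x * nA + a] = float(x == y) - alpha[x, a] * Q[x, a, y]

res = linprog(c.ravel(), A_ub=[d.ravel()], b_ub=[k],
              A_eq=A_eq, b_eq=nu, bounds=(0, None))
mu = res.x.reshape(nX, nA)
phi_opt = mu / mu.sum(axis=1, keepdims=True)  # disintegration: phi*(a|x)
print("optimal value:", res.fun, "constraint:", (d * mu).sum(), "<=", k)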

We conclude this section with some comments on related approaches used in the literature, and with a discussion on approximating techniques for the problem (LP).

Remark 4.1

  1. (a)

    The method employed in this section is also known in the literature as the direct method. To the best of our knowledge, earlier works using this method need the assumption that c is nonnegative (or, at least, bounded below). In the present work, the non-negativity property ensures the existence of finite occupation measures \(\mu _\pi \) for arbitrary policies \(\pi \in \mathbf {\Pi }\); see the proof of Theorem 3.1(ii), which in turn uses the arguments in the proof of Proposition 2.1. Thus, the case of c being negative and/or unbounded has not been previously studied with the direct method, and we believe it is a nontrivial task. There exist, however, previous works dealing with costs of this latter kind. For instance, the papers [19, 20] re-formulate the space \(\varvec{\mathcal {O}}_d\) as the restriction of a bigger linear program, so that, with the help of its corresponding dual problem, the existence of both the value function and an optimal policy can be obtained. More recently, the paper [32] assumes unbounded and possibly negative costs and a state-action dependent discount factor, although the optimality results there were obtained through the well-known dynamic programming method.

  2. (b)

    As was argued earlier, the approach ensuring the existence of optimal control policies is based on the compactness of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\), together with the semicontinuity of \(\mu \mapsto \langle c,\mu \rangle \), shown in Propositions 4.1 and 4.2, respectively. Our assumptions allowed us to obtain such optimality results by using the ws-topology and the Young topology. It is worth mentioning that previous works addressing this same problem for the case when the discount factor is a given constant \(\alpha <1\) have applied an approach similar to ours, but whose main difference lies in the inf-compactness property of c (we do not make such an assumption in this paper); see, for instance, [20, 21, 28]. The approach in the aforementioned references is to apply the weak topology on a corresponding version of \(\varvec{\mathcal {O}}\) (rather than the topologies used here), so that it is proved that: (a) \(\varvec{\mathcal {O}}\) is weakly closed, and (b) a certain “minimizing family” of measures in \(\varvec{\mathcal {O}}\) is relatively compact thanks to Prokhorov’s theorem. In addition, the semicontinuity of the mapping \(\mu \mapsto \langle c,\mu \rangle \) is also ensured. We mention here the main advantages and disadvantages of these two approaches: on one side, the assumption that c is inf-compact excludes the very simple case of c being bounded. Since we do not make this assumption, our model covers this case (as well as many others), but we pay the price of assuming both (a) the compactness of the sets \({\textbf{A}}\) and \({\textbf{A}}(x)\) for \(x\in {\textbf{X}}\), and (b) the existence of a density function q with suitable properties; see Assumption 2.2.

  3. (c)

    Obtaining explicitly the optimal policy of the constrained problem (CP) requires solving the infinite-dimensional linear programming problem (LP) over a set of measures. It is not possible, in general, to give an exact solution to this infinite-dimensional program, and approximation techniques become necessary. A detailed exposition of the techniques that could be used in this context is beyond the scope of this paper, but let us mention some existing procedures. Finite approximations using approximating sequences and duality, as well as aggregation–relaxation techniques combined with inner approximations, are discussed in [26, Chapter 12]. Other methods rely on a suitable discretization of the state and action spaces: the action space is discretized by using the metric on \({\textbf{A}}\), while the state space \({\textbf{X}}\) is discretized by probabilistic methods. More precisely, the underlying probability measure \(\lambda \) is approximated by some probability measure \(\lambda _N\) on \({\textbf{X}}\) with finite support, and this support is taken as the discretization of \({\textbf{X}}\). The discrete measure \(\lambda _N\) can be obtained by quantization, i.e., by making the best \(L^p\) approximation of \(\lambda \) by a probability measure supported on a given number N of points (see [9] for the results obtained with this technique on a constrained discounted MDP), or by stochastic approximation, letting \(\lambda _N\) be the empirical probability measure of a size-N sample of \(\lambda \); the latter yields convergence rates stated in terms of Wasserstein distances, as seen in [10], also for a constrained discounted MDP. A minimal numerical sketch of these two discretizations is given after this remark.
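The following sketch (not taken from the references above) illustrates, for a one-dimensional state space with \(\lambda \) uniform on [0, 1], the two discretizations just described: the empirical measure of a size-N sample, and an N-point quantizer computed with a few Lloyd (k-means) iterations as a stand-in for an exact best-\(L^2\) quantizer.

import numpy as np

rng = np.random.default_rng(2)
N = 10                                        # support size (hypothetical)
samples = rng.uniform(0.0, 1.0, 5000)         # draws from lambda

# (1) Stochastic approximation: lambda_N = empirical measure of N draws.
support_emp = np.sort(rng.uniform(0.0, 1.0, N))
weights_emp = np.full(N, 1.0 / N)

# (2) Quantization: approximate the best L^2 N-point quantizer by Lloyd
#     iterations (nearest-center assignment, then recenter) on the sample.
centers = np.sort(rng.choice(samples, N, replace=False))
for _ in range(50):
    idx = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
    centers = np.array([samples[idx == k].mean() if np.any(idx == k)
                        else centers[k] for k in range(N)])
weights_q = np.bincount(idx, minlength=N) / len(samples)

print("empirical lambda_N:", support_emp)
print("quantizer points:", np.sort(centers), "weights:", weights_q)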

5 Further Results

In this section, we present some complementary results as well as some applications.

5.1 Sufficient Conditions for Assumption 2.1(b)

Among our conditions (Assumptions 2.1 and 2.2), the most restrictive one is Assumption 2.1(b), because it requires the probability of reaching \(L_\epsilon \) to be bounded away from zero uniformly in the initial state \((x,a)\in U_\epsilon \) and in the policy \(\varphi \in \mathbf {\Phi }\). Next, we propose some additional conditions under which it is satisfied; in particular, we will only need the probability of reaching \(L_\epsilon \) to be positive for each single initial state \((x,a)\in U_\epsilon \) and each single stationary policy \(\varphi \in \mathbf {\Phi }\).

For fixed \(x\in {\textbf{X}}\), consider the x-section \(U_\epsilon (x)=\{a\in {\textbf{A}}:\alpha (x,a)\ge 1-\epsilon \}\). In addition to the previous hypotheses, we impose the following conditions.

Assumption 5.1

  1. (a)

    The discount factor function \(\alpha \) is u.s.c. on \({\textbf{X}}\times {\textbf{A}}\).

  2. (b)

    The density function is \(L^1\)-continuous: \((x_n,a_n)\rightarrow (x,a)\) implies \(q(\cdot |x_n,a_n){\mathop {\longrightarrow }\limits ^{L^1}} q(\cdot |x,a)\).

  3. (c)

    For \(\epsilon >0\) in Assumption 2.1, the set \(U_\epsilon ^{\textbf{X}}=\{x\in {\textbf{X}}: U_\epsilon (x)\ne \emptyset \}\) is compact.

Proposition 5.1

Suppose that Assumptions 2.1 (except part (b)), 2.2, and 5.1 are satisfied. If there exists \(t_0\in {\mathbb {N}}\) such that

$$\begin{aligned} {\textbf{P}}^{\varphi }_{x,a}\{ (X_{t_0},A_{t_0})\in L_\epsilon \}>0 \end{aligned}$$

for any \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), then Assumption 2.1(b) holds.

Proof

By Assumption 5.1(a) we have that \(U_\epsilon \) is closed. Since it is contained in the compact set \(U_\epsilon ^{\textbf{X}}\times {\textbf{A}}\) (recall Assumption 5.1(c)), it follows that it is compact as well.

Suppose that an initial state \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) is fixed and consider, for \(\varphi \in \mathbf {\Phi }\), the state-action Markov chain with transition probability \(\overline{Q}_\varphi (\cdot |x,a)\) and its successive compositions \(\overline{Q}^{\,t}_\varphi (\cdot |x,a)\) for any \(t\in {\mathbb {N}}\). Note that \(\overline{Q}^{\,t}_\varphi (\cdot |x,a)\) is in \({\mathcal {P}}({\textbf{X}}\times {\textbf{A}})\subseteq {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\), which we are assuming is endowed with the ws-topology. We are going to show that the mapping from \({\textbf{X}}\times {\textbf{A}}\times \mathbf {\Phi }\) to \({\mathcal {P}}({\textbf{X}}\times {\textbf{A}})\) given by

$$\begin{aligned} (x,a,\varphi )\mapsto \overline{Q}^{\,t}_\varphi (\cdot |x,a) \end{aligned}$$

is continuous for any \(t\in {\mathbb {N}}\), where \(\mathbf {\Phi }\) is endowed with the Young topology. Namely, we are going to show that for any \(t\in {\mathbb {N}}\)

$$\begin{aligned} x_n\rightarrow x,\quad a_n\rightarrow a,\quad \hbox {and}\quad \varphi _n{\mathop {\longrightarrow }\limits ^{y}}\varphi \quad \hbox {implies}\quad \overline{Q}^{\,t}_{\varphi _n}(\cdot |x_n,a_n){\mathop {\longrightarrow }\limits ^{ws}} \overline{Q}^{\,t}_\varphi (\cdot |x,a). \end{aligned}$$
(26)

Given any \(f\in Car_b({\textbf{X}}\times {\textbf{A}})\) we have that

$$\begin{aligned}&\int _{{\textbf{X}}\times {\textbf{A}}} f(y,z) \overline{Q}^{\, t}_{\varphi _n}(dy,dz|x_n,a_n) \\&\quad = \int _{\textbf{X}}\int _{\textbf{X}}\int _{\textbf{A}}f(y,z) \varphi _n(dz|y)Q_{\varphi _n}^{t-1}(dy|v)Q(dv|x_n,a_n) \end{aligned}$$

because the transition to the state at time 1 follows \(Q(dv|x_n,a_n)\), then we make \(t-1\) transitions according to the Markov chain \(Q_{\varphi _n}\) on \({\textbf{X}}\) and, finally, the action at time t is chosen according to \(\varphi _n\). Therefore,

$$\begin{aligned}&\int _{{\textbf{X}}\times {\textbf{A}}} f(y,z) \overline{Q}^{\, t}_{\varphi _n}(dy,dz|x_n,a_n)\nonumber \\&\quad =\int _{\textbf{X}}\left[ \int _{\textbf{X}}\int _{\textbf{A}}f(y,z) \varphi _n(dz|y)Q_{\varphi _n}^{t-1}(dy|v)\right] q(v|x_n,a_n)\lambda (dv)\nonumber \\&\quad = \int _{\textbf{X}}\left[ \int _{\textbf{X}}\int _{\textbf{A}}f(y,z) \varphi _n(dz|y)Q_{\varphi _n}^{t-1}(dy|v)\right] q(v|x,a)\lambda (dv)\end{aligned}$$
(27)
$$\begin{aligned}&\qquad +\int _{\textbf{X}}\left[ \int _{\textbf{X}}\int _{\textbf{A}}f(y,z) \varphi _n(dz|y)Q_{\varphi _n}^{t-1}(dy|v)\right] \nonumber \\&\qquad \cdot (q(v|x_n,a_n)-q(v|x,a)) \lambda (dv). \end{aligned}$$
(28)

Denote by \(h_n\) and h the functions in \(L^\infty ({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) given by

$$\begin{aligned} h_n(y)=\int _{\textbf{A}}f(y,z) \varphi _n(dz|y)\quad \hbox {and}\quad h(y)=\int _{\textbf{A}}f(y,z) \varphi (dz|y). \end{aligned}$$

By definition of the Young convergence it is clear that \(h_n\ {{\mathop {\rightharpoonup }\limits ^{*}}\ }h\) and so by Lemma 3.7 in [11], it follows that the expression within brackets in (27) satisfies (as functions in \(L^\infty ({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) in the variable \(v\in {\textbf{X}}\))

$$\begin{aligned} \int _{\textbf{X}}h_n(y)Q_{\varphi _n}^{t-1}(dy|v) \ {{\mathop {\rightharpoonup }\limits ^{*}}\ }\int _{\textbf{X}}h(y)Q_{\varphi }^{t-1}(dy|v). \end{aligned}$$

Since \(q(v|x,a)\) is in \(L^1({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) as a function of v for fixed (x, a), we conclude that the expression in (27) converges to

$$\begin{aligned}&\int _{\textbf{X}}\left[ \int _{\textbf{X}}\int _{\textbf{A}}f(y,z) \varphi (dz|y)Q_{\varphi }^{t-1}(dy|v)\right] q(v|x,a)\lambda (dv)\\&\quad = \int _{{\textbf{X}}\times {\textbf{A}}} f(y,z) \overline{Q}^{\, t}_{\varphi }(dy,dz|x,a). \end{aligned}$$

Regarding (28), the first expression within brackets is bounded and so this term converges to zero by Assumption 5.1(b). Consequently, (26) follows.

Observe that the function \((y,z)\mapsto {\textbf{I}}_{L_\epsilon }(y,z)\) on \({\textbf{X}}\times {\textbf{A}}\) is measurable and, for fixed \(y\in {\textbf{X}}\), it is l.s.c. in \(z\in {\textbf{A}}\) (this is because \(\alpha (y,z)\) is in \(Car_b({\textbf{X}}\times {\textbf{A}})\)). Hence, the function

$$\begin{aligned} (x,a,\varphi )\mapsto {\textbf{P}}^\varphi _{x,a}\{(X_{t_0},A_{t_0})\in L_\epsilon \}= \overline{Q}^{\, t_0}_{\varphi }(L_\epsilon |x,a) \end{aligned}$$
(29)

is l.s.c. as a consequence of Lemma 4.1(iii) and so it attains its minimum on the compact space \(U_\epsilon \times \mathbf {\Phi }\). Consequently, if \({\textbf{P}}^{\varphi }_{x,a}\{ (X_{t_0},A_{t_0})\in L_\epsilon \}>0\) for every \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), then

$$\begin{aligned} \inf _{(x,a)\in U_\epsilon }\inf _{\varphi \in \mathbf {\Phi }} {\textbf{P}}^{\varphi }_{x,a}\{ (X_{t_0},A_{t_0})\in L_\epsilon \}>0 \end{aligned}$$

as we wanted to prove. \(\square \)

5.2 Formulation as a Total Cost Markov Decision Process

For some purposes, it might be useful to formulate the MDP with varying discount factor as a total expected cost MDP with an absorbing state. More precisely, the control model described in Sect. 2 is augmented with a cemetery state denoted by \(\Delta \), so that

  • the state space is \({\hat{{\textbf{X}}}}={\textbf{X}}\cup \{\Delta \}\) with \(\Delta \) an isolated point,

  • the action space is \({\hat{{\textbf{A}}}}={\textbf{A}}\cup \{a_\Delta \}\) with \(a_\Delta \) an isolated point. We let \({\hat{{\textbf{A}}}}(x)={\textbf{A}}(x)\) for \(x\in {\textbf{X}}\) and \({\hat{{\textbf{A}}}}(\Delta )=\{a_\Delta \}\).

  • the state \(\Delta \) is absorbing and has zero cost, i.e.,

    $$\begin{aligned} {\hat{Q}}(\Delta |\Delta ,a_\Delta )=1\quad \hbox {and}\quad {\hat{c}}(\Delta ,a_\Delta )={\hat{d}}_i(\Delta ,a_\Delta )=0, \end{aligned}$$
  • the transition probability for \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) is given by

    $$\begin{aligned} {\hat{Q}}(B|x,a)=\alpha (x,a)Q(B|x,a)\quad \hbox {if}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad {\hat{Q}}(\Delta |x,a)=1-\alpha (x,a). \end{aligned}$$

    In particular, when taking action a in state x, the probability of not being absorbed by \(\Delta \) is \(\alpha (x,a)\) (a small simulation sketch of this transition mechanism is given after this list).

  • The initial distribution \(\nu \) of the system is supported on \({\textbf{X}}\), i.e., \(\nu ({\textbf{X}})=1\).
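A minimal simulation sketch of the augmented transition mechanism (with made-up dynamics; the discount factor is borrowed from the example of Sect. 5.3 purely for illustration): each step survives with probability \(\alpha (x,a)\) and moves according to Q, and is otherwise absorbed at \(\Delta \).

import numpy as np

rng = np.random.default_rng(3)
M = 1.0
DELTA = "Delta"                 # cemetery state (hypothetical encoding)

def alpha(x, a):
    # discount factor of Sect. 5.3, used here only as an example
    return min(1.0, 0.75 * (1.0 + a - x / M))

def sample_Q(x, a):
    # stand-in sampler for Q(.|x,a): a reflected random walk on [0, M]
    return min(max(x + rng.normal(0.1 - 0.2 * a, 0.05), 0.0), M)

def q_hat_step(x, a):
    # one transition of Q-hat: survive with probability alpha(x,a), else absorb
    return sample_Q(x, a) if rng.random() < alpha(x, a) else DELTA

x, a = 0.5, 0.2
for t in range(1000):
    x = q_hat_step(x, a)
    if x == DELTA:
        print("absorbed at time", t + 1)
        break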

Slightly abusing the notation, and since there is a single action at the cemetery state, we will also denote by \({\textbf{F}}\), \(\mathbf {\Phi }\), and \(\mathbf {\Pi }\) the sets of policies of this augmented model. The corresponding probability measure on the set \({\hat{\mathbf {\Omega }}}\) of augmented sample paths and the expectation operator are denoted by \({\hat{{\textbf{P}}}}^\pi _\nu \) and \({\hat{{\textbf{E}}}}^\pi _\nu \). We will use the same notations \(X_t,A_t\) for the state and action processes.

It is easy to see that the total cost MDP is equivalent to the original control model because we have (cf. (4)–(5))

$$\begin{aligned} V_{0}(\pi ,\nu )={\hat{{\textbf{E}}}}_{\nu }^{\pi }\left[ \sum _{t=0}^{\infty }c(X_{t},A_{t})\right] \quad \hbox {and}\quad V_{i}(\pi ,\nu )={\hat{{\textbf{E}}}}_{\nu }^{\pi }\left[ \sum _{t=0}^{\infty }d_i(X_{t},A_{t})\right] \end{aligned}$$

for \(0\le i\le q\). Letting \(T_\Delta \) be the hitting time of \(\Delta \), i.e.,

$$\begin{aligned} T_\Delta (\omega )=\min \{t\ge 0: x_t=\Delta \}\quad \hbox {for}\quad \omega \in {\hat{\mathbf {\Omega }}}, \end{aligned}$$

the above sums can be restricted to the set of indices \(0\le t<T_\Delta \).

It is very important to notice that for any \(\pi \in \mathbf {\Pi }\) and any integer \(t\ge 0\)

$$\begin{aligned} {\hat{{\textbf{P}}}}^\pi _\nu \{T_\Delta >t\}={\textbf{E}}^\pi _\nu \left[ \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] \end{aligned}$$

so that the total mass of the occupation measure of \(\pi \) (recall (16)) coincides with the expected absorption time:

$$\begin{aligned} {\hat{{\textbf{E}}}}^\pi _\nu [T_\Delta ]={\textbf{E}}^\pi _\nu \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] . \end{aligned}$$

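On a finite state-action space this identity can be checked directly: for a stationary policy \(\varphi \), the survival probabilities are powers of the sub-stochastic kernel \(M(x,y)=\sum _a\varphi (a|x)\alpha (x,a)Q(y|x,a)\), so that the expected absorption time equals \(\nu (I-M)^{-1}\textbf{1}\). The following toy sketch (hypothetical data) compares this closed form with a Monte Carlo estimate on the augmented chain.

import numpy as np

rng = np.random.default_rng(4)
nX, nA = 4, 3
nu = np.full(nX, 1.0 / nX)
phi = rng.dirichlet(np.ones(nA), nX)          # stationary policy
Q = rng.dirichlet(np.ones(nX), (nX, nA))      # transition kernel
alpha = rng.uniform(0.2, 0.9, (nX, nA))       # survival probabilities

# sub-stochastic "surviving" kernel and closed-form expected absorption time
M = np.einsum("xa,xa,xay->xy", phi, alpha, Q)
expected_T = nu @ np.linalg.solve(np.eye(nX) - M, np.ones(nX))

def simulate_T():
    x, t = rng.choice(nX, p=nu), 0
    while True:
        a = rng.choice(nA, p=phi[x])
        if rng.random() >= alpha[x, a]:
            return t + 1                       # absorbed at Delta
        x, t = rng.choice(nX, p=Q[x, a]), t + 1

mc = np.mean([simulate_T() for _ in range(50_000)])
print(expected_T, mc)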
According to [15], we will say that the total cost MDP is absorbing when

$$\begin{aligned} \sup _{\pi \in \mathbf {\Pi }} {\hat{{\textbf{E}}}}^\pi _\nu [T_\Delta ]<\infty \end{aligned}$$
(30)

and uniformly absorbing when it is absorbing and, in addition,

$$\begin{aligned} \lim _{k\rightarrow \infty }\sup _{\varphi \in \mathbf {\Phi }}\sum _{n=k}^\infty {\hat{{\textbf{P}}}}^\varphi _\nu \{T_\Delta >n\}=0. \end{aligned}$$
(31)

We note that, under our conditions, (30) is not necessarily satisfied: indeed, by Proposition 2.1(i) we have \({\hat{{\textbf{E}}}}^\pi _\nu [T_\Delta ]<\infty \) provided that \(V_0(\pi ,\nu )\) is finite. However, condition (31) does hold; it has been established, in fact, in the proof of Proposition 4.1: see Eq. (25). For related issues on uniformly absorbing MDPs, the interested reader can also consult [12, 36].

Existence of optimal chattering stationary policies. We say that a stationary policy \(\varphi \in \mathbf {\Phi }\) is m-chattering if, for some integer \(m\ge 1\), there exist measurable functions \(\gamma _1,\ldots ,\gamma _m:{\textbf{X}}\rightarrow [0,1]\) and deterministic stationary policies \(f_1,\ldots ,f_m\in {\textbf{F}}\) such that

$$\begin{aligned} \varphi (\cdot |x)=\sum _{i=1}^m \gamma _i(x)\delta _{f_i(x)}(\cdot ). \end{aligned}$$

This means that the random action taken by \(\varphi \) at state \(x\in {\textbf{X}}\) is a mixture of the actions \(f_i(x)\) for \(i=1,\ldots ,m\), though the coefficients of the mixture may vary with x.
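For illustration, a sampler for an m-chattering policy can be written in a few lines (a sketch with hypothetical weights and deterministic policies on \({\textbf{X}}={\textbf{A}}=[0,1]\)):

import numpy as np

rng = np.random.default_rng(5)

def make_chattering_policy(gammas, fs):
    # phi(.|x) = sum_i gamma_i(x) * delta_{f_i(x)}: draw the index i with
    # probability gamma_i(x), then play the deterministic action f_i(x)
    def phi(x):
        w = np.array([g(x) for g in gammas])
        i = rng.choice(len(fs), p=w / w.sum())
        return fs[i](x)
    return phi

f0 = lambda x: 0.0            # no service
f1 = lambda x: 1.0            # full service
g0 = lambda x: 1.0 - x        # weight of f0, larger for small states
g1 = lambda x: x              # weight of f1, larger for large states
phi = make_chattering_policy([g0, g1], [f0, f1])
print([phi(0.8) for _ in range(5)])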

Theorem 5.2

Suppose that Assumptions 2.1 and 2.2 hold and, in addition, we have:

  1. (i)

    for every \(\pi \in \mathbf {\Pi }\),

    $$\begin{aligned} {\textbf{E}}^\pi _\nu \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] <+\infty ; \end{aligned}$$
  2. (ii)

    the cost functions c and \(d_i\) are bounded.

Under these conditions, there exists a \((q+1)\)-chattering stationary Markovian policy in \(\mathbf {\Phi }\) which is optimal for the constrained problem (CP).

Proof

Condition (i), together with (31), ensures that the total cost MDP is uniformly absorbing. Since the cost functions c and \(d_i\) are bounded, the total cost MDP satisfies all the conditions in [16, Theorem 9.1]. Hence, if \(\varphi \in \mathbf {\Phi }\) is an optimal policy for the constrained problem (CP), then there exist \(q+1\) deterministic stationary policies \(f_0,f_1,\ldots ,f_q\in {\textbf{F}}\) and coefficients \(\beta _i\ge 0\) with \(\beta _0+\cdots +\beta _q=1\) such that the occupation measures \(\mu _\varphi \) and \(\mu _{f_i}\) satisfy

$$\begin{aligned} \mu _\varphi =\sum _{i=0}^q \beta _i \mu _{f_i}\quad \hbox {and thus}\quad \mu _\varphi ^{\textbf{X}}=\sum _{i=0}^q \beta _i \mu ^{\textbf{X}}_{f_i}. \end{aligned}$$

Without loss of generality, we can assume that all the \(\beta _i\) are strictly positive. This implies that \(\mu ^{\textbf{X}}_{f_i}\ll \mu _\varphi ^{\textbf{X}}\). Now we let

$$\begin{aligned} \gamma _i(x)=\beta _i\cdot \frac{d\mu ^{\textbf{X}}_{f_i}}{d\mu _\varphi ^{\textbf{X}}}(x)\quad \hbox {for}\quad x\in {\textbf{X}}\quad \hbox {and}\quad 0\le i\le q. \end{aligned}$$

Mimicking the arguments in Lemma 5 and Theorem 2 in [14], it follows that the chattering policy obtained from the \(\gamma _i(x)\) and the \(f_i\) satisfies the stated result. \(\square \)
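On a finite state space the Radon–Nikodym derivatives above reduce to ratios of point masses, and the weights \(\gamma _i\) can be computed directly. A minimal sketch with made-up occupation measures:

import numpy as np

# occupation measures mu^X_{f_i} of two deterministic policies (made-up data)
mu_f = np.array([[0.4, 0.8, 0.3],
                 [0.2, 0.5, 1.0]])
beta = np.array([0.6, 0.4])                   # mixture coefficients

mu_phi = beta @ mu_f                          # mu^X_phi = sum_i beta_i mu^X_{f_i}

# gamma_i(x) = beta_i * (d mu^X_{f_i} / d mu^X_phi)(x), a ratio of masses here
gamma = beta[:, None] * mu_f / mu_phi[None, :]
print(gamma)
print(gamma.sum(axis=0))                      # equals 1 at every state x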

Existence of optimal deterministic stationary policies. In [15], the authors prove, under fairly general conditions, that in an atomless discounted MDP (with constant discount factor), or in an atomless uniformly absorbing MDP, given a finite number of bounded cost functions, for every policy \(\pi \in \mathbf {\Pi }\) there exists some \(f\in {\textbf{F}}\) such that \(\pi \) and f yield the same expected costs, simultaneously, for all the given cost functions. The following result is, therefore, a direct consequence of [15, Theorem 3.8].

Theorem 5.3

Suppose that Assumptions 2.1 and 2.2, together with the conditions (i)–(ii) in Theorem 5.2 hold, and that the control model is atomless, meaning that

$$\begin{aligned} \nu (\{x\})=0\quad \hbox {and}\quad Q(\{y\}|x,a)=0 \end{aligned}$$

for every \((x,a)\in {\mathbb {K}}\) and \(y\in {\textbf{X}}\). Then there exists a deterministic stationary Markovian policy in \({\textbf{F}}\) which is optimal for the constrained problem (CP).

Proof

We have that the total cost MDP is uniformly absorbing. Letting \(\varphi \in \mathbf {\Phi }\) be an optimal policy for the constrained problem (CP), by [15, Theorem 3.8] there exists some \(f\in {\textbf{F}}\) with \(V_i(\varphi ,\nu )=V_i(f,\nu )\) for all \(i=0,\ldots ,q\). Hence, \(f\in {\textbf{F}}\) is also optimal for the constrained problem (CP). \(\square \)

5.3 Example: Queueing Systems with Failures

We consider a queueing system with finite capacity \(M>0\). The controller chooses the service rate \(a\in [0,1]\) over an infinite time horizon. Additionally, we allow for a fatal failure of the system, which can occur with a certain probability depending on the state of the system and the action chosen: for instance, we may assume that when the system is close to its maximal capacity and a low service rate is chosen, then failure of the system becomes more likely; failure can also occur due to random exogenous causes, or endogenous causes depending on the nature of the system. The goal of the controller is to minimize a total expected running cost subject to the constraint that the expected lifetime of the system is above some constant, together with some similar constraints on reward functions. Next, we show how this controlled queueing system falls into the framework of a constrained MDP with a non-constant discount factor.

The state space is \({\textbf{X}}=[0,M]\) for some constant \(M>0\) and the action space is \({\textbf{A}}=[0,1]\), with \({\textbf{A}}(x)={\textbf{A}}\) for every \(x\in {\textbf{X}}\). The state of the system stands for the size of the queue (ranging from 0 to the maximal capacity M), and the action models the service rate (from the absence of service at 0 to a maximal service rate at 1). The transition probability function has a density \(q(y|x,a)\) with respect to the uniform distribution \(\lambda \) on [0, M]. We assume that \(q(y|x,a)\) is continuous on \({\textbf{X}}\times {\textbf{X}}\times {\textbf{A}}\). Suppose that the initial distribution \(\nu \) of the system is absolutely continuous with respect to the Lebesgue measure on [0, M].

Let \(\alpha (x,a)\) be the probability that the system does not collapse when taking action \(a\in {\textbf{A}}\) in state \(x\in {\textbf{X}}\). Intuitively, a small value of x combined with a large value of a should yield values of \(\alpha (x,a)\) close to one, whereas large values of x combined with small values of a should lead to values of \(\alpha (x,a)\) close to zero. Obviously, many choices of a functional expression for \(\alpha (x,a)\) are compatible with these requirements. For illustration, we can choose the following

$$\begin{aligned} \alpha (x,a)=1\wedge \frac{3}{4}\Big (1+a-\frac{x}{M}\Big ). \end{aligned}$$

In particular, we have \(\alpha (x,a)=1\) whenever \(a-\frac{x}{M}\ge \frac{1}{3}\), which corresponds to the triangle in the upper left part of \({\textbf{X}}\times {\textbf{A}}\) with vertices (0, 1), (0, 1/3), and \((\frac{2}{3}M,1)\).
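A quick numerical check of this choice of \(\alpha \) (the capacity value is hypothetical; the assertions restate the properties just mentioned):

M = 10.0                                       # maximal capacity (hypothetical)

def alpha(x, a):
    # alpha(x,a) = min(1, (3/4)(1 + a - x/M))
    return min(1.0, 0.75 * (1.0 + a - x / M))

# alpha equals 1 exactly on the triangle a - x/M >= 1/3:
eps = 1e-12
assert abs(alpha(0.0, 1.0) - 1.0) < eps        # vertex (0, 1)
assert abs(alpha(0.0, 1.0 / 3.0) - 1.0) < eps  # vertex (0, 1/3)
assert abs(alpha(2 * M / 3, 1.0) - 1.0) < eps  # vertex (2M/3, 1)
assert alpha(M, 0.0) == 0.0                    # full queue, no service: failure
assert alpha(M, 1.0) == 0.75                   # full queue, full service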

Assume that the cost-per-stage function is a strictly positive function c on \({\textbf{X}}\times {\textbf{A}}\) and that we are also given some continuous reward function r on \({\textbf{X}}\times {\textbf{A}}\). The control problem is to minimize the total expected cost when managing this queueing system, with the constraint that the expected lifetime of the system is above some \(T_0\) and that the total expected reward is above some \(r_0\). Denoting by T the lifetime of the system, the control problem can be written

$$\begin{aligned} \min _{\pi \in \mathbf {\Pi }}\ {\textbf{E}}_{\nu }^{\pi }\left[ \sum _{t=0}^{T}c(X_{t},A_{t})\right] \end{aligned}$$

subject to

$$\begin{aligned} {\textbf{E}}_{\nu }^{\pi }[T]\ge T_0\quad \hbox {and}\quad {\textbf{E}}_{\nu }^{\pi }\left[ \sum _{t=0}^{T}r(X_{t},A_{t})\right] \ge r_0, \end{aligned}$$

which is indeed equivalent to (4)–(5) with \(d_1\equiv -1\) and \(d_2=-r\) (recall the formulation as a total cost MDP given in Sect. 5.2).
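To see how the three functionals can be estimated in practice, here is a toy simulation of the augmented chain of Sect. 5.2 for this queueing model (the dynamics, costs, rewards, and the constant policy are all hypothetical stand-ins):

import numpy as np

rng = np.random.default_rng(6)
M = 10.0

def alpha(x, a):
    return min(1.0, 0.75 * (1.0 + a - x / M))

def step(x, a):
    # stand-in transition: arrivals minus service, reflected into [0, M]
    return min(max(x + rng.uniform(0.0, 2.0) - 2.0 * a, 0.0), M)

def evaluate(policy, c, r, n=20_000):
    # estimate (total cost, E[T], total reward) by simulating to absorption
    tot_c = tot_T = tot_r = 0.0
    for _ in range(n):
        x, t = rng.uniform(0.0, M), 0          # nu: uniform on [0, M]
        while True:
            a = policy(x)
            tot_c += c(x, a); tot_r += r(x, a); t += 1
            if rng.random() >= alpha(x, a):    # system failure
                break
            x = step(x, a)
        tot_T += t
    return tot_c / n, tot_T / n, tot_r / n

c = lambda x, a: 0.1 + a                       # running cost of the service
r = lambda x, a: 1.0 if x < M / 2 else 0.0     # reward for short queues
V0, ET, Vr = evaluate(lambda x: 0.2, c, r)     # constant service rate 0.2
print("cost:", V0, "expected lifetime:", ET, "reward:", Vr)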

Let us now check our conditions on this model. It is clear that Assumption 2.1(a) holds, while (c) is satisfied by letting \(\pi _0\) be the policy that always takes action \(a=0\): indeed, we have \(\alpha (x,0)\le 3/4\) for every \(x\in {\textbf{X}}\) and thus \(E_\nu ^{\pi _0}[T]\le 4\), which implies statement (c) (see the display after this paragraph). Regarding Assumption 2.1(b), we use Proposition 5.1. Conditions (a)–(b) in Assumption 5.1 are clearly satisfied, and Assumption 5.1(c) holds as well: since \(\alpha \) is continuous and \({\textbf{A}}\) is compact, the set \(U_\epsilon ^{\textbf{X}}\) is a closed subset of the compact interval [0, M], hence compact, for any \(\epsilon >0\). Concerning the condition in Proposition 5.1, it is satisfied for \(t_0=1\) and any \(\epsilon <1/4\) provided that for every \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) there exists some \(2M/3<y\le M\) with \(q(y|x,a)>0\). Under the above conditions, the queueing system with constraints possesses an optimal stationary policy.
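For completeness, the lifetime bound for \(\pi _0\) claimed above follows from the survival identity of Sect. 5.2 together with \(\alpha (x,0)\le 3/4\):

$$\begin{aligned} {\textbf{E}}^{\pi _0}_\nu [T]=\sum _{t=0}^\infty {\textbf{P}}^{\pi _0}_\nu \{T>t\}=\sum _{t=0}^\infty {\textbf{E}}^{\pi _0}_\nu \left[ \prod _{j=0}^{t-1}\alpha (X_j,0)\right] \le \sum _{t=0}^\infty \Big (\frac{3}{4}\Big )^t=4. \end{aligned}$$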

This illustrates that a large class of controlled queueing systems with possible failure can be modeled using the results in this paper. Many generalizations are possible. For instance, it is possible to assign a strictly positive probability to the maximal capacity M, i.e., \(Q(\{M\}|x,a)>0\) for some x and a. This makes sense because a large demand can result in the system reaching its maximal capacity. This case can be treated just by changing the topology on \({\textbf{X}}\): on [0, M) we would still consider the usual topology, and we would let \(\{M\}\) be an open set (loosely, this amounts to making M an isolated point). A similar treatment could be given to the state 0. We do not give the technical details of these generalizations, but we would like to stress that the framework of this paper is flexible enough to cover many variations of the described queueing system.

6 Concluding Remarks

In this paper we have studied a constrained control problem on Borel state and action spaces in which the objective function is a total expected cost with a non-constant discount factor taking values in the interval [0, 1]. We have rewritten the original optimization problem as a linear optimization problem defined on a space of occupation measures satisfying certain conditions. Optimal elements have been obtained through topological properties (namely, compactness) of the space of occupation measures together with the semicontinuity of the objective function.

Our main contribution is twofold: (1) we have studied a constrained control problem in which the discount factor is not necessarily bounded away from one, and (2) the set of hypotheses we have proposed is considerably different from those assumed in the literature, since most of the constrained problems analyzed in the past require some kind of inf-compactness of the cost-per-stage, an assumption we have not made here.

Finally, let us mention that this line of research could be extended to the theory of dynamic games, in particular to the so-called hybrid games, which are very appealing models due to their ability to capture deep structural modifications of the system. Such changes are possible precisely because the discount factor may take the value one at some stages; for more details, see [29,30,31].