Optimal Policies for Quantum Markov Decision Processes

Markov decision processes (MDPs) offer a general framework for modelling sequential decision making where outcomes are random. In particular, they serve as a mathematical framework for reinforcement learning. This paper introduces an extension of MDPs, namely quantum MDPs (qMDPs), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide useful mathematical tools for reinforcement learning techniques applied to the quantum world.


Introduction
Markov decision processes (MDPs) offer a general framework for modelling sequential decision making where outcomes are random [1]. The model stemmed from operations research and has been widely used in a broad range of areas, including manufacturing, economics, ecology, biology, automatic control and robotics. Since Kaelbling et al. [2] introduced MDPs, in particular partially observable Markov decision processes (POMDPs), into artificial intelligence (AI), they have been successfully applied in planning, scheduling and machine learning, to name just a few areas.

Quantum Markov decision processes
Recently, MDPs have been generalised into the quantum world in two slightly different ways: 1) The notion of quantum observable Markov decision process (QOMDP) was defined by Barry et al. [3] as a quantum generalisation of Kaelbling et al.'s POMDP [2]. The following two problems were studied there: i) Policy existence problem for the infinite horizon: Given a QOMDP, a starting state, and a value V, is there a policy that achieves reward at least V?
ii) Goal-state reachability problem for the finite horizon: Given a QOMDP, a starting state s, and a goal state s′, is there a policy that can reach s′ from s with probability 1?
The most interesting result in [3] is a computability separation between POMDPs and QOMDPs indicating that goal-state reachability is decidable for POMDPs but undecidable for QOMDPs. 2) Another quantum generalisation of MDPs, called qMDP, was defined in [4]. A major difference between QOMDPs and qMDPs is that a policy in a QOMDP maps a (pure or mixed) quantum state directly to an action, whereas a policy in a qMDP maps the outcome of a measurement on a quantum state to an action. It was proved that the goal-reachability problem for infinite-horizon qMDPs with probability 1 or with probability p < 1 is EXP-hard or undecidable, respectively. The authors have employed quantum Markov chains (QMCs) as the semantic model in their research on static analysis of quantum programs [5, 6]. In particular, the termination problem of quantum programs can be reduced to reachability of QMCs. This observation led the authors further to developing model checking techniques for quantum systems [7−9]. Essentially, the main results in [4] are extensions of the corresponding results of [9] to qMDPs.

Quantum machine learning
In the last five to ten years, a new research line has been rapidly emerging at the intersection of quantum physics and AI & machine learning [10,11] . The interaction between these two areas is bidirectional: 1) Quantum physics helps to solve AI & machine learning problems via quantum computation.
2) AI & machine learning methodologies and techniques are employed to help solve problems in quantum physics.
For more detailed discussions about this area, the reader is referred to several excellent surveys [12−14] .
Reinforcement learning is a basic machine learning paradigm, in which an agent learns behaviour through trial-and-error interactions with a dynamic environment [15, 16]. Several quantum reinforcement learning models have already been proposed, either to enable the application of reinforcement learning in the quantum world or to enhance reinforcement learning by exploiting quantum advantages (see for example [17−20]). A question naturally arises here: How can the quantum Markov decision processes introduced in [3, 4] be used as a mathematical framework of quantum reinforcement learning?

Contributions of this paper
The main problem considered in [3, 4] is the reachability of quantum Markov decision processes (QOMDPs and qMDPs). However, a crucial step in classical reinforcement learning is to find an optimal behaviour of the agent, which can usually be formulated as an optimal policy in MDPs. This paper solves the optimal policy problem for quantum Markov decision processes in order to provide a useful mathematical tool for decision making and reinforcement learning in the quantum world. In this paper, we focus on the case of finite horizon. The case of infinite horizon will be discussed in a forthcoming paper. We adopt a model that extends the qMDP defined in [4].
The paper is organised as follows: Quantum mechanics is briefly reviewed in Section 2 in a way that the AI community can easily understand. Several basic notions, including qMDP, policy and expected reward, are defined in Section 3. A backward recursion for the expected reward with a given policy is established, and an algorithm based on it for computing the expected reward is presented in Section 4. In Section 5, the Bellman principle of optimality is generalised to qMDPs and an algorithm for finding optimal policies for qMDPs is given. A key step in the algorithms presented in Sections 4 and 5 is the computation of quantum probabilities. For readability, we separate it from the other parts of the algorithms and solve it in Section 6. An illustrative example is shown in Section 7. The paper concludes with several remarks about further studies.

Preliminaries
For the convenience of the reader, we review the basics of quantum mechanics. In this paper, we only consider quantum systems of which the state spaces are finite-dimensional. So, their mathematical descriptions can be presented in the languages of vectors and matrices. We assume the reader is familiar with matrix algebra. All operations (e.g., addition, multiplication, scalar product) of vectors and matrices used in this paper are standard.
Quantum states. Following the convention in quantum theory, we use the Dirac notation |ψ⟩ = (a_1, ..., a_n)^T for a column vector, i.e., an element of the n-dimensional complex vector space C^n, where C is the field of complex numbers and T stands for transpose. If the components of the vector satisfy the normalisation condition Σ_i |a_i|^2 = 1, then |ψ⟩ is called a unit vector. The dual of |ψ⟩ is the row vector ⟨ψ| = (a_1^*, ..., a_n^*), where * stands for complex conjugation. According to the basic postulates of quantum mechanics, a pure state of an n-level quantum system can be represented by a unit vector |ψ⟩ ∈ C^n. For example, the state of a qubit (quantum bit) is a 2-dimensional unit vector |ψ⟩ = a|0⟩ + b|1⟩, where {|0⟩, |1⟩} is a basis of the 2-dimensional space C^2. Two frequently used qubit states are |±⟩ = (|0⟩ ± |1⟩)/√2. More generally, a mixed state of an n-level system is described by an n × n positive semi-definite matrix ρ with unit trace: tr(ρ) = 1. It turns out that each density operator can be written in the form ρ = Σ_i p_i |ψ_i⟩⟨ψ_i|, where {|ψ_i⟩} is a family of pure states and {p_i} is a probability distribution. So, a mixed state ρ can be interpreted as follows: the system is in state |ψ_i⟩ with probability p_i. For example, if a qubit is in state |0⟩ with probability 1/2 and in state |1⟩ with probability 1/2, then it can be depicted by the density matrix ρ = (|0⟩⟨0| + |1⟩⟨1|)/2 = I/2, where I is the 2 × 2 identity matrix.

Quantum evolution. If the states of a closed quantum system at times t and t′ are |ψ⟩ and |ψ′⟩, respectively, then they are related to each other by a unitary matrix U which depends only on the times t and t′: |ψ′⟩ = U|ψ⟩. If the system is in mixed states ρ and ρ′ at times t and t′, respectively, then ρ′ = UρU†. For example, the NOT gate X and the Hadamard gate H on a qubit are respectively described by the unitary matrices X = ((0, 1), (1, 0)) and H = (1/√2)((1, 1), (1, −1)), and the state |0⟩ is transferred by H into |+⟩ = (|0⟩ + |1⟩)/√2. More generally, the dynamics of an open quantum system is described by a super-operator. The notion of super-operator can be introduced in several different (but equivalent) ways. Here, we choose to use the Kraus operator-sum representation, which is convenient for computation. A super-operator E transforms a density matrix into another and is defined by a family of matrices {E_i} satisfying Σ_i E_i†E_i = I: E(ρ) = Σ_i E_i ρ E_i†. It is obvious that E degenerates to a unitary transformation whenever {E_i} is a singleton. For example, the bit flip action transfers the state of a qubit from |0⟩ to |1⟩, and vice versa, with probability p, 0 ≤ p ≤ 1. It is described by the super-operator E(ρ) = (1 − p) IρI + p XρX, where I and X are the unit matrix and the NOT gate, respectively. For example, the state |0⟩⟨0| is transformed by E into (1 − p)|0⟩⟨0| + p|1⟩⟨1|.

Quantum measurements. To acquire information about a quantum system, a measurement must be performed on it. A quantum measurement on an n-level system is described by a collection M = {M_m} of n × n complex matrices satisfying the normalisation condition Σ_m M_m†M_m = I, where the indices m stand for the measurement outcomes. We write O(M) for the set of all possible outcomes of M. If the state of a quantum system is |ψ⟩ immediately before the measurement, then the probability that result m occurs is p(m) = ⟨ψ|M_m†M_m|ψ⟩, and the state of the system after the measurement is M_m|ψ⟩/√p(m). (1) If the state of a quantum system was ρ before the measurement, then the probability that result m occurs is p(m) = tr(M_m†M_m ρ), and the state after the measurement is M_m ρ M_m†/p(m). (2) For example, the measurement on a qubit in the computational basis is M = {M_0, M_1}, where M_0 = |0⟩⟨0| and M_1 = |1⟩⟨1|. If we perform M on a qubit in state |ψ⟩ = a|0⟩ + b|1⟩, then the probability that we get outcome 0 is |a|^2 and the probability of outcome 1 is |b|^2. In the case that the outcome is 0, the qubit will be in state |0⟩ after the measurement, and in the case that the outcome is 1, it will be in state |1⟩.
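To make equations (1) and (2) and the operator-sum representation concrete, here is a minimal Python (NumPy) sketch. It only illustrates the definitions above; the function names and the flip probability p = 0.25 are our own choices rather than anything fixed in the paper.

```python
import numpy as np

# Computational-basis measurement operators M_0 = |0><0| and M_1 = |1><1|
M0 = np.array([[1, 0], [0, 0]], dtype=complex)
M1 = np.array([[0, 0], [0, 1]], dtype=complex)

def measure(rho, measurement):
    """Measurement on a mixed state rho, equation (2): returns a list of
    (probability, post-measurement state) pairs, one per outcome."""
    results = []
    for M in measurement:
        p = np.trace(M.conj().T @ M @ rho).real        # p(m) = tr(M_m^dagger M_m rho)
        post = (M @ rho @ M.conj().T) / p if p > 1e-12 else None
        results.append((p, post))
    return results

def apply_superop(rho, kraus_ops):
    """Super-operator in Kraus form: E(rho) = sum_i E_i rho E_i^dagger."""
    return sum(E @ rho @ E.conj().T for E in kraus_ops)

# Bit-flip super-operator with flip probability p = 0.25,
# Kraus operators sqrt(1-p) I and sqrt(p) X
p = 0.25
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
bit_flip = [np.sqrt(1 - p) * I, np.sqrt(p) * X]

rho = np.array([[1, 0], [0, 0]], dtype=complex)        # the state |0><0|
rho = apply_superop(rho, bit_flip)                     # (1-p)|0><0| + p|1><1|
print([prob for prob, _ in measure(rho, [M0, M1])])    # [0.75, 0.25]
```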
Composite quantum systems. In the example presented in Section 7, we will need the notion of a tensor product of vector spaces. For each k = 1, 2, let H_k be a vector space with {|i⟩_k} as an orthonormal basis. Then the tensor product H_1 ⊗ H_2 is the vector space with {|i⟩_1 ⊗ |j⟩_2} as an orthonormal basis. For example, the state space of two qubits is C^2 ⊗ C^2 = C^4. A two-qubit system can be in a separable state |ψ_1⟩ ⊗ |ψ_2⟩, where |ψ_1⟩ and |ψ_2⟩ are one-qubit states, e.g., |0⟩ ⊗ |+⟩. It can also be in an entangled state that cannot be written as the product of two one-qubit states, like the EPR (Einstein-Podolsky-Rosen) pair or Bell state (|00⟩ + |11⟩)/√2.
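The tensor-product construction can be checked numerically in the same style; the short sketch below (variable names are ours) builds a separable state and the Bell state with np.kron.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
plus = (ket0 + ket1) / np.sqrt(2)

separable = np.kron(ket0, plus)                                  # the product state |0> (x) |+>
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)  # (|00> + |11>)/sqrt(2)
print(bell)                                                      # [0.7071, 0, 0, 0.7071]
```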

Basic definitions
Recall from [1] that an MDP consists of decision epochs, states, actions, transition probabilities and rewards. The decision epochs are the points of time at which decisions are made. In this paper, we only consider the case of finite horizon, i.e., the set of decision epochs is finite. We write S for the set of possible states of the system and A for the set of allowable actions. At each decision epoch, the system occupies a state s ∈ S, and the decision maker takes an action a chosen from A. As a result of taking action a in state s at decision epoch t, the decision maker receives a reward r_t(s, a), and the system evolves as follows: at the next decision epoch, the system is in state s′ with a probability determined by s and a. A qMDP is a quantum generalisation of an MDP in which the dynamics of, and the observations on, the system are governed by the laws of quantum mechanics. Formally, a qMDP is specified by the following components:
the set T of decision epochs;
the state space H of an n-level quantum system;
a set A of action names;
for each decision epoch t ∈ T and each action α ∈ A, a super-operator E_α acting on the states of H;
a set M of quantum measurements on H, where for each M ∈ M we write O(M) for the set of its possible outcomes;
for each decision epoch t before the final one, a real-valued reward function r_t, together with a reward function r_N at the final decision epoch N.
An MDP is a decision maker together with a classical (but stochastic) system on which the decision maker can take actions. In contrast, a qMDP consists of a decision maker and a quantum system whose state space is H. The state of this quantum system is described by a density matrix. A and M are the sets of actions and measurements, respectively, that the decision maker is allowed to perform on the system. At each decision epoch t, the decision maker acquires information about the system by performing a chosen measurement M ∈ M. Different outcomes may occur, each with a certain probability. Each pair (M, m) with m ∈ O(M) is called an observation, meaning that measurement M is performed and the outcome is m. For each action α ∈ A, the super-operator E_α models the evolution of the system if α is taken on it between the current decision epoch and the next one. So, if the system is in state ρ before action α, then it will be in state E_α(ρ) after the action. Obviously, E_α can be seen as the quantum counterpart of the matrix of transition probabilities in an MDP. For each decision epoch t and action α, r_t(M, m, α) is the reward that the decision maker gains by taking action α at epoch t when the outcome of measurement M is m. Note that in an MDP, the reward depends on the state of the system, whereas in a qMDP, the reward depends on the observation about the system rather than directly on its state, because the state of a quantum system usually cannot be fully known. Since no action is taken at the final epoch N, the domain of the reward function r_N consists of observations only, rather than observation-action pairs. Now we start to examine the behaviour of a qMDP by introducing the notion of history: a sequence h_t = (M_1, m_1, a_1, M_2, m_2, a_2, ..., a_{t−1}, M_t, m_t) is called a history of epochs 1, ..., t if each M_j ∈ M, m_j ∈ O(M_j) and a_j ∈ A. We write H_t for the set of all histories of epochs 1, ..., t.
A policy specifies the rule used by the decision maker to choose the measurements and actions performed at all decision epochs. For any nonempty set X, we write D(X) for the set of probability distributions over X. A randomised history-dependent policy maps each history to a distribution in D(M) over the measurements to be performed and, given the observed outcome, to a distribution in D(A) over the actions to be taken.
Suppose a randomised history-dependent policy π is followed. Then repeated applications of (1) and (2) describe the dynamics of the decision process: at each decision epoch, the measurement chosen by π is performed on the current state, an outcome occurs with the probability given by (2), and the state of the system becomes the corresponding post-measurement state. Furthermore, an action is chosen with the probability prescribed by π, and it transforms the system into the state obtained by applying its super-operator. In this way, every history is assigned a probability of being realised; this probability function p_t is given in (3).
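Before turning to the compact representation of p_t in Lemma 1 below, here is a small Python sketch of one decision epoch under the definitions above: a measurement is performed, an outcome is sampled according to (2), the policy chooses an action, and the corresponding super-operator is applied. The interface (function names, and the policy being reduced to a single choice per outcome) is a hypothetical simplification, not notation from the paper.

```python
import numpy as np

def qmdp_step(rho, measurement, choose_action, superops, reward_fn, rng=np.random.default_rng()):
    """Simulate one decision epoch of a qMDP (hypothetical, simplified interface).

    measurement:      list of Kraus operators {M_m} of the chosen measurement
    choose_action(m): the action the policy selects after observing outcome m
    superops:         dict mapping an action name to the Kraus operators of E_alpha
    reward_fn(m, a):  the reward r_t for observation m and action a
    """
    # Perform the measurement: outcome m occurs with probability tr(M_m^dagger M_m rho)
    probs = np.array([np.trace(M.conj().T @ M @ rho).real for M in measurement])
    m = rng.choice(len(measurement), p=probs / probs.sum())
    M = measurement[m]
    rho_post = (M @ rho @ M.conj().T) / probs[m]

    # The decision maker observes m, chooses an action and receives a reward
    a = choose_action(m)
    reward = reward_fn(m, a)

    # The chosen action evolves the system by its super-operator E_a
    rho_next = sum(E @ rho_post @ E.conj().T for E in superops[a])
    return rho_next, m, a, reward
```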
The following lemma gives a more compact representation of the probability function p_t. Lemma 1.
Proof. By a routine calculation. □
Finally, we can define the reward received by the decision maker in a qMDP. For each randomised history-dependent policy π, if π is used in the decision process, then the expected total reward over the decision-making horizon is the expectation, under the history distribution induced by π, of the sum of the rewards r_t received at decision epochs before N together with the final reward r_N; its defining equation is referred to as (6) below.
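One way to evaluate this quantity directly is to enumerate every branch of measurement outcomes, as in the sketch below (written for a deterministic, memoryless policy with one fixed measurement per epoch; the interface is hypothetical). The branching makes the cost grow exponentially with the horizon, which is why the backward recursion of the next section is preferable.

```python
import numpy as np

def expected_total_reward(rho, t, N, measurement_of, action_of, superops, reward, final_reward):
    """Direct evaluation of the expected total reward of a deterministic policy
    by enumerating all measurement-outcome branches (hypothetical interface).

    measurement_of(t): Kraus operators of the measurement performed at epoch t
    action_of(t, m):   action chosen after observing outcome m at epoch t
    reward(t, m, a):   reward r_t; final_reward(m): reward r_N at the final epoch
    """
    total = 0.0
    for m, M in enumerate(measurement_of(t)):
        prob = np.trace(M.conj().T @ M @ rho).real             # equation (2)
        if prob < 1e-12:
            continue
        rho_post = (M @ rho @ M.conj().T) / prob
        if t == N:                                             # final epoch: only r_N is received
            total += prob * final_reward(m)
        else:
            a = action_of(t, m)
            rho_next = sum(E @ rho_post @ E.conj().T for E in superops[a])
            total += prob * (reward(t, m, a)
                             + expected_total_reward(rho_next, t + 1, N, measurement_of,
                                                     action_of, superops, reward, final_reward))
    return total
```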

Policy evaluation
As in the case of MDPs, a direct computation of the reward in a qMDP based on the defining equation (6) is very inefficient. In this section, we establish a backward recursion for the reward function so that dynamic programming can be used in policy evaluation for qMDPs. To this end, we first introduce a conditional probability function. Let π be a randomised history-dependent policy, h_t ∈ H_t, and h_j = (h_t, a_t, M_{t+1}, m_{t+1}, ..., a_{j−1}, M_j, m_j) a history extending h_t. Clearly, the concatenation of h_t with the remaining segment is in H_j. By repeated applications of (1) and (2), we obtain the conditional probability of h_j under π given h_t, where the unconditional probabilities p_j are defined by (3). Similar to Lemma 1, we have a more compact representation of this conditional probability in terms of the quantities defined by (5).
Proof. By a routine calculation.
Using the conditional probability function, we can compute the expected reward in the tail of a decision process. More precisely, for each randomised history-dependent policy π, the function u_t^π : H_t → R is defined to be the expected total reward obtained by using policy π at decision epochs t, t + 1, ..., N; i.e., for every h_t ∈ H_t, u_t^π(h_t) is the conditional expectation, given h_t, of the rewards received from epoch t onward, with u_N^π(h_N) = r_N(M_N, m_N) at the final epoch.
Theorem 1 presents a backward recursion that shows how to compute the conditional reward at decision epoch t from the conditional reward at the next epoch t + 1; the recursion averages over the action taken at epoch t, the measurement performed at epoch t + 1, and its outcome.
The aim of policy evaluation is to compute the total reward defined by (6). The following lemma gives a representation of this total reward in terms of the conditional reward u_1^π at the first decision epoch.
Combining Theorem 1 and Lemma 3 enables us to develop a dynamic programming algorithm for evaluating the expected total reward; see Algorithm 1. Note that there is an essential difficulty in step 2 of this policy evaluation algorithm, namely the computation of the quantum probabilities. The same difficulty arises in the next section, where optimal policies for qMDPs are considered. So, this problem will be carefully addressed in Section 6.
Note that for each history, the computation of the quantum probability according to Theorem 3 takes time polynomial in the dimension d of the Hilbert space. Furthermore, the backward recursion (10) requires, for each decision epoch and each history, a number of multiplications each of which involves such a probability computation; together with the number of histories, which grows with the numbers of measurements, outcomes and actions available at each epoch, this determines the total complexity of Algorithm 1.
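Under the simplifying assumptions of one fixed measurement per epoch and a deterministic policy, Algorithm 1 can be sketched in Python as below. The function prob, which supplies the quantum probability of the next outcome given a history, stands for the computation discussed in Section 6; all names are hypothetical.

```python
from itertools import product

def evaluate_policy(N, n_outcomes, action_of, prob, reward, final_reward):
    """Backward recursion of Algorithm 1 (sketch; hypothetical interface).

    action_of(t, h):   action chosen by the policy after observing the outcomes in history h
    prob(t, h, a, m):  quantum probability of outcome m at epoch t + 1, given the observed
                       history h and that action a has just been taken (see Section 6)
    reward(t, m, a):   reward r_t; final_reward(m): reward r_N
    Returns u with u[t][h] = expected total reward from epoch t onward after history h.
    """
    u = [dict() for _ in range(N + 1)]
    for h in product(range(n_outcomes), repeat=N):           # base case: u_N(h_N) = r_N(m_N)
        u[N][h] = final_reward(h[-1])
    for t in range(N - 1, 0, -1):                            # backward recursion of Theorem 1
        for h in product(range(n_outcomes), repeat=t):
            a = action_of(t, h)
            u[t][h] = reward(t, h[-1], a) + sum(
                prob(t, h, a, m) * u[t + 1][h + (m,)] for m in range(n_outcomes))
    return u
```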

Optimality of policies
Now we turn to consider how to compute optimal policies. The optimal expected total reward over the decision-making horizon is defined as the supremum, over all randomised history-dependent policies, of the expected total reward (6). For any t and h_t ∈ H_t, u_t^*(h_t) is defined to be the optimal expected total reward from decision epoch t onward when the history up to time t is h_t, i.e., the supremum of u_t^π(h_t) as π traverses over all randomised history-dependent policies. Similar to Lemma 3, Lemma 4 shows that the optimal total reward can be represented in terms of the optimal reward u_1^* at the first decision epoch.
Proof. By a routine calculation.
Theorem 2 shows that the optimal expected reward at each decision epoch can be computed by solving the optimality equations. Theorem 2. (The principle of optimality) Let u_t : H_t → R (t = 1, ..., N) be a solution of the optimality equations (13) and (14). Then u_t coincides with the optimal conditional reward u_t^* for every t, and the optimal expected total reward is obtained from u_1 as in Lemma 4.
Proof. First, we show that u_t ≥ u_t^π for every randomised history-dependent policy π, by backward induction on t. By definition, it holds for t = N. Now assume that it holds for t + 1. Then, using Lemma 4.3.1 in [21] and Theorem 1, we obtain the inequality for t; note that the last inequality comes from the induction hypothesis for t + 1.
Secondly, we show that u_t ≤ u_t^*. For given t and h_t ∈ H_t, and for any ε > 0, by the definition of u_{t+1}, ..., u_N we can choose a deterministic policy π whose measurements and actions come within ε of the suprema in the optimality equations at every epoch from t onward. Comparing u_t with u_t^π term by term, we obtain that u_t(h_t) exceeds u_t^π(h_t), and hence u_t^*(h_t), by at most a quantity proportional to ε; the relevant equalities follow from the choice of π and the definitions of u_{t+1}, ..., u_N. Finally, the arbitrariness of ε leads to u_t ≤ u_t^*. □

It can be seen from the above proof that for any ε > 0, we can find a deterministic policy π that is ε-optimal, in the sense that its expected total reward is within ε of the optimum. In particular, if both A and M are finite and O(M) is finite for every M ∈ M, then there is a deterministic optimal policy π = (α_0, β_1, α_1, ..., β_{N−1}, α_{N−1}) whose actions β_t(M_t, m_t) attain the maximum in the optimality equations, i.e., maximise r_t(M_t, m_t, ·) plus the expected optimal reward at the next epoch. Based on this observation, a dynamic programming algorithm can be developed for computing the optimal expected reward and finding an optimal policy in the case that A, M and all O(M) with M ∈ M are finite.

Algorithm 2. Finding an optimal policy
1) Set t = N and u_N(h_N) = r_N(M_N, m_N) for all h_N ∈ H_N with tail(h_N) = (M_N, m_N).
2) Substitute t − 1 for t. Compute u_t(h_t) for h_t ∈ H_t with tail(h_t) = (M_t, m_t) using (13) (with u_t^* and u_{t+1}^* replaced by u_t and u_{t+1}, respectively). Set β_t(M_t, m_t) to be an action attaining the maximum in (13), and repeat this step until t = 1.
As in Algorithm 1, the problem of computing the quantum probability arises in Step 2) of this algorithm. Furthermore, it is easy to see that Algorithm 2 has the same complexity as Algorithm 1.
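Under the same simplifying assumptions as the Algorithm 1 sketch above (one fixed measurement per epoch, finite action set, hypothetical interface), Algorithm 2 becomes the following backward maximisation; alongside the optimal values it records a maximising action for every history, which yields a deterministic optimal policy.

```python
from itertools import product

def find_optimal_policy(N, n_outcomes, actions, prob, reward, final_reward):
    """Backward maximisation of Algorithm 2 (sketch; hypothetical interface).

    prob(t, h, a, m): quantum probability of outcome m at epoch t + 1 given history h
                      and action a (computed as in Section 6)
    Returns (u, beta): u[t][h] is the optimal reward-to-go and beta[t][h] a maximising action.
    """
    u = [dict() for _ in range(N + 1)]
    beta = [dict() for _ in range(N + 1)]
    for h in product(range(n_outcomes), repeat=N):           # step 1): u_N(h_N) = r_N(m_N)
        u[N][h] = final_reward(h[-1])
    for t in range(N - 1, 0, -1):                            # step 2): optimality equation
        for h in product(range(n_outcomes), repeat=t):
            best_value, best_action = None, None
            for a in actions:
                value = reward(t, h[-1], a) + sum(
                    prob(t, h, a, m) * u[t + 1][h + (m,)] for m in range(n_outcomes))
                if best_value is None or value > best_value:
                    best_value, best_action = value, a
            u[t][h], beta[t][h] = best_value, best_action
    return u, beta
```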

Computation of quantum probabilities
Now we present a method for computing the quantum probabilities needed in both Algorithms 1 and 2. First of all, an elegant formula for p_t can easily be derived from its defining equation (3). □ However, it is hard to compute the probability p_t directly using this lemma, because (t − 1)-fold iterations of super-operators occur in it. The matrix representation of a super-operator is usually easier to manipulate than the super-operator itself. Suppose super-operator E has the Kraus representation E(ρ) = Σ_i E_i ρ E_i† for all density matrices ρ. Then the matrix representation of E is the matrix M(E) = Σ_i E_i ⊗ E_i*, where A* stands for the conjugate of matrix A, i.e., A* = (a*_{ij}) with a*_{ij} being the conjugate of the complex number a_{ij}, whenever A = (a_{ij}). For every action α, we write M_α for the matrix representation of the super-operator E_α. Then a combination of the above two lemmas yields an elegant formula (Theorem 3) for computing the quantum probability p_t through ordinary matrix multiplications. Proof. This theorem can easily be proved by combining Lemmas 5 and 6. □

An illustrative example

We now give an example to illustrate the ideas introduced in the previous sections. Suppose a quantum robot is walking in a grid environment as shown in Fig. 1. Initially, the robot is at the original point of the grid. At each decision epoch, it can choose to move horizontally or vertically, each move implemented by a (one-dimensional) Hadamard quantum walk [21]. After each move, the robot's location information is (partially) obtained by making a measurement detecting its position. If the robot is found outside the grid, it gets a penalty (a negative reward) and then restarts from the original point; if it reaches the target slot, then it stays there and a reward is received; in all other cases, no reward or penalty is incurred. We assume fixed values for the penalty and the reward.
Formally, let H_h and H_v be two 2-dimensional vector spaces with {|0⟩_h, |1⟩_h} and {|0⟩_v, |1⟩_v}, respectively, as orthonormal bases. They will serve as the state spaces of the coins for the horizontal and vertical walks, respectively. The location space is the vector space H_p with {|i, j⟩_p : (i, j) ∈ G} as an orthonormal basis, where G is the set of possible positions of the robot. The shift operators for the horizontal and vertical walks move the robot one slot along the corresponding direction, conditioned on the state of the corresponding coin. Note that we use subscripts to indicate which subsystems the corresponding operators are performed on; for example, the horizontal shift acts only on systems h and p. For each of the two directions, the quantum-walk super-operator applies the Hadamard matrix H to the corresponding coin followed by the shift, except that it stops when reaching the target slot or getting out of the grid. A further super-operator resets the robot to the initial state (i.e., the position to the original point and the coin states to their initial values) when it walks outside the grid. The rewards are as described above: a penalty when the robot is found outside the grid, a positive reward when it is found at the target slot, and zero otherwise. With the notations presented above, the robot-walking system can be modelled by a qMDP. In the remainder of this section, we compute the optimal expected total reward as well as (one of) the corresponding optimal policies in several simple cases. Our strategy is to first calculate, for each history, the probability defined in (3); then the optimality equation (13) tells us, for each observation, which of the two moves an optimal policy should take.
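The matrix-representation technique of Section 6, which underlies the probability computations in examples like the one above, can be sketched as follows. A density matrix is vectorised by stacking its rows, a super-operator then acts on the vectorised state as the matrix M(E) = Σ_i E_i ⊗ E_i*, and a measurement probability becomes an ordinary trace, so the probability of a whole sequence of moves reduces to a chain of matrix-vector multiplications. The helper names below are ours.

```python
import numpy as np

def matrix_rep(kraus_ops):
    """Matrix representation M(E) = sum_i E_i (x) E_i* of a super-operator."""
    return sum(np.kron(E, E.conj()) for E in kraus_ops)

def vec(rho):
    """Row-stacking vectorisation, chosen so that vec(E rho E^dagger) = (E (x) E*) vec(rho)."""
    return rho.reshape(-1)

def unvec(v, d):
    return v.reshape(d, d)

def outcome_probability(rho, superop_chain, M_m):
    """Probability of observing outcome m of the measurement {M_m} after the super-operators
    in superop_chain (each given by its Kraus operators) have been applied in turn."""
    d = rho.shape[0]
    v = vec(rho)
    for kraus_ops in superop_chain:
        v = matrix_rep(kraus_ops) @ v          # one matrix-vector multiplication per step
    sigma = unvec(v, d)                        # the evolved state
    return np.trace(M_m.conj().T @ M_m @ sigma).real
```

In the robot example, superop_chain would consist of the super-operators of the moves chosen so far, and M_m would be the operator of the position-detecting measurement for the outcome of interest.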

Concluding remarks
In this paper, we studied the optimal policy problem for qMDPs with a finite horizon. It is shown that the problem can be solved by dynamic programming together with matrix multiplications for computing quantum probabilities. We hope that the mathematical framework of qMDPs developed in [3, 4] and this paper can provide certain theoretical foundations for quantum reinforcement learning [17−20], quantum robot planning [22−24] and other decision-making tasks in the quantum world.
For future studies, one of the most interesting problems is to settle the complexity of the optimal policy problem (as well as other problems) for qMDPs, in comparison with that for classical MDPs and POMDPs [25, 26]. Another interesting problem is to find quantum algorithms (rather than classical algorithms, as considered in the present paper) for solving qMDPs, in particular for speeding up the computation of quantum probabilities. Since the state space of a qMDP is a continuum and thus necessarily infinite, it will also be useful to extend analysis techniques for MDPs with infinite state spaces, e.g., bisimulation and metrics [27, 28], to the quantum case.