1 Introduction

Reinforcement learning (RL) is behind many significant developments in artificial intelligence (AI). Successes such as beating the world champion of Go (Silver et al. 2016) and mastering numerous complex games without any human intervention (Schrittwieser et al. 2020) were relevant milestones in AI, demonstrating optimal planning without supervision. RL is also paramount in complex real-world problems such as self-driving vehicles (Kiran et al. 2021), automated trading (Liu et al. 2020; Mosavi et al. 2020), recommender systems (Afsar et al. 2021), and quantum physics (Dalgaard et al. 2020). Recent advancements in RL are strongly associated with advances in deep learning (Goodfellow et al. 2016), since deep function approximation scales to environments with large state/action spaces, as opposed to tabular RL (Sutton and Barto 2018).

Previous results suggest that RL agents obeying the rules of quantum mechanics can outperform classical RL agents (Dunjko et al. 2016, 2017; Paparo et al. 2014; Sequeira et al. 2021; Dunjko and Briegel 2018; Saggio et al. 2021). However, these suffer from the same scaling problem as classical tabular RL: they do not scale easily to real-world problems with large state-action spaces. Additionally, the lack of fault-tolerant quantum computers (Preskill 1997) further compromises the ability to handle problems of significant size.

Variational quantum circuits (VQCs) are a viable alternative since state-action pairs can be parameterized, enabling, at least in theory, a reduction in the circuit’s complexity. Moreover, VQCs could enable shallow enough circuits to be confidently executed on current NISQ (Noisy Intermediate Scale Quantum) hardware (Preskill 2018) without resorting to typical brute force search over the state/action space as in the quantum tabular setting (Sequeira et al. 2021; Dunjko et al. 2016). Variational models are also referred to as approximately universal quantum neural networks (Farhi and Neven 2018; Schuld et al. 2021). Nevertheless, fundamental questions on the expressivity and trainability of VQCs remain to be answered, especially from a perspective relevant to RL.

This paper proposes an RL agent's policy based on a shallow VQC and studies its effectiveness when embedded in the Monte-Carlo-based policy gradient algorithm REINFORCE (Williams 2004) across standard benchmarking environments. However, benchmarking variational algorithms in classical environments requires exchanging information between a quantum and a classical channel, which incurs an overhead from encoding classical information into the quantum processor. Efficient encoding of real-world data constitutes a real bottleneck for NISQ devices, with the consequence of neglecting any potential quantum advantage (LaRose and Coyle 2020). In the case of a quantum agent-environment interface, the cost of data encoding can often be neglected, and there is room for potential quantum advantages from quantum data (Huang et al. 2021). In optimal quantum control, gate fidelity is improved by exploiting full knowledge of the system's Hamiltonian (James 2021). However, such methods are only viable when the system's dynamics are known. Thus, applying variational quantum methods may indeed be relevant (Martín-Guerrero and Lamata 2021). Here, we consider a quantum RL agent that optimizes gate fidelity in a model-free setting, learning directly from its interface with the noisy environment.

The main contributions of this paper are:

  • Design of a variational softmax policy based on a shallow VQC that matches or outperforms a restricted class of classical neural networks in long-term cumulative reward, on a set of standard benchmarking environments and on the problem of quantum state preparation, while using a fraction of the number of trainable parameters.

  • Demonstration that the sample complexity of gradient estimation grows only logarithmically with the number of parameters.

  • Empirical verification of different parameter initialization strategies for variational policy gradients.

  • Study of the barren plateau phenomenon in quantum policy gradient optimization using the Fisher information matrix spectrum.

The rest of the paper is organized as follows. Section 2 reviews quantum variational RL’s state-of-the-art. Section 3 summarizes the theory behind the classical policy gradient algorithm used in this work. Section 4 details each block of the proposed VQC and the associated quantum policy gradient algorithm. Section 4.5 explores trainability under gradient-based optimization using quantum hardware and its corresponding sample complexity. Section 5 presents the performance of the quantum variational algorithm in simulated benchmarking environments. Section 6 analyzes the number of parameters trained and the Fisher information spectrum associated with the classical/quantum policy gradient. Section 7 closes the paper with some concluding remarks and suggestions for future work.

2 Related work

Despite numerous publications focusing on quantum machine learning (QML), the literature on variational methods applied to RL remains scarce. Most results to date focus on value-based rather than policy-based function approximation. Chen et al. (2020) use VQCs as quantum value function approximators for discrete state spaces, and, in Lockwood and Si (2020), the authors generalize the former result to continuous state spaces. Lockwood and Si (2021) show that simple VQC-inspired Q-networks (i.e., state-action value approximators) based on double deep Q-learning are not adequate for the Atari games Pong and Breakout. Sanches et al. (2021) proposed a hybrid quantum-classical policy-based algorithm to solve real-world problems like vehicle routing. In Wu et al. (2021), the authors proposed a variational actor-critic agent, which is the only work so far operating in the quantum-quantum context of QML (Aïmeur et al. 2006), i.e., a quantum agent acting upon a quantum environment. The authors suggest that the variational method could solve quantum control problems. Jerbi et al. (2021) propose a novel quantum variational policy-based algorithm achieving better performance than previous value-based methods in a set of standard benchmarking environments. Their architecture consists of repeated angle encoding to increase the expressivity of the variational model, i.e., to increase the number of functions of the input state that the model can represent (Schuld et al. 2021). Compared with Jerbi et al. (2021), our work shows that a simpler variational architecture, composed of a shallow ansatz consisting of a two-qubit entangling gate and two single-qubit gates (Bharti et al. 2022) with a single encoding layer, suffices for standard benchmarking environments. Variational policies can thus be devised with reduced depth and fewer trainable parameters. The class of functions our circuit can represent is substantially smaller than that of Jerbi et al. (2021); however, simpler policy classes may be beneficial in terms of generalization and overfitting. Furthermore, compared to Jerbi et al. (2021), this work considers a simpler set of observables for the measurement of the quantum circuit, leading to fewer shots needed to estimate the agent's policy and the respective policy gradient.

3 Policy gradients

Policy gradient methods learn a parameterized policy \(\pi (a\lvert s,\theta ) = \mathbb {P}\{ a_{t} = a \lvert s_{t} = s , \theta _{t} = \theta \}\) that can select actions optimally without resorting to a value function, where \(\theta \in \mathbb {R}^{k}\) is the parameter vector of size k, s and a are the state and action, respectively, and t is the time step. These methods maximize a performance measure J(𝜃) by performing gradient ascent on J(𝜃)

$$ \theta_{i+1} = \theta_{i} + \eta \nabla_{\theta_{i}} J(\theta_{i}) $$
(1)

where η is the learning rate. Provided that the action space is discrete and relatively small, the most prominent way of balancing exploration and exploitation is to sample an action from a softmax policy, also known as a neural policy (Agarwal et al. 2019):

$$ \pi(a \lvert s,\theta) = \frac{e^{h(s,a,\theta)}}{{\sum}_{b \in A} e^{h(s,b,\theta)}} $$
(2)

where \(h(s,a,\theta ) \in \mathbb {R}\) is a numerical preference for each state-action pair and A is the action set. For legibility, A will be omitted whenever a policy similar to Eq. 2 is presented. The policy gradient theorem (Sutton et al. 1999) states that the gradient of the objective function can be written as a function of the policy itself. In particular, the Monte-Carlo policy gradient known as REINFORCE (Williams 2004) computes the gradient from samples obtained over N trajectories of length T (the horizon) under the parameterized policy, as in Eq. 3.

$$ \nabla_{\theta} J(\theta) = \frac{1}{N} \sum\limits_{i=0}^{N-1}\sum\limits_{t=0}^{T-1} G_{t}(\tau_{i}) \nabla_{\theta} \log \pi(a_{t_{i}}\lvert s_{t_{i}},\theta) $$
(3)

where Gt(τ) is the γ-discounted cumulative reward from time step t, known as the return (see Eq. 5), derived from the trajectory's return G(τ) (see Eq. 4).

$$ G(\tau) = \sum\limits_{t=0}^{T-1} \gamma^{t} r_{t+1} $$
(4)
$$ G_{t}(\tau) = \sum\limits_{t^{\prime} = 0}^{T-t-1} \gamma^{t^{\prime}} r_{t+t^{\prime}+1} $$
(5)

A known limitation of the REINFORCE algorithm stems from its Monte Carlo estimates. Stochastically sampling the trajectories results in gradient estimators with high variance, which deteriorates performance as the environment's complexity increases (Greensmith et al. 2004). The REINFORCE estimator can be improved by leveraging a control variate known as a baseline, b(st), without increasing the number of samples N. The baseline is subtracted from the return, reducing the variance of the estimator and smoothing the optimization landscape. The REINFORCE with baseline gradient estimator is given in Eq. 6, and the complete algorithm is presented in Algorithm 1.

$$ \nabla_{\theta} J(\theta) = \frac{1}{N} \sum\limits_{i=0}^{N-1}\sum\limits_{t=0}^{T-1} (G_{t}(\tau_{i}) - b(s_{t_{i}})) \nabla_{\theta} \log \pi(a_{t_{i}} \lvert s_{t_{i}} , \theta) $$
(6)

For the benchmarking environments in Section 5, the average return was used as a baseline, calculated as in Eq. 7.

$$ b(s_{t}) = \frac{1}{{N}} \sum\limits_{i=0}^{N-1} G_{t}(\tau_{i}) $$
(7)
Algorithm 1: REINFORCE with baseline
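As an illustration of the algorithm, the sketch below performs one REINFORCE-with-baseline update for a generic stochastic policy; the `policy`, `env`, and `optimizer` objects and the Gym-style `reset()`/`step()` interface are assumptions of the example, not the exact implementation used in the experiments.

```python
import torch

def reinforce_with_baseline_update(policy, env, optimizer,
                                   n_trajectories=10, horizon=200, gamma=0.99):
    """One update of Algorithm 1 (REINFORCE with baseline, Eqs. 5-7)."""
    log_probs, rewards = [], []
    for _ in range(n_trajectories):
        state, traj_logp, traj_r = env.reset(), [], []
        for _ in range(horizon):
            probs = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()
            state, reward, done, _ = env.step(action.item())
            traj_logp.append(dist.log_prob(action))
            traj_r.append(reward)
            if done:
                break
        log_probs.append(traj_logp)
        rewards.append(traj_r)

    # Discounted returns G_t(tau_i) (Eq. 5), computed backwards in time.
    returns = []
    for traj_r in rewards:
        G, traj_G = 0.0, []
        for r in reversed(traj_r):
            G = r + gamma * G
            traj_G.append(G)
        returns.append(list(reversed(traj_G)))

    # Baseline b(s_t): average return at time step t over trajectories (Eq. 7).
    max_len = max(len(g) for g in returns)
    baseline = [sum(g[t] for g in returns if t < len(g)) /
                sum(1 for g in returns if t < len(g)) for t in range(max_len)]

    # Gradient-ascent step on the estimator of Eq. 6 (loss is the negated objective).
    loss = -sum(lp * (g[t] - baseline[t])
                for lp_traj, g in zip(log_probs, returns)
                for t, lp in enumerate(lp_traj)) / n_trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```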

4 Quantum policy gradients

This section details the proposed VQC-based policy gradient. Numerical preferences \(h(s, a,\theta ) \in \mathbb {R}\) are the output of measurements in a given parameterized quantum circuit. The result can be represented as the expectation value of a given observable or the probability of measuring a basis state. We resort to the former since it allows for more compact representations of objective functions (Bharti et al. 2021). Additionally, the type of ansatz used by the proposed VQC implies that \(\theta \in \mathbb {R}^{k}\) is a high dimensional vector corresponding to the angles of arbitrary single-qubit rotations.

VQCs are composed of four main building blocks, as represented in Fig. 1. Initially, a state preparation routine or embedding, S, encodes data points into the quantum system. Next, a unitary U(𝜃) maps the data into higher dimensions of the Hilbert space. Such a parameterized model corresponds to linear methods in quantum feature spaces. Expectation values returned from a measurement scheme are finally post-processed into the quantum neural policy. A careful analysis of each block of Fig. 1 follows. Moreover, the sample complexity of estimating the quantum policy gradient is analyzed in Section 4.5.

Fig. 1: Building blocks of variational quantum circuits

4.1 Embedding

Unlike in classical algorithms, the state-preparation routine is a crucial step in any variational quantum algorithm. There are numerous ways of encoding classical data into a quantum processor (Schuld and Petruccione 2018). Angle encoding (LaRose and Coyle 2020) is used here to allow for continuous state spaces. Arbitrary Pauli rotations σ ∈{σx,σy,σz} can encode a single feature per qubit. Hence, given an agent's state s with n features, \(s = \{s_{0}, s_{1}, {\dots } s_{n-1}\}\), σx rotations are used, requiring n qubits to encode |s〉, as indicated by Eq. 8.

$$ |{s}\rangle = \bigotimes_{i=0}^{n-1} e^{-j \sigma_{x} s_{i}} |{b_{i}}\rangle $$
(8)

where |bi〉 refers to the i th qubit of an n-qubit register initially in state |0n〉 (represented w.l.o.g. as |0〉 from now on). Each feature needs to be normalized such that si ∈ [−π,π]. Since the range of each feature is usually unknown, this work resorts to normalization based on the \(L_{\infty }\) norm. The main advantage of angle encoding lies in the simplicity of generating the encoding, since it is composed of only n single-qubit gates, giving rise to a circuit of depth 1. In contrast, its main disadvantages are the linear dependence between the number of qubits and the number of features characterizing the agent's state, and its poor representational power, at least in principle (Schuld 2021).
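A minimal PennyLane sketch of this encoding is shown below, assuming the features arrive as a plain array; the \(L_{\infty }\) normalization step and the small constant guarding against division by zero are illustrative choices.

```python
import numpy as np
import pennylane as qml

def encode_state(s):
    """Angle encoding of Eq. 8: one sigma_x rotation per feature (depth 1)."""
    s = np.asarray(s, dtype=float)
    s = np.pi * s / (np.max(np.abs(s)) + 1e-12)   # L_inf normalization to [-pi, pi]
    for i, feature in enumerate(s):
        # RX(phi) = exp(-i phi sigma_x / 2); the factor-of-two convention is glossed over here.
        qml.RX(feature, wires=i)
```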

4.2 Parameterized model

To the best of the authors’ knowledge, no problem-inspired ansatz exploiting the physics behind the problem is known in RL applications. This can be explained by the difficulty of expressing and training RL agent’s policies as Hamiltonian-based evolution models (Bharti et al. 2021). Moreover, since the goal is to design a NISQ ansatz to capture the agent’s optimal policy in different environments, this work uses a parameterized model from the family commonly referred to as hardware-efficient ansatze (Bharti et al. 2021). Such models behave similarly to a classical feed-forward neural network. The main advantage of this family of ansatze is its versatility, accommodating encoding symmetries and bringing correlated qubits closer for depth reduction (Cerezo et al. 2021). The ansatz consists of an alternating-layered architecture composed of single-qubit gates followed by a cascade of entangling gates as pictured in Fig. 2.

Fig. 2: Hardware-efficient ansatz for RL based on single-qubit Ry, Rz rotation gates

A single layer is composed of two single-qubit σy,σz rotation gates per qubit, followed by a cascade of entangling gates, such that features are correlated in a highly entangled state. The ansatz includes 2n single-qubit rotation gates per layer, each parameterized by a given angle. Therefore, there are 2nL trainable parameters for L layers. The entangling gates follow a pattern that changes over the layers, inspired by the circuit-centric classifier design (Schuld et al. 2021). The pattern follows the modular arithmetic CNOT[i,(i + l) mod n], where \(i \in \{1, {\dots } ,n\}\) indexes the qubits and \(l \in \{1, {\dots } ,L\}\) indexes the layers. Increasing the number of layers increases the correlation between features and the expressivity of the model.
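The following sketch reproduces one possible reading of this ansatz in PennyLane; the parameter shape and the guard that skips a CNOT whenever control and target would coincide are assumptions of the example.

```python
import pennylane as qml

def hardware_efficient_ansatz(theta, n_qubits, n_layers):
    """Ansatz of Fig. 2: Ry/Rz rotations per qubit, then a CNOT cascade per layer.

    `theta` is assumed to have shape (n_layers, n_qubits, 2)."""
    for l in range(1, n_layers + 1):
        for q in range(n_qubits):
            qml.RY(theta[l - 1, q, 0], wires=q)
            qml.RZ(theta[l - 1, q, 1], wires=q)
        for q in range(n_qubits):
            target = (q + l) % n_qubits          # CNOT[i, (i + l) mod n]
            if target != q:                      # skip degenerate control == target
                qml.CNOT(wires=[q, target])
```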

4.3 Measurement

An arbitrary state \(|{\psi }\rangle \in \mathbb {C}^{2^{n}}\) can be written as a superposition over the basis states, as in Eq. 9.

$$ |{\psi}\rangle = \sum\limits_{i=0}^{2^{n}-1} c_{i} |{\psi_{i}}\rangle $$
(9)

Measuring the state |ψ〉 in the computational basis (σz basis) collapses the superposition into one of the basis states |ψi〉 with probability \(\lvert c_{i} \rvert ^{2}\), as given by the Born rule (Nielsen and Chuang 2011). In general, the expectation value of some observable \(\hat {O}\) is given by the summation of each possible outcome, i.e., the eigenvalue λi weighted by its respective probability \(p_{i} = \lvert c_{i}\rvert ^{2}\) as in Eq. 10.

$$ \langle \hat{O} \rangle = \langle \psi | \hat{O} | \psi \rangle = \sum\limits_{i=0}^{2^{n}-1} \lambda_{i} p_{i} $$
(10)

Let \(\hat {O}\) be the single-qubit \({\sigma _{z}^{i}}\) measurement, applied to the i th-qubit. Given that the σz eigenvalues are {− 1,1}, the expectation value \(\langle {\sigma _{z}^{i}} \rangle \) can be obtained by the probability p0 of the qubit being in the state |0〉 as \(\langle {\sigma _{z}^{i}} \rangle = 2p_{0} - 1\). Notice that in practice, p0 needs to be estimated from several circuit repetitions to obtain an accurate estimate of the expectation value.
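As a small illustration of this estimation, the helper below converts shot counts into the estimate \(\langle {\sigma _{z}^{i}} \rangle = 2p_{0} - 1\); the bitstring-keyed dictionary and its qubit ordering are assumptions of the example.

```python
def sigma_z_expectation(counts, qubit):
    """Estimate <sigma_z^i> = 2*p0 - 1 from measured bitstring counts.

    `counts` is assumed to map bitstrings such as '0110' to shot counts,
    with position `qubit` holding the outcome of the i-th qubit."""
    shots = sum(counts.values())
    p0 = sum(c for bits, c in counts.items() if bits[qubit] == "0") / shots
    return 2.0 * p0 - 1.0
```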

Let the state |ψ〉 be the quantum state obtained from the encoding of an agent’s state via S(s), and the parameterized block U(𝜃), as in Sections 4.1 and 4.2 respectively. Let \(\langle {\sigma _{z}^{i}} \rangle \) be the quantum analogue of the numerical preference for action i, which we represent by 〈ai〉 for clarity. Its expectation can be formally described by Eq. 11.

$$ \langle a_{i} \rangle_{\theta} = \langle 0 | S(s)^{\dagger} U(\theta)^{\dagger} \sigma_{z}^{i} U(\theta) S(s) | 0 \rangle $$
(11)

For a policy with \(\lvert A\rvert \) possible actions, each σz measurement corresponds to the numerical preference of one action. Thus, \(\lvert A\rvert \) single-qubit estimated expectation values are needed. If the number of features in the agent's state is larger than the number of actions, the single-qubit measurements occur only on a subset of qubits. Such a measurement scheme is qubit-efficient (Schuld and Petruccione 2018). Figure 3 represents the full VQC for an environment with four state features and four actions, with three parameterized layers.
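Putting the previous sketches together, a circuit in the spirit of Fig. 3 could be written as below; the device, shot count, and the reuse of the `encode_state` and `hardware_efficient_ansatz` helpers sketched above are assumptions of the example.

```python
import pennylane as qml

n_qubits, n_layers, n_actions = 4, 3, 4               # configuration of Fig. 3
dev = qml.device("default.qubit", wires=n_qubits, shots=1000)

@qml.qnode(dev)
def action_preferences(s, theta):
    """Encoding S(s), ansatz U(theta), and one sigma_z expectation per action (Eq. 11)."""
    encode_state(s)
    hardware_efficient_ansatz(theta, n_qubits, n_layers)
    return [qml.expval(qml.PauliZ(i)) for i in range(n_actions)]
```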

Fig. 3: Variational quantum circuit for policy-based RL with three parameterized layers

4.4 Classical post-processing

Measurement outcomes representing numerical preferences h(s,a,𝜃) = 〈a𝜃 are classically post-processed to convert the estimated expectation values to the final quantum neural policy, as given by Eq. 12.

$$ \pi(a\mid s,\theta) = \frac{{e^{\langle a \rangle_{\theta}}}}{{{\sum}_{b} e^{\langle b \rangle_{\theta}}}} $$
(12)

Equation 12 imposes an upper bound on the greediness of π. It will always allow for exploratory behavior, which can negatively impact the performance of RL agents, especially in deterministic environments. As an example, consider a 2-action environment with

$$ \pi = [\pi(a_{0}\mid s,\theta) , \pi(a_{1}\mid s,\theta)] $$

The entries of π are given by Eq. 12 and the actions’ estimated expectation values [〈a0𝜃,〈a1𝜃]. As these are bounded as 〈σz〉∈ [− 1,1], the maximum difference between action preferences occurs when the estimated vector is [〈a0𝜃 = − 1,〈a1𝜃 = 1]. The corresponding softmax normalized vector is:

$$ \pi_{a} = [\pi(a_{0}\mid s,\theta) , \pi(a_{1}\mid s,\theta)] = [0.12,0.88] $$

In this case, the policy always has a \(\sim 0.1\) probability of selecting the worst action; the same rationale applies to larger action sets. Thus, a trainable parameter β is added to the quantum neural policy as in Eq. 13:

$$ \pi(a\lvert s,\theta) = \frac{{e^{\beta\langle a \rangle_{\theta}}}}{{{\sum}_{b} e^{\beta\langle b \rangle_{\theta}}}} $$
(13)

β scales the output values of the quantum circuit measurements, resembling an inverse temperature in an energy-based model. Instead of annealing β over time, it is treated as an additional trainable parameter optimized jointly with 𝜃. The optimization sets β, assuring convergence towards the optimal policy.
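A sketch of this post-processing as a PyTorch module follows; the `expectations_fn` wrapper around the variational circuit (for instance, the QNode sketched above) and the random initialization of β are assumptions of the example.

```python
import torch

class QuantumNeuralPolicy(torch.nn.Module):
    """Softmax policy of Eq. 13 over the measured action preferences."""

    def __init__(self, expectations_fn):
        super().__init__()
        self.expectations_fn = expectations_fn         # assumed to return a torch tensor of <a_i>_theta
        self.beta = torch.nn.Parameter(torch.rand(1))  # trainable scaling, randomly initialized

    def forward(self, state):
        prefs = self.expectations_fn(state)              # <a_i>_theta, each in [-1, 1]
        return torch.softmax(self.beta * prefs, dim=-1)  # Eq. 13
```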

4.5 Gradient estimation

This section develops upper bounds on both the number of samples and the number of circuit evaluations necessary to obtain an 𝜖-approximation of the policy gradient, as given by Eq. 3, restated here for completeness:

$$ \nabla_{\theta} J(\theta) = \frac{1}{N} \sum\limits_{i=0}^{N-1}\sum\limits_{t=0}^{T-1} G_{t}(\tau_{i}) \nabla_{\theta} \log \pi(a_{t_{i}}\lvert s_{t_{i}},\theta) $$

The gradient ∇𝜃J(𝜃) can be estimated using the same quantum device that computes expectations 〈ai𝜃, via parameter-shift rules (Schuld et al. 2019). These rules require the policy gradient to be framed as a function of gradients of observables, as given by Eq. 14.

$$ \nabla_{\theta} \log \pi(a\lvert s,\theta) = \beta \left( \nabla_{\theta} \langle a \rangle_{\theta} - {\sum\limits_{b} \pi(b\lvert s,\theta) \nabla_{\theta} \langle b \rangle_{\theta}}\right) $$
(14)

By combining Eqs. 3 and 14, the quantum policy gradient estimator is given by Eq. 15:

$$ \nabla_{\theta} J(\theta) = \frac{1}{N} \sum\limits_{i=0}^{N-1}\sum\limits_{t=0}^{T-1} G_{t}(\tau_{i})\, \beta \left( \nabla_{\theta} \langle a_{t_{i}} \rangle_{\theta} - \sum\limits_{b} \pi(b\mid s_{t_{i}},\theta)\, \nabla_{\theta} \langle b \rangle_{\theta} \right) $$
(15)

The number of samples associated with Eq. 15 is defined as the number of visited states. Since there are N trajectories (sequences of actions, τi), each visiting T states, the total number of samples is \(\mathcal {O}(NT)\).

Lemma 4.1 provides an upper bound for N such that the policy gradient is 𝜖-approximated with probability 1 − δ.

Lemma 4.1 (𝜖 -approximation of the policy-gradient)

Let \(\theta \in \mathbb {R}^{k}\), with k the number of parameters, Rmax the maximum possible reward in any time step, T the horizon, and ∇𝜃J(𝜃) the expected policy gradient. The estimated policy gradient, \(\hat {\nabla }_{\theta } J(\theta )\), can be 𝜖∇-approximated, with probability 1 − δ∇,

$$ \lvert \hat{\nabla}_{\theta} J(\theta) - \nabla_{\theta} J(\theta) \rvert \leq \epsilon_{\nabla} $$
(16)

using a number of samples given by

$$ NT \approx \mathcal{O}\left( \frac{8\beta^{2} R_{\max}^{2} T^{3}}{\epsilon_{\nabla}^{2} (\gamma - 1)^{4}} \log \left( \frac{2k}{\delta_{\nabla}}\right) \right) $$
(17)

The most relevant insight from Lemma 4.1 is that obtaining an 𝜖-approximated policy gradient requires a number of samples that grows only logarithmically with the total number of parameters. The proof of Lemma 4.1 is presented in detail in Appendix A.1.

Gradient-based optimization can be performed using the same quantum device that computes the expectations 〈ai〉𝜃, via parameter-shift rules (Sweke et al. 2020; Schuld et al. 2019), which compute the gradient of an observable with respect to a single variational parameter corresponding to the rotation angle of a quantum gate. The parameter-shift rule is given by Eq. 18:

$$ \nabla_{\theta_{i}} \langle a \rangle_{\theta} = \frac{1}{2} \left[ \langle a \rangle_{\theta_{i} + \frac{\pi}{2}} - \langle a \rangle_{\theta_{i} - \frac{\pi}{2}} \right] $$
(18)
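A direct implementation of this rule, requiring two circuit evaluations per parameter, could look as follows; the `expectation_fn` wrapper returning an estimated expectation value for a given parameter vector is an assumption of the example.

```python
import numpy as np

def parameter_shift_gradient(expectation_fn, theta):
    """Parameter-shift rule of Eq. 18: shift each parameter by +/- pi/2."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (expectation_fn(theta + shift) - expectation_fn(theta - shift))
    return grad
```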

The gradient’s accuracy depends on the expectation values, 〈a𝜃. These are estimated for each sample and action using several repetitions of the quantum circuit or shots. Lemma 4.2 establishes an upper bound on the total number of shots required to reach an 𝜖〈〉-approximated policy gradient, with probability 1 − δ〈〉.

Lemma 4.2 (Total number of quantum circuit evaluations)

Let \(\theta \in \mathbb {R}^{k}\), let \(\mathcal {O}(NT)\) be the sample complexity given by Lemma 4.1, and \(\lvert A \rvert \) the number of available actions. With probability 1 − δ〈〉 and approximation error 𝜖〈〉, the quantum policy gradient algorithm requires a number of shots given by

$$ \mathcal{O}\left( \frac{\lvert A \rvert NT}{\epsilon_{\langle \rangle}^{2}} \log \left( \frac{2k}{\delta_{\langle \rangle}}\right)\right) $$
(19)

Similarly to Lemma 4.1, the number of shots required for a given policy gradient accuracy grows only logarithmically with the total number of parameters. The proof of Lemma 4.2 is presented in detail in Appendix A.2.
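For concreteness, the helper below evaluates the two asymptotic bounds; since Lemmas 4.1 and 4.2 hide constant factors inside the \(\mathcal{O}(\cdot)\) notation, the returned numbers are only indicative orders of magnitude, and all arguments are user-chosen accuracy and confidence targets.

```python
import numpy as np

def sample_and_shot_bounds(k, beta, r_max, horizon, gamma, n_actions,
                           eps_grad, delta_grad, eps_exp, delta_exp):
    """Order-of-magnitude bounds of Lemma 4.1 (samples NT) and Lemma 4.2 (shots)."""
    samples = (8 * beta**2 * r_max**2 * horizon**3
               / (eps_grad**2 * (1 - gamma)**4)) * np.log(2 * k / delta_grad)   # Eq. 17
    shots = (n_actions * samples / eps_exp**2) * np.log(2 * k / delta_exp)      # Eq. 19
    return samples, shots
```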

5 Performance in simulated environments

This section examines the performance of the proposed quantum policy gradient on standard benchmarking environments from the OpenAI Gym library (Brockman et al. 2016). Moreover, the quantum policy gradient was also tested in a handcrafted quantum control environment, in which a quantum agent learns to prepare the state |1〉 with high fidelity, starting from the ground state |0〉. The empirical reward over the number of episodes was used to compare the performance of the classical and quantum models. The best-performing classical neural network was selected from a restricted set of networks composed of at most two hidden linear layers. All quantum circuits were built using the Pennylane library (Bergholm et al. 2020) and trained using the PyTorch automatic differentiation backend (Paszke et al. 2017) so that they can be directly compared with classical models built with the same library. All training instances used the most common classical optimizer, ADAM (Kingma and Ba 2017).

5.1 Numerical experiments

The CartPole-v0 and Acrobot-v1 environments were selected as classic benchmarks. They have continuous state spaces with relatively small feature spaces (2 to 6 features) and discrete action spaces (2 to 3 possible actions). The reward structure is similar in the two environments. In CartPole, the agent receives a reward of + 1 at every time step: the longer the agent keeps the pole from falling, the more reward it collects. In Acrobot, the agent receives a reward of − 1 at every time step and a reward of 0 once it reaches the goal state. Acrobot is therefore harder to master, since in CartPole every action has an immediate effect on the reward, as opposed to Acrobot.

In the quantum control environment of state preparation, which we refer to as QControl from this point onward for simplicity, the mapping |0〉↦|1〉 can be characterized by a time-dependent Hamiltonian H(t) of the form of Eq. 20, describing the quantum environment as in Zhang et al. (2019).

$$ H(t) = 4J(t)\sigma_{z} + h\sigma_{x} $$
(20)

where h represents the single-qubit energy gap between tunable control fields, considered a constant energy unit, and J(t) represents the dynamical pulses controlled by the RL quantum agent in a model-free setting. The learning procedure defines a fixed number of steps, N = 10, within which the RL agent must be able to create the desired quantum state. The quantum environment prepares the state associated with time step t + 1 by applying the unitary U(t) generated by the Hamiltonian at time step t:

$$ |{\psi_{t+1}}\rangle = U(t)|{\psi}\rangle $$
(21)

The fidelity between the target state |ψT〉 = |1〉 and the prepared state |ψt〉 naturally serves as the reward rt for the agent at time step t, as in Eq. 22.

$$ r_{t} = {\lvert \langle \psi_{t} \lvert \psi_{T} \rangle \rvert }^{2} $$
(22)

Using the policy gradient algorithm of Section 4, the goal is to learn how to maximize fidelity. Figure 4 depicts the agent-environment interface.

Fig. 4: Agent-environment interface for quantum control

Each sequence of N pulses corresponds to an episode. The quantum agent should learn the optimal pulse sequence that maps to the state with maximum fidelity as the number of episodes increases. The quantum variational architecture selected was the same as described in Section 4; the main difference in this setting is the lack of encoding, since the quantum agent receives directly the quantum state produced by the time-step Hamiltonian applied at each time step. However, since the environment is simulated, the qubit is prepared in the state of time step t and then fed to the variational quantum policy. The action space is binary, A = {0,1} (apply a pulse, A = 1, or not, A = 0), and a sequence of N actions corresponds to N pulses. A performance comparison is made relative to classical policy gradients; in this case, the state vector associated with the qubit was explicitly encoded at each time step, considering both its real and imaginary components. All environment specifications are presented in Table 1.
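A simulation of this environment in the spirit of Eqs. 20-22 is sketched below; the pulse amplitude, the step duration dt, and the Gym-like step() signature are illustrative assumptions, since the paper only fixes N = 10 steps, the target state |1〉, and the fidelity reward.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)

class QControlEnv:
    """Single-qubit state-preparation environment of Eqs. 20-22 (sketch)."""

    def __init__(self, n_steps=10, h=1.0, dt=np.pi / 10, pulse=1.0):
        self.n_steps, self.h, self.dt, self.pulse = n_steps, h, dt, pulse
        self.target = np.array([0.0, 1.0], dtype=complex)     # |1>
        self.reset()

    def reset(self):
        self.t = 0
        self.psi = np.array([1.0, 0.0], dtype=complex)        # |0>
        return self.psi

    def step(self, action):
        J = self.pulse if action == 1 else 0.0                 # apply the pulse or not
        H = 4 * J * SIGMA_Z + self.h * SIGMA_X                 # Eq. 20
        eigval, eigvec = np.linalg.eigh(H)                     # U(t) = exp(-i H dt)
        U = eigvec @ np.diag(np.exp(-1j * eigval * self.dt)) @ eigvec.conj().T
        self.psi = U @ self.psi                                # Eq. 21
        reward = abs(np.vdot(self.psi, self.target)) ** 2      # fidelity, Eq. 22
        self.t += 1
        return self.psi, reward, self.t >= self.n_steps, {}
```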

Table 1 Description of the environments (#F, number of features; #A, number of actions; Max #s, maximum steps)

Several neural network architectures were tested for the CartPole-v0 and Acrobot-v1 environments, all sharing the same overall structure: fully connected layers with a rectified linear unit (ReLU) activation in every neuron, except for the output layer, which has no activation. The depth, the total number of trainable parameters, and the use of dropout differ from network to network. All networks using dropout have a dropout probability of 0.2. Every network was trained with the ADAM optimizer with an experimentally fine-tuned learning rate of 0.01. Figure 5a and b illustrate the average reward for different classical network configurations in the benchmarking environments. The results show that fully connected networks with a single hidden layer of 128 and 32 neurons perform reasonably better than similar architectures for the CartPole-v0 and Acrobot-v1 environments, respectively.
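A sketch of such a baseline network in PyTorch is given below; the helper name and the example 4-128-2 CartPole configuration follow the description above, with the softmax of Eq. 2 applied afterwards.

```python
import torch

def make_baseline_network(n_features, n_hidden, n_actions, dropout=False):
    """Fully connected policy network as in Fig. 5: ReLU hidden layer,
    optional dropout (p = 0.2), and a linear output layer."""
    layers = [torch.nn.Linear(n_features, n_hidden), torch.nn.ReLU()]
    if dropout:
        layers.append(torch.nn.Dropout(p=0.2))
    layers.append(torch.nn.Linear(n_hidden, n_actions))
    return torch.nn.Sequential(*layers)

# Example: the 4-128-2 CartPole network, trained with ADAM (lr = 0.01).
net = make_baseline_network(4, 128, 2)
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
```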

Fig. 5: Different classical neural network architectures used in the three simulated environments. Panels a, b, and c represent different architectures for the Cartpole, Acrobot, and QControl environments, respectively. Each label indicates the network structure, i.e., the number of neurons in the input, hidden, and output layers, and whether dropout is used. For example, 4 − 4 − 4 has input, hidden, and output layers with four neurons each

In the QControl environment, eight different neural networks with a single hidden layer were tested. Since the optimal neural network for this problem is, to the best of the authors' knowledge, still an open question, the network size was successively increased until the task was solved, so that the minimum viable network could be compared with the VQC. From this set of classical architectures, the network with a single hidden layer of 16 neurons was chosen, since it achieves the best average fidelity while being the minimum viable network that solves the problem, as illustrated in Fig. 5c.

The second step compares the performance of the quantum neural policy of Section 4 against the aforementioned classical architectures. Increasing the number of layers in the parameterized quantum model would perhaps increase the expressivity of the model (Schuld 2021). At the same time, increasing the number of layers leads to more complex optimization tasks, given that more parameters need to be optimized. For some variational architectures, there is a threshold for expressivity in terms of the number of layers (Sim et al. 2019). We encountered precisely this in practice: for Cartpole, the expressivity of the quantum neural policy saturates after three layers, and for Acrobot, after four layers; from there on, the agent's performance deteriorated rather than improved. For the QControl environment, the classical NN was compared with a simplified version of the variational softmax policy. In this case, the VQC consists of the most general single-qubit gate, with three parameters, which can prepare an arbitrary single-qubit state. The observables for the numerical action preferences are computational-basis measurements with opposite signs, i.e., [〈σz〉,−〈σz〉]. In every environment, the model's learning rate was fine-tuned by trial and error, as opposed to β, which was randomly initialized. The optimal configuration of learning rate, number of layers, and batch size used in the comparison is presented in Table 2.

Table 2 Specification for hyperparameter, number of layers, and batch size used for the classical and quantum neural policies in the three simulated environments

Figure 6a, b, and c compare the average cumulative reward through several episodes for quantum and classical neural policies for the Cartpole, Acrobot, and QControl environments, respectively. A running mean was plotted to smooth the reward curves since the policy and environments are noisy. Figure 6c also plots the respective control trajectory obtained by the variational quantum policy.

Fig. 6: Average cumulative reward: comparison between the variational softmax policy and the respective classical NN. Panels a, b, and c represent the average reward comparison for the Cartpole, Acrobot, and QControl environments, respectively

One can conclude that the quantum and classical neural policies perform similarly in every environment. In the QControl environment, the classical policy achieves a slightly greater cumulative reward. Nonetheless, there is clear evidence that the quantum-inspired policy needs fewer interactions with the environment to converge to near-optimal behavior. Moreover, the total number of trainable parameters for the quantum and classical models is summarized in Table 3. The input layer of a classical neural network is related to the number of qubits in the quantum circuit, and we take the number of layers in the VQC as the analogue of the number of hidden layers in a classical neural network. Given that the quantum circuit is unitary, the number of neurons in the quantum neural network is constant, i.e., equal to the system's number of qubits. Thus, one can conclude that the quantum policy matches or even outperforms the classical policy with a drastically reduced total number of trainable parameters.

Table 3 Number of parameters trained for both environments (Env, environment; I, input layer; O, output layer; #N, neurons; #R, rotations per qubit; #P, parameters)

5.2 The effect of initialization

The parameters' initialization strategy can dramatically improve the convergence of a machine learning algorithm. Random initialization is often used to break the symmetry between different neurons (Goodfellow et al. 2016). However, if the parameters are arbitrarily large, the activation function may saturate, hindering the learning task. Therefore, parameters are often drawn from specific distributions. For instance, the Glorot initialization strategy (Glorot and Bengio 2010) is among the most commonly used, balancing initialization and regularization (Goodfellow et al. 2016).

In quantum machine learning models, the problem persists. However, it was verified experimentally that the Glorot initialization has a slight advantage compared to other strategies. The empirical results reported in Section 5.1 were obtained using such a strategy.

The Glorot strategy samples the parameters of the network from a normal distribution \(\mathcal {N}(0,std^{2})\) with standard deviation given by Eq. 23:

$$ std = \text{gain} * \sqrt{\frac{6}{{n_{\text{in}} + n_{\text{out}}}}} $$
(23)

where gain is a constant multiplicative factor. nin and nout are the number of inputs and outputs of a layer, respectively. It was devised to initialize all layers with approximately the same activation and gradient variance, assuming that the neural network does not have nonlinear activations, being thus reducible to a chain of matrix multiplications. The latter assumption motivates this strategy in quantum learning models since they are composed of unitary layers without nonlinearities. The only nonlinearity is introduced by the measurement (Nielsen and Chuang 2011).
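A sketch of this initialization for the variational parameters follows, using the standard deviation of Eq. 23 as written; taking both fan-in and fan-out to be the number of qubits of the (square) variational layer is an assumption of the example.

```python
import numpy as np

def glorot_init(n_layers, n_qubits, gain=1.0):
    """Sample variational angles from N(0, std^2) with std given by Eq. 23."""
    std = gain * np.sqrt(6.0 / (n_qubits + n_qubits))   # n_in = n_out = n_qubits (assumption)
    return np.random.normal(0.0, std, size=(n_layers, n_qubits, 2))
```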

Figure 7a, b, and c plot the average reward obtained by the quantum agent in the CartPole, Acrobot, and QControl environments, respectively, under the most common initialization strategies. Glorot initialization shows slightly better performance and stability. Moreover, it is verified empirically that, for policy gradients, initialization from normal distributions generates better results in the classic environments than initialization from uniform distributions, as reported in Zhang et al. (2022) for standard machine learning cost functions. However, the same behavior was not observed in the QControl task, since uniform sampling U(− 1,1) achieves performance similar to N(0,1).

Fig. 7: Normal and uniform distributions used to initialize the parameters of the variational softmax policy. Panels a, b, and c represent the average reward comparison for the Cartpole, Acrobot, and QControl environments, respectively

6 Quantum enhancements

In this section, further steps are taken toward studying the possible advantages of quantum RL agents following two different strategies:

  • Parameter count — Comparison between quantum and classical agents regarding the number of parameters trained. It is unclear whether this is a robust approach to quantify advantage, given that the number of parameters alone can be misleading. For example, the function \(\sin \limits (\theta )\) has a single parameter and is more complex than the polynomial ax^3 + bx^2 + cx + d, which has four. Nevertheless, smaller networks could enable solutions for larger problems at a smaller cost. Even though only parameter-shift rules are allowed on real quantum hardware, they incur a lower memory cost than backpropagation; for large enough problems, the difference in training time may be negligible given the tradeoff between memory and time consumption. As reported in Table 3, there is a massive reduction in the number of parameters of the quantum neural network compared with its classical counterpart for all three simulated environments.

  • Fisher information — The Fisher information matrix spectrum is related to the effect of barren plateaus in the optimization surface itself. Studying the properties of the matrix eigenvalues should help to explain the hardness of training.

The Fisher information (Ly et al. 2017) is crucial in both computation and statistics as a measure of the amount of information that a random variable X carries about the parameters θ of a statistical model. Its most general form amounts to the negative Hessian of the log-likelihood. Suppose a data point x is sampled i.i.d. from \(p(x \lvert \theta )\), where \(\theta \in \mathbb {R}^{k}\). Since the Hessian reveals information about the curvature of a function, the Fisher information matrix (see Eq. 24) captures the sensitivity to changes in the parameter space, i.e., changes in the curvature of the loss function.

$$ F(\theta) = \mathbb{E}_{x \sim p} \left[ \nabla_{\theta} \log p(x \lvert \theta) \nabla_{\theta} \log p(x \lvert \theta)^{\top}\right] \in \mathbb{R}^{k \times k} $$
(24)

The Fisher information matrix is computationally demanding to obtain. Thus, the empirical Fisher information matrix is usually used in practice and can be computed as in Eq. 25:

$$ F(\theta) = \frac{1}{T} \sum\limits_{i=1}^{T} \nabla_{\theta} \log p(x_{i} \lvert \theta) \nabla_{\theta} \log p(x_{i} \lvert \theta)^{\top} $$
(25)

Equation 25 captures the curvature of the score function at all parameter combinations. That is, it can be used as a measure for studying barren plateaus in maximum likelihood estimators (Karakida et al. 2019), given that all the matrix entries will approach zero with the flatness of the model’s landscape. This effect is captured by looking at the spectrum of the matrix. If the model is in a barren plateau, then the eigenvalues of the matrix will approach zero (Abbas et al. 2021).

In the context of policy gradients, the empirical Fisher information matrix (Kakade 2001) is obtained by multiplying the vector resultant of the gradient of the log-policy with its transpose as in Eq. 26:

$$ F(\theta) = \frac{1}{T} \sum\limits_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t} \lvert s_{t},\theta) \nabla_{\theta} \log \pi(a_{t} \lvert s_{t},\theta)^{\top} $$
(26)

Inspecting the spectrum of the matrix in Eq. 26 reveals the flatness of the loss landscape. Thus, it can be used to assess the hardness of training for RL agents based on both classical neural networks and VQCs (Abbas et al. 2021). This work considers the trace and the eigenvalue probability density of the Fisher information matrix: the trace approaches zero as the model approaches a barren plateau, while the eigenvalue probability density reveals the magnitude of the associated eigenvalues.
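A direct computation of Eq. 26 and its spectrum could look as follows; the input is assumed to be the per-time-step gradients of the log-policy, each flattened to a vector of length k.

```python
import numpy as np

def empirical_fisher(log_policy_grads):
    """Empirical Fisher information matrix of Eq. 26, its trace, and its eigenvalues."""
    grads = np.asarray(log_policy_grads)            # shape (T, k)
    fisher = grads.T @ grads / grads.shape[0]       # (1/T) sum_t g_t g_t^T
    eigenvalues = np.linalg.eigvalsh(fisher)        # spectrum used in Fig. 8
    return fisher, fisher.trace(), eigenvalues
```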

Figure 8a, b, and c plot the eigenvalue distribution of the Fisher information matrix, averaged over episodes during the entire training, for the CartPole, Acrobot, and QControl environments, respectively. Subpanels in every plot indicate the associated trace of the matrix. On average, the Fisher information matrix of the quantum model exhibits a significantly larger density of nonzero eigenvalues than that of the classical model during the entire training. The same behavior is observed in every environment, which helps explain the better training performance of the quantum agents (Section 5) compared to the classical ones. Although it is not visible from the eigenvalue distribution, the classical model has larger eigenvalues than the quantum model; however, their density is extremely small, making them negligible in a distribution plot. Further analysis is required to thoroughly understand the behavior of both classical and quantum agents.

Fig. 8: Probability density of the Fisher information matrix eigenvalues and average trace. Panels a, b, and c represent the eigenvalue distribution and trace of the Fisher information matrix for the Cartpole, Acrobot, and QControl environments, respectively

7 Conclusion

In this work, a VQC was embedded into the decision-making process of an RL agent, following the policy gradient algorithm, solving a set of standard benchmarking environments efficiently. Empirical results demonstrate that such variational quantum models behave similarly or even outperform several typically used classical neural networks. The quantum-inspired policy needs fewer interactions to converge to an optimal behavior, benefiting from a reduction in the total number of trainable parameters.

Parameter-shift rules were used to perform gradient-based optimization resorting to the same quantum model used to compute the policy. It was proved that the sample complexity for gradient estimation via parameter-shift rules grows only logarithmically with the number of parameters.

The Fisher information spectrum was used to study the effect of barren plateaus in quantum policy gradients. The spectrum indicates that the quantum model comprises larger eigenvalues than its classical counterpart, suggesting that the optimization surface is less prone to plateaus.

Finally, it was verified that the quantum model could prepare a single-qubit state with high fidelity in fewer episodes than the classical counterpart with a single layer.

Concerning future work, it would be interesting to apply such RL-based variational quantum models to quantum control problems of larger dimensions. Specifically, their application to noisy environments would be of general interest. Moreover, studying the expectation value of policy gradients under a specific initialization strategy, to support the empirical claims, is crucial. Finally, the quantum Fisher information (Meyer 2021) should be addressed to analyze the information carried by quantum states; embedding it in a natural gradient optimization (Stokes et al. 2020) would yield quantum natural policy gradients. Advanced RL models such as actor-critic or deep deterministic policy gradient (DDPG) could also benefit from quantum-aware optimization.