1 Introduction

Efficient communication has played a crucial role in the evolution of all civilizations since antiquity [1, 2]. As the evolution continued, our society began to globalize, and so have our communications needs, ultimately leading to the introduction of the Internet [3]. Nowadays, even these classical communication networks seem outdated, facing the development in the field of quantum communications [4]. The first proposed quantum communications protocols were designed for a one-to-one quantum key distribution (QKD) [5,6,7]. Subsequent strategies to encompass more parties have been proposed [7, 8]. One promising approach is the so-called teleportation-based quantum networks facilitated by standard or controlled teleportation [9, 10]. The idea is to concatenate many teleportation-based cells into a large global network, i.e., the quantum Internet [11]. Many studies have discussed this concept’s potential [12,13,14], the network’s possible topologies [15], platforms to realize it on [16], and fundamental problems to overcome [17].

Quantum networks, however, aim beyond mere QKD, and this must be considered when designing them. The major problem that needs to be addressed is finding an efficient method for optimal dynamic routing in these large-scale quantum networks. Teleportation (entanglement swapping) is, for now, the best-known method for establishing connections between distant parties in quantum networks [18]. Note also that entanglement swapping is the core ingredient of quantum repeaters [19] and relays [20], allowing one to combat the unfavorable scaling of losses. The unique features of quantum information prevent the reliable use of classical tools such as shortest-path and tree-search algorithms [21]. This paper provides solutions to the routing problem in teleportation-based networks using reinforcement machine learning. The connection between two distant parties (Alice and Bob) in these networks is established by repeated entanglement swapping at several intermediate nodes, resulting in an entangled state \(\hat{\phi }\) shared by Alice and Bob (see Fig. 1) [22]. Once they share an entangled state, Alice and Bob are free to use it for secret key sharing [23], quantum state teleportation [24], or dense coding [25].

Consecutive entanglement swapping in practical quantum networks will necessarily result in entanglement decay. It is therefore imperative to employ methods that minimize this adverse effect. This paper provides such a recipe in the form of reinforcement machine learning that efficiently searches for the best entanglement-preserving route between two given nodes. We benchmark our method against a naive Monte Carlo search and show that reinforcement learning performs considerably better both in terms of resulting entanglement quality and in search speed. Moreover, we present two quantum-specific examples where intelligent routing allows restoring partially decayed entanglement (the cases of amplitude damping and correlated phase noise).

Fig. 1

Schemes representing entanglement swapping from the initial to the final point in the quantum communications network. a Each node possesses a pair of particles belonging to two different entangled states, symbolized by the two black balls inside the nodes; the black lines between nodes depict quantum channels, the green node symbolizes the initial point “Alice”, red balls depict intermediate nodes used for entanglement swapping, and the blue ball represents the final point “Bob”. Bell icons mark where Bell measurements take place, and swap icons highlight where entanglement swapping is carried out. b Characterization of the route between the initial and final points, where \(F_0, \ldots , F_n\) are the singlet fractions shared by neighboring nodes (Color figure online)

There are many possible ways to quantify the quality of repeated entanglement swapping and of the shared entangled state between Alice and Bob [26, 27]. We chose the singlet fraction F as the figure of merit because, for bipartite entangled states, F can be directly used to evaluate the usefulness of \(\hat{\phi }\) for quantum teleportation [28]. The singlet fraction

$$\begin{aligned} F(\hat{\phi })=\max _{\vert \psi \rangle } \langle \psi \vert \hat{\phi }\vert \psi \rangle , \end{aligned}$$
(1)

is defined as the maximal overlap of the investigated state \(\hat{\phi }\) with any maximally entangled state \(\vert \psi \rangle \). The maximal achievable teleportation fidelity f of a qubit state is then calculated as:

$$\begin{aligned} f=\frac{2F+1}{3}. \end{aligned}$$
(2)
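As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch (not the code from the Digital Supplement; all names are illustrative) estimates the singlet fraction of a two-qubit density matrix by sampling maximally entangled states of the form \((\hat{1}\otimes U)\vert \phi ^+\rangle \) with Haar-random unitaries U and keeping the largest overlap, and then converts the estimate into the teleportation fidelity f.

import numpy as np

PHI_PLUS = np.array([1, 0, 0, 1]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)

def random_unitary(rng):
    # Haar-random 2x2 unitary via QR decomposition of a complex Gaussian matrix
    z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def singlet_fraction(rho, samples=20000, seed=0):
    # Monte Carlo estimate of F(rho) = max <psi|rho|psi> over |psi> = (1 x U)|phi+>
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(samples):
        psi = np.kron(np.eye(2), random_unitary(rng)) @ PHI_PLUS
        best = max(best, float(np.real(psi.conj() @ rho @ psi)))
    return best

def teleportation_fidelity(F):
    # Eq. (2)
    return (2 * F + 1) / 3

# Example: 90% singlet mixed with 10% white noise; the exact F is 0.9 + 0.1/4 = 0.925
PSI_MINUS = np.array([0, 1, -1, 0]) / np.sqrt(2)
rho = 0.9 * np.outer(PSI_MINUS, PSI_MINUS) + 0.1 * np.eye(4) / 4
F = singlet_fraction(rho)
print(F, teleportation_fidelity(F))

The random-sampling estimate approaches the exact value from below; in practice one would replace the sampling loop with a proper optimization over U.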

One might naively think that the singlet fraction \(F_{AB}\) of the final state shared by Alice and Bob is obtained as the product of the singlet fractions of the entangled states involved in the n repeated entanglement swappings (see Fig. 1)

$$\begin{aligned} F_{AB} = \prod _{i= 0}^{n}F_i, \end{aligned}$$
(3)

alternatively, one can establish an effective distance d between Alice and Bob using a logarithm of the singlet fraction

$$\begin{aligned} d= -\log F_{AB}=-\sum _i \log F_i. \end{aligned}$$
(4)

However, this is not generally true. For example, in the cases of amplitude damping [29] or correlated phase noise [30], errors can cancel each other out. If Eqs. (3) and (4) were to hold, one could assign a single quantifier to each quantum channel between nodes and use any graph path- or tree-finding algorithm, such as Dijkstra's algorithm, to find the route minimizing the distance d [31]. As we show later in this paper, this yields suboptimal solutions. Note that even prominent dynamic algorithms such as Bellman–Ford [32] and A* [33] cannot handle these types of errors.
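For reference, the sketch below shows this classical baseline: each channel is assigned the additive weight \(-\log F_i\) of Eq. (4) and Dijkstra's algorithm is run on a toy graph (the graph and the helper name are illustrative, not our actual topology). By construction it maximizes the product of Eq. (3) and is therefore blind to the error cancellation just described.

import heapq
import math

def dijkstra_max_F(graph, source, target):
    # graph[u] = list of (v, F) edges; minimizing d = -sum(log F) maximizes prod(F)
    dist, prev, seen = {source: 0.0}, {}, set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == target:
            break
        for v, F in graph.get(u, []):
            nd = d - math.log(F)                 # effective distance of Eq. (4)
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target
    while node != source:                        # reconstruct the route
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[target])   # route and its predicted product F_AB

# Toy example with illustrative singlet fractions per connection
graph = {"Alice": [("n1", 0.99), ("n2", 0.60)],
         "n1": [("Bob", 0.99)],
         "n2": [("Bob", 0.99)]}
print(dijkstra_max_F(graph, "Alice", "Bob"))     # (['Alice', 'n1', 'Bob'], ~0.98)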

One possible solution capable of handling quantum effects is brute force in the form of a Monte Carlo algorithm. The only downside is Monte Carlo's exponential scaling with the number of nodes. Such scaling becomes a showstopper, especially in the case of an evolving network where the search needs to be repeatedly executed. Hence, a smarter strategy needs to be adopted. In this paper, we propose using proximal policy optimization (PPO), an artificial intelligence-based algorithm developed to solve complex evolving problems [34]. This algorithm is commonly used in the gaming industry, where we found inspiration for how to approach the routing problem. We designed our network as a game map for the agent to play on, with the goal of finding the optimal path through the quantum network. We compare the performance of the PPO against the Monte Carlo and Dijkstra algorithms, demonstrating PPO's virtues.

2 Quantum network topology

We found the inspiration for our network topology in the low-density parity-check code structure, one of the possible topologies considered for designing 6G networks. For details on the topology, see Fig. 2 [35]. This network simulates a real-world scenario where several local users form groups connected among themselves by central nodes. We chose this particular topology mainly due to its robustness against local connection problems, contributing to steady performance. In case of a random malfunction in any specific node, this topology offers several possible reroutes to ensure stability. Each connection in the network structure represents a quantum channel through which two neighboring nodes share an entangled two-qubit state. For simplicity, we limit the network topology to a maximum of 4 connections per node. Moreover, each node can perform entanglement swapping, i.e., a Bell measurement. All shared entangled states are fully characterized by their density matrices. This representation allows us to fully describe how noisy or damaged each connection is. We can easily simulate different sources of disturbance, such as white noise in the channel or amplitude damping. For an overview of the initialization part, see Pseudocode I. These essential characteristics enable us to simulate various scenarios in the communications network that we later present in the Results section.

Pseudocode I: Initialization of the quantum network

define nodes[N]                    // vector of length N
define connections[N; 4]           // matrix: N nodes \(\times \) max. 4 connections
define shared_states[N; 4; 4; 4]   // defines a \(4 \times 4\) (two-qubit) density matrix of the
                                   // shared state per connection; for density matrices, see Eqs. (5)–(7)
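A minimal NumPy sketch of such an initialization is given below; the variable names and the adjacency bookkeeping are illustrative assumptions rather than the exact structures of the Digital Supplement. The density matrices of the individual connections are produced by the noise models of Eqs. (5)–(7), sketched after those equations.

import numpy as np

N = 30                     # number of nodes (illustrative value)
MAX_CONN = 4               # at most 4 connections per node

# connections[i, k] = index of the k-th neighbor of node i (-1 marks an unused slot)
connections = -np.ones((N, MAX_CONN), dtype=int)

# shared_states[i, k] = 4x4 two-qubit density matrix shared over that connection
shared_states = np.zeros((N, MAX_CONN, 4, 4), dtype=complex)

def add_connection(i, j, rho):
    # register an undirected connection i <-> j sharing the two-qubit state rho
    for a, b in ((i, j), (j, i)):
        slot = int(np.argmax(connections[a] == -1))    # first free slot (assumes one exists)
        connections[a, slot] = b
        shared_states[a, slot] = rho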

The ultimate goal is to distribute an entangled state between Alice and Bob. As mentioned in the previous section, the quality of this state is given in terms of the singlet fraction \(F_{AB}\), which we maximize. We cast this task as a “game” for the tested algorithms to play. The final reward is received in proportion to \(F_{AB}\). Every connection can be used only once per game because the entangled pair is consumed in entanglement swapping. We chose the initial and final users’ positions so that the agent can successfully connect them within a given number of actions. The action count is further used to compare agent performance because it represents the consumption of resources (i.e., computational time and entangled pairs). We made routing in the network realistic by ensuring that even the unperturbed connections distribute states with a singlet fraction of only \(F=0.99\), adding a corresponding amount of white noise and thereby forcing the PPO algorithm toward shortest-path solutions. To represent white noise, we model the shared entangled states as Werner states:

$$\begin{aligned} \hat{\rho }_w= p\vert \psi ^-\rangle \langle \psi ^-\vert + (1-p)\hat{1}/4. \end{aligned}$$
(5)

\(\vert \psi ^-\rangle = (\vert 01\rangle - \vert 10\rangle )/\sqrt{2}\) represents the singlet Bell state, \(\hat{1}/4\) stands for the maximally mixed state, and p is the mixing parameter. Amplitude damping, on the other hand, is represented by generalizing the Bell state \(\vert \psi ^-\rangle \) to

$$\begin{aligned} \vert \psi _g^-(\theta )\rangle =\cos (\theta )\vert 01\rangle -\sin (\theta )\vert 10\rangle , \end{aligned}$$
(6)

where \(\theta \in [0;\frac{\pi }{2}]\) is the damping parameter. Lastly, an arbitrary phase shift can be described as:

$$\begin{aligned} \vert \psi _s^-(\phi )\rangle =(\vert 01\rangle -e^{i\phi }\vert 10\rangle )/\sqrt{2}, \end{aligned}$$
(7)

where \(\phi \in [0;\pi ]\) is the phase shift parameter and, if uncompensated and random, renders the state shared between Alice and Bob effectively mixed.
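For concreteness, the three noise models of Eqs. (5)–(7) translate into density matrices as in the following short sketch (the helper names are illustrative):

import numpy as np

def ket(amplitudes):
    return np.array(amplitudes, dtype=complex)

PSI_MINUS = ket([0, 1, -1, 0]) / np.sqrt(2)

def werner(p):
    # Eq. (5): singlet mixed with white noise, mixing parameter p; its singlet fraction is (3p + 1)/4
    return p * np.outer(PSI_MINUS, PSI_MINUS.conj()) + (1 - p) * np.eye(4) / 4

def amplitude_damped(theta):
    # Eq. (6): amplitude-damped Bell state, theta in [0, pi/2]
    psi = ket([0, np.cos(theta), -np.sin(theta), 0])
    return np.outer(psi, psi.conj())

def phase_shifted(phi):
    # Eq. (7): correlated phase shift, phi in [0, pi]
    psi = ket([0, 1, -np.exp(1j * phi), 0]) / np.sqrt(2)
    return np.outer(psi, psi.conj())

# e.g., an unperturbed link with F = 0.99 corresponds to werner(p) with p = (4 * 0.99 - 1) / 3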

Fig. 2

This figure shows a visualization of the quantum communications network. The entangled photon pair marked “Alice” represents the initial position for our agents, and the blue circle named “Bob” marks the ending point of the route. Full black lines indicate possible routes for entanglement swapping, thick red lines highlight the optimal solution under ideal conditions, and black cubes represent primary connection nodes between local clusters (Color figure online)

3 Routing algorithms

We tested different algorithms capable of solving routing in quantum networks and compared their performance. Namely, we tested the PPO, the Dijkstra algorithm, and the Monte Carlo method on the quantum communications network (see Fig. 3). PPO is a policy gradient method for reinforcement learning that uses multiple epochs of stochastic gradient ascent to perform each policy update. It is well known for its simplicity of implementation across various problems and for its overall performance compared to algorithms of a similar family. We use the Stable Baselines3 framework [36] and its implementation of the PPO in our work.

Fig. 3

The aim is to identify the optimal approach toward route finding in quantum networks (i.e., the environment). The agent can choose from three algorithms: Monte Carlo, Dijkstra, and PPO. Ultimately, we can compare their performance and the amounts of resources they consume and thus identify the most suitable candidate for solving a given task

The PPO agent starts at Alice's node. It can choose from at most four actions, corresponding to the maximum number of connections any node can have. If the agent chooses an invalid action (i.e., a non-existing connection), the game ends with a negative reward. If a valid link is selected, the agent moves to the node reached by the chosen connection (action). At this point, entanglement swapping is implemented, leading to a shared entangled state between Alice and the connected node. Selecting a connection and implementing entanglement swapping constitutes one action. The agent is limited to a maximum of 15 actions; if they are depleted, the game ends. The preliminary reward is calculated at the end of each action using the formula:

$$\begin{aligned} R_p=F_{A_i}-F_{A_{i-1}}, \end{aligned}$$
(8)

where \(F_{A_i}\) stands for the singlet fraction of the newly established entangled state, while \(F_{A_{i-1}}\) is the singlet fraction resulting from entanglement swapping in the preceding action (\(F_{A_0}=1\) in the case of the first action). We tuned the n_steps hyperparameter of the PPO according to the complexity of the designed quantum network topology. Note that the n_steps hyperparameter determines the number of actions the agent takes before updating the parameters of its policy. We kept all other hyperparameters at their default values because we did not notice significant changes when tuning them. It is the structure of the reward function that has the most noticeable influence on the agent's performance. We save the PPO's policy after every 100–5000 games, depending on the scenario's complexity. If the agent reaches the final destination (Bob), it receives a final reward

$$\begin{aligned} R=100F_{AB}. \end{aligned}$$
(9)

In the case of the Monte Carlo algorithm, we applied the same game rules as for the PPO so that a straightforward comparison can be made. The only difference is that Monte Carlo chooses its actions randomly in each game, with no intelligent policy. For an overview of the route-finding algorithm, see Pseudocode II. When using the PPO algorithm, its policy-predicting neural network is used to pick the connection in a given state (see pick connection in Pseudocode II). Based on the calculated reward (see calculate preliminary reward in Pseudocode II), the critic neural network of the PPO algorithm estimates the advantage and, from it, the loss function. Subsequently, both of the aforementioned neural networks are updated using back-propagation of the loss-function gradient. A detailed description of the working principle of the PPO algorithm itself is provided in “Appendix”.

Pseudocode II: Route-finding

set agent_node = Alice
set agent_state = singlet // initial state held by the agent
actions_used = 0

repeat:
     actions_used = actions_used + 1
     pick connection // from current node to a next one
        // PPO: initially at random, subsequently based on learned policy
        // Monte Carlo: always at random
     update agent_state // execute entanglement swapping on the present state held by
        // the agent using the density matrix of the chosen connection
     calculate singlet fraction of agent_state
     calculate preliminary reward // see Eq. (8)
     set agent_node = next node // depending on the connection chosen

until: agent_node = Bob
     or actions_used > 15
     or no available connections exist at agent_node

calculate final reward // see Eq. (9)
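The update agent_state step above amounts to a Bell-state projection on the two qubits held by the intermediate node, followed by a partial trace. Below is a minimal NumPy sketch of one such update; for brevity it post-selects on the \(\vert \phi ^+\rangle \) outcome instead of treating all four Bell outcomes with their local corrections, and the function name is illustrative.

import numpy as np

PHI_PLUS = np.array([1, 0, 0, 1]) / np.sqrt(2)      # (|00> + |11>)/sqrt(2)
P_BELL = np.outer(PHI_PLUS, PHI_PLUS.conj())        # projector applied to the middle qubit pair

def swap_step(rho_agent, rho_link):
    # rho_agent: 4x4 state shared by Alice (qubit A) and the current node (qubit B)
    # rho_link : 4x4 state shared by the current node (qubit C) and the next node (qubit D)
    # returns  : 4x4 post-selected state shared by A and D after the Bell measurement on B, C
    rho = np.kron(rho_agent, rho_link)               # qubit order A, B, C, D
    proj = np.kron(np.kron(np.eye(2), P_BELL), np.eye(2))
    rho = proj @ rho @ proj
    rho = rho.reshape([2] * 8)                       # axes: A, B, C, D, A', B', C', D'
    rho_ad = np.einsum('abcdebch->adeh', rho)        # partial trace over B = B' and C = C'
    rho_ad = rho_ad.reshape(4, 4)
    return rho_ad / np.trace(rho_ad)                 # renormalize after post-selection

The singlet fraction of the returned state is what enters the preliminary reward of Eq. (8) at every step.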

Dijkstra's algorithm, on the other hand, needs more information and a different data structure for the task. Unlike the previous agents, it needs to know the exact topology of the communications network in advance, as well as information about each connection. Therefore, the Dijkstra algorithm does not operate under the same conditions as the previously mentioned agents. At the expense of requiring all this information, it is very efficient at finding the distance d from Alice to Bob. A brief description of the working principles of the PPO and Dijkstra algorithms is presented in “Appendix”, and the entire Python code is available as a Digital Supplement.
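To give a flavor of how the PPO agent is wired up in practice, the sketch below wraps the routing game in a Gymnasium environment and trains the Stable Baselines3 PPO on it. The QuantumRoutingEnv internals and the network object are hypothetical placeholders for the simulation described above (the actual implementation is in the Digital Supplement); only the Gymnasium and Stable Baselines3 calls reflect the real library APIs.

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO

class QuantumRoutingEnv(gym.Env):
    # toy stand-in: observation = one-hot of the current node, action = connection index (0-3)

    def __init__(self, network, max_actions=15):
        super().__init__()
        self.network = network                        # placeholder object holding nodes/connections/states
        self.max_actions = max_actions
        self.action_space = spaces.Discrete(4)        # at most 4 connections per node
        self.observation_space = spaces.Box(0.0, 1.0, shape=(network.n_nodes,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.network.reset()                          # new game: back to Alice, fresh entangled pairs
        return self.network.observation(), {}

    def step(self, action):
        if not self.network.take_connection(action):  # invalid (non-existing) connection
            return self.network.observation(), -1.0, True, False, {}
        reward = self.network.preliminary_reward()    # Eq. (8)
        done = self.network.reached_bob() or self.network.actions_used >= self.max_actions
        if self.network.reached_bob():
            reward += 100 * self.network.singlet_fraction_AB()   # Eq. (9)
        return self.network.observation(), reward, done, False, {}

# model = PPO("MlpPolicy", QuantumRoutingEnv(network), n_steps=256, verbose=0)
# model.learn(total_timesteps=500_000)                # n_steps tuned to scenario complexity
# model.save("ppo_quantum_routing")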

4 Results

First, we investigate routing in a quantum network burdened solely by white noise. This scenario is close to a classical network because white noise is additive and cannot be compensated for. In a quantum network, however, other types of errors can occur. As examples of such errors, we consider amplitude damping and correlated phase noise, which we investigate in the second and third subsections. Finally, a dynamically evolving network is considered in the last subsection.

4.1 Network affected by white noise

We start with a completely operational network (see the first topology in Fig. 9 in Appendix). All connections are characterized by a singlet fraction of \(F=0.99\). Optimal routing through this network between Alice and Bob involves 6 intermediary nodes. We then started introducing damaged connections (i.e., connections with \(F= 0.6\)), thus increasing the number of intermediary nodes (8, 10, 12, 14, 16) required to find the optimal solution. The optimal routing paths under those circumstances are shown in Fig. 9 as well. The performances of the three agents (PPO, Monte Carlo, Dijkstra) are summarized in Table 1. The results show that the Monte Carlo method performs worse than PPO even in the simplest scenario (a fully operational network with 6 intermediary nodes required to find a solution). The more complex the scenarios become, the more prominent the PPO's performance gain is. More specifically, in the case of a network where at least 16 intermediary nodes are required to find a solution, PPO outperforms the Monte Carlo method by a factor of about 13000. For a visualization, see Fig. 4. Given the additivity of white noise, the Dijkstra algorithm significantly outperforms both PPO and Monte Carlo in these almost classical scenarios once the complexity surpasses 8 intermediary nodes required to complete the task. However, the situation changes significantly with the introduction of purely quantum noise.

Table 1 Results of the three agents applied to the networks of different complexities
Fig. 4

The graph compares the performance of the PPO algorithm, represented by the red (right) columns, and the Monte Carlo algorithm, depicted by black (left) columns, on different scenarios requiring a given number of passes through intermediary nodes in the quantum networks (Color figure online)

4.2 Network affected by amplitude damping

Amplitude damping, as introduced in Eq. (6), skews the amplitude balance toward one of the two components (\(\vert 01\rangle \) or \(\vert 10\rangle \)). As a result, the singlet fraction decreases. Two connections with mutually opposite component damping can rebalance the amplitudes, increasing the singlet fraction (at the expense of overall losses). This feature is intractable for greedy or dynamic algorithms such as Dijkstra, Bellman–Ford, or A*. In order to use those algorithms, one would need to save all preliminary solutions and compare them, which would cause exponential scaling of the algorithm's complexity. Ultimately, the agent needs to figure out that, in order to complete the task, it must find a route where the individual amplitude dampings cancel each other out as much as possible; a worked example of this cancellation is given below. We forced the agent to use this strategy by designing scenarios where it must choose at least one amplitude-damped connection to reach the final destination. Moreover, the resulting singlet fraction is maximized when a second (oppositely) amplitude-damped connection is chosen by the agent. Similar to the previous subsection, we present the agent with scenarios ranging from 6 to 16 intermediary nodes (see Fig. 10).
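To see explicitly how two oppositely damped links can compensate each other, consider (as an illustrative calculation for one particular Bell-measurement outcome, up to local corrections) swapping \(\vert \psi _g^-(\theta )\rangle _{AB}\) and \(\vert \psi _g^-(\theta ')\rangle _{CD}\) from Eq. (6) by projecting qubits B and C onto \(\vert \psi ^+\rangle =(\vert 01\rangle +\vert 10\rangle )/\sqrt{2}\):

$$\begin{aligned} \langle \psi ^+\vert _{BC}\big (\vert \psi _g^-(\theta )\rangle _{AB}\otimes \vert \psi _g^-(\theta ')\rangle _{CD}\big )\propto \cos \theta \cos \theta '\,\vert 01\rangle _{AD}+\sin \theta \sin \theta '\,\vert 10\rangle _{AD}. \end{aligned}$$

For \(\theta '=\pi /2-\theta \) both coefficients equal \(\cos \theta \sin \theta \) and the swapped pair is maximally entangled up to a local correction, whereas for \(\theta '=\theta \) the imbalance deepens to \(\cos ^2\theta \) versus \(\sin ^2\theta \). The outcome is probabilistic, but it shows why a route combining oppositely damped links can reach a higher \(F_{AB}\) than the product formula (3) would suggest.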

The agents' performance is summarized in Table 2 and plotted in Fig. 5. One can notice that, due to these complex initial conditions, both the Monte Carlo and PPO algorithms require more actions to solve even the initial (i.e., 6 intermediary nodes) scenario. However, one can observe a similar difference in scaling between PPO and Monte Carlo as in the previous scenario, i.e., Monte Carlo scales considerably less favorably. The PPO outperforms Monte Carlo from the beginning, and in the case of the most complex scenario, i.e., 16 intermediary nodes, the PPO outperforms the Monte Carlo method by a factor of about 9000.

Table 2 Results of the two agents applied to the networks of different complexity, including effects such as amplitude damping
Fig. 5

The graph compares the performance of the PPO algorithm, represented by the red (right) columns, and the Monte Carlo algorithm, depicted by black (left) columns, on different scenarios in the quantum networks where we also introduced connections affected by amplitude damping (Color figure online)

4.3 Network affected by correlated phase noise

This subsection demonstrates how the agents handle another type of reversible damage, caused by correlated phase noise. These scenarios are motivated by one of the practical approaches toward quantum information distribution proposed by Xu et al. [30]. Testing the Dijkstra algorithm is again pointless for the reasons mentioned in the previous subsection. To demonstrate the versatility of the PPO agent, a brand new set of scenarios involving 6–16 intermediary nodes was generated. For more details, see Fig. 11. In the current scenario, the agent starts from the initial node, Alice, and in the first action it can only choose from paths damaged by the correlated phase noise. The agent aims to search the network for a suitable path to reverse the initial correlated phase shift. If successful, it must then find the final node, Bob.

Results of this test are shown in Table 3 and plotted in Fig. 6. One can notice that the results once again support the PPO algorithm's superiority.

Table 3 Results of the two agents applied to the networks of different complexity, including effects such as correlated phase noise
Fig. 6

The graph compares the performance of the PPO algorithm, represented by the red (right) columns, and the Monte Carlo algorithm, depicted by the black (left) columns, on different scenarios staged in the quantum networks where we also introduced connections causing correlated phase noise (Color figure online)

Fig. 7

Illustration of the agents' performance on the evolving quantum communication network. Thin stripes show the overall functionality of the quantum network throughout its evolution, and thick stripes show the functionality during each scenario of its evolution

4.4 Evolving quantum network

Ultimately, we test the agents on dynamically evolving scenarios in our quantum network. The agents' goal in this final test is to maximize the overall functionality of the network throughout the evolution. These scenarios reflect the realistic behavior of real-world quantum networks, where various errors appear at random places and times. The entire routing task lasts for \(10^6\) actions, during which the quantum network undergoes ten scenarios (i.e., ten events in which various connections become damaged or restored). The evolution continues regardless of the agent's success. In this final test, we use all three types of errors discussed in the previous sections, namely white noise, amplitude damping, and correlated phase noise. To make the interpretation of the results clear, we set some ground rules. Suppose the agent finds a solution (i.e., a path between Alice and Bob with \(F>0.8\)) to the current scenario. In that case, it will use this solution for as long as its singlet fraction remains \(F>0.8\) (i.e., until the scenario evolves). The PPO agent at that point also saves its current policy. After the situation evolves, both agents search for a new solution: PPO starts searching from the last saved policy, and Monte Carlo starts randomly from scratch. Each evolution introduces errors such that the previous solution is no longer valid (\(F<0.8\)); hence, the agents need to find a new route. This condition does not mimic natural network behavior, but it represents the most extreme case, in which the PPO agent faces the most disadvantageous conditions. All evolutions of the quantum network are depicted in Fig. 12, and the resulting success rates are shown in Fig. 7. From the obtained results, it is clear that if we let the agents deal with an undamaged or slightly damaged network (scenarios 1, 2, 10), both agents can keep the network functional for more than 95% of the time. If the scenario becomes a bit more complex (scenario 6), the PPO agent noticeably outperforms the Monte Carlo agent. For even more complex scenarios, Monte Carlo could not find a solution within the given number of actions. Due to these poor results, Monte Carlo kept the network functional for only 33.1% of the overall time. On the other hand, the PPO found a solution in 10/10 scenarios and kept the network functional for 93.4% of the overall time. A sketch of this evaluation loop is given below.
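For clarity, the evaluation protocol of this final test can be summarized by the following illustrative Python-style sketch; every helper (network.maybe_evolve, agent.search_route, and so on) is a hypothetical placeholder for the routines described above, not code from the Digital Supplement.

TOTAL_ACTIONS = 10**6
F_THRESHOLD = 0.8

actions_spent, current_route = 0, None
while actions_spent < TOTAL_ACTIONS:
    network.maybe_evolve(actions_spent)              # ten scheduled damage/repair events
    if current_route is not None and network.route_fraction(current_route) > F_THRESHOLD:
        network.use_route(current_route)             # the network counts as functional
        actions_spent += 1
        continue
    # the previous solution has been invalidated: search for a new one
    current_route, used = agent.search_route(F_THRESHOLD)   # PPO resumes from its saved policy,
    actions_spent += used                                    # Monte Carlo restarts from scratch
    if current_route is not None and agent.is_ppo:
        agent.save_policy()                          # reused as the starting point after the next evolution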

5 Conclusions

This paper compares three different algorithms (PPO, Dijkstra, and Monte Carlo) for route-finding in quantum communication networks. We benchmark these algorithms on various scenarios in a realistic network topology using singlet fraction as the figure of merit. In these scenarios, we introduce additive white noise as well as purely quantum errors such as amplitude damping and correlated phase noise.

We explicitly show that the non-additivity of quantum errors prevents traditional graph path or tree-finding algorithms (Dijkstra) from finding the optimal solution. While the Monte Carlo search allows finding such optimal solutions, its exponential scaling makes its deployment prohibitive in large complex networks. We demonstrate that reinforcement machine learning in the form of the PPO algorithm circumvents the limitations of both aforementioned approaches. It can cope with purely quantum errors and, simultaneously, does not suffer from unfavorable scaling.

Our numerical model reveals that the PPO advantage over a mere Monte Carlo search becomes significant as the number of intermediary nodes in the path increases (e.g., for 16 intermediary nodes, PPO outperforms Monte Carlo by a factor of several thousand). Moreover, in a dynamically evolving quantum network, the PPO could maintain an operational route for about 93% of the time, while Monte Carlo managed less than 33%.

We believe that our research further promotes reinforcement learning as an invaluable method for improving quantum communications.