Introduction

Quantum computing is a new computing paradigm based on quantum mechanics that utilizes qubits instead of classical bits to store and process information1. Since the theoretical concepts were proposed2,3,4, quantum computers have developed at an astonishing speed, gradually moving from milestone achievements such as quantum supremacy in the laboratory5,6,7 to the stage of proof-of-principle application exploration8,9,10. Among its many applications, quantum machine learning is an emerging field that leverages the power of quantum computers to overcome the high computing-power bottlenecks of machine learning11,12,13,14. On current noisy intermediate-scale quantum devices15, one popular strategy for constructing quantum machine learning algorithms is to use classical-quantum hybrid optimization loops to train parameterized quantum circuits for various learning tasks, such as pattern recognition16,17 and classification18,19,20,21,22.

Similar to classical neural networks, which consist of input, hidden and output layers, the fundamental structure of variational quantum neural networks comprises data-encoding circuits, a variational ansatz, and output layers realized by quantum measurement23,24. To be specific, the data-encoding or quantum feature map process \({\mathcal{U}}(x)\) maps the classical data x ∈ χ to a quantum state in the Hilbert space \({\mathcal{H}}\). It serves as one of the main sources of non-linearity for the networks, and there exist numerous encoding strategies such as amplitude encoding and angle encoding25. Moreover, different choices of architecture for the variational ansatz \({\mathcal{W}}(\theta )\) containing trainable parameters θ lead to various quantum neural networks26,27,28,29,30,31,32,33,34,35,36 and greatly affect network performance such as generalization37,38 and trainability39,40. For example, general deep parameterized quantum circuits suffer from the barren plateau phenomenon, leading to vanishing gradients40,41,42,43,44. However, this can be avoided by networks with a hierarchical structure, proposed as a realization of the quantum convolutional neural network (QCNN)20,27,45, which has been proved to be free of barren plateaus46. Finally, the output of an n-qubit quantum neural network is the expectation value of a measurable observable O as

$$f(x,\theta )=\left\langle {\psi }_{0}\right\vert {U}_{\theta }^{{{{\dagger}}} }(x)O{U}_{\theta }(x)\left\vert {\psi }_{0}\right\rangle$$
(1)

where the initial state is \(\left\vert {\psi }_{0}\right\rangle ={\left\vert 0\right\rangle }^{\otimes n}\) and Uθ(x) is the parameterized quantum circuit consisting of repeatable data-encoding and trainable blocks. Interestingly, the expressivity and universality of such variational quantum models are guaranteed by the fact that one can naturally write the outputs as partial Fourier series in the network inputs47,48,49,50, where the accessible frequencies are determined by the eigenvalues of the generator Hamiltonian in the data-encoding gates, while the coefficients are controlled by the design of the entire circuit50.
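
To make this concrete, the following minimal NumPy sketch evaluates equation (1) for a single qubit, assuming an illustrative model with an Ry data-encoding gate sandwiched between two trainable Ry rotations and O = σz; the specific gate choices here are ours, for illustration only.

```python
import numpy as np

I2 = np.eye(2)
sz = np.diag([1.0, -1.0]).astype(complex)
sy = np.array([[0, -1j], [1j, 0]])

def Ry(a):
    # single-qubit rotation exp(-i a sigma_y / 2)
    return np.cos(a / 2) * I2 - 1j * np.sin(a / 2) * sy

def model_output(x, th1, th2, O=sz):
    psi0 = np.array([1, 0], dtype=complex)        # |0>
    psi = Ry(th2) @ Ry(x) @ Ry(th1) @ psi0        # U_theta(x)|0>
    return np.real(psi.conj() @ O @ psi)          # f(x, theta) = <psi|O|psi>

print(model_output(0.3, 0.1, 0.5))
```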

A great deal of research has subsequently been devoted to advancing quantum neural networks, with one intuitive approach being the quantization of classical networks31,32,33,34. In particular, inspired by classical residual neural networks, which were proposed to alleviate the vanishing-gradient problem in training deep neural networks51, their quantum counterparts are promising for mitigating barren plateaus34. The key idea is to introduce residual connections into traditional neural networks, as shown in Fig. 1. Mathematically, the residual connections provide an additional cross-layer propagation channel for the input features, leading to a basic residual unit of the form \({\mathcal{H}}(x)={\mathcal{F}}(x)+x\), where the non-linear parameterized function \({\mathcal{F}}(x)\) represents a traditional neural network. Although some works exist on the quantum realization of residual neural networks, the residual channels are usually implemented using classical or hybrid methods34,52. Research on fully quantum implementations of residual connections and their effects on expressivity is still lacking.

Fig. 1: Quantum residual neural networks.

a A schematic of the quantum neural networks with residual connections. The quantum feature map circuit \({\mathcal{U}}(x)\) and trainable variational circuit \({\mathcal{W}}(\theta )\) are repetitively implemented multiple times to form the multilayer structures. The \({\mathcal{R}}(x)\) and \({\mathcal{R}}(\theta )\) blocks labeled in red represent the data-encoding gates U(x) and parameterized gates W(θ) with residual connections. b The classical residual unit and its quantum counterpart. The residual connection channels are shown with blue arrows, and the output of the residual block is \({\mathcal{H}}(x)={\mathcal{F}}(x)+x\), where the non-linear function \({\mathcal{F}}(x)\) represents the classical neural network. The quantum residual operator \({\mathcal{R}}(\lozenge )\) (a unified expression for the \({\mathcal{R}}(x)\) and \({\mathcal{R}}(\theta )\) operators) implemented on the initial state \(\left\vert {\phi }_{0}\right\rangle\) can be realized in the subspace of an ancillary qubit with measurement results ma = 0/1. c The residual feature map can introduce more frequency components (blue) into the original spectra of quantum neural networks (gray), and also makes the Fourier expansion coefficients more flexible, whose ranges are represented by double dotted arrows.

In this work, we address these issues by proposing a quantum circuit-based algorithm to implement quantum residual neural networks (QResNets). The residual connection channel is constructed through one ancillary qubit, and the target evolution process is embedded in a subspace. Such structures are compatible with both the data-encoding and trainable blocks in variational quantum neural networks. We further parameterize the encoding gates on the auxiliary qubit and obtain generalized residual operators. Furthermore, we find that the Fourier spectrum of the output of parameterized quantum circuits is enriched when the residual connections are used for the data-encoding blocks. The number of frequency combination forms can be extended from one, namely the difference between sums of generator eigenvalues, to \({\mathcal{O}}({l}^{2})\) for l-layer residual encoding. Moreover, the diverse construction methods for frequencies in the residual outputs and the extra trainable parameters in the generalized residual operators expand the Fourier coefficient space. These results suggest that the expressivity of quantum models can be enhanced by residual connections. We offer extensive numerical demonstrations of the quantum algorithm on regression tasks of fitting Fourier series, and also present the performance of binary classification on the standard MNIST dataset of handwritten digit images53, achieving an accuracy improvement of over 7% with residual encoding. Our results show that the residual connections proposed in classical deep learning for improving trainability can also be used to improve expressivity in quantum neural networks, making QResNets a promising quantum learning model for real-life applications.

Results

Realization of quantum residual connection

In the QResNets, there are multiple layers of repeatable data-encoding blocks \({\mathcal{U}}(x)\) and trainable parameterized ansatz \({\mathcal{W}}(\theta )\), and residual connections can be adopted in some of the blocks, as shown in Fig. 1. The data-encoding block consists of quantum rotation gates of the form \(U(x)={e}^{iHx}\), where H is a generator Hamiltonian, while the trainable circuits are composed of single- and two-qubit parameterized quantum gates W(θ) with optimization parameters θ. Some gates in the data-encoding and ansatz blocks can be selected to add residual connections, forming quantum residual operators \({\mathcal{R}}(x)\) and \({\mathcal{R}}(\theta )\), which correspond to the residual evolution processes. We introduce a unified notation ♢, with ♢ = x for quantum gates in the data-encoding blocks and ♢ = θ in the trainable blocks. Then, for an n-qubit quantum system with initial state \(\left\vert {\phi }_{0}\right\rangle\), the evolution under the residual operator can be expressed as

$${{{{{{{\mathcal{R}}}}}}}}(\lozenge)\left\vert {\phi }_{0}\right\rangle =\frac{1}{2}\left({\sigma }_{0}^{\otimes n}+{{{{{{{\mathcal{L}}}}}}}} (\lozenge)\right)\left\vert {\phi }_{0}\right\rangle$$
(2)

where σ0 is the identity matrix, \({\mathcal{L}}(x)=U(x)\) in the quantum feature map block and \({\mathcal{L}}(\theta )=W(\theta )\) in the optimization ansatz. Such an evolution operator can be realized within the framework of a linear combination of unitaries with one ancillary qubit, and the target quantum states are obtained by post-processing54,55. Specifically, we first apply a Hadamard gate to the ancillary system, followed by a controlled-\({\mathcal{L}}(\lozenge )\) operator. After adding another Hadamard gate, we measure the ancillary qubit with results ma = 0/1 corresponding to quantum states \(\left\vert 0\right\rangle /\left\vert 1\right\rangle\). The evolution results under the residual operators are then obtained in the \(\left\vert 0\right\rangle \left\langle 0\right\vert\) subspace. The introduction of an auxiliary qubit provides an additional channel that allows the unevolved quantum state to pass along and be added to the evolved quantum state.
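
This construction can be checked with a short linear-algebra sketch; the following is our own illustration of the one-ancilla scheme, assuming an arbitrary single-qubit unitary \({\mathcal{L}}\) and an ancilla-first qubit ordering.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)

def residual_state(L, phi0):
    # ancilla (first) and system (second) start in |0> and |phi0>
    state = np.kron([1, 0], phi0).astype(complex)
    ctrl_L = np.block([[I2, np.zeros((2, 2))],
                       [np.zeros((2, 2)), L]])      # controlled-L on the system
    state = np.kron(H, I2) @ state                  # first Hadamard
    state = ctrl_L @ state
    state = np.kron(H, I2) @ state                  # second Hadamard
    return state[:2]                                # ancilla-|0> block (post-selection)

L = np.array([[0, 1], [1, 0]], dtype=complex)       # an illustrative evolution
phi0 = np.array([1, 0], dtype=complex)
print(residual_state(L, phi0))                      # equals (I + L)|phi0> / 2
print((np.eye(2) + L) @ phi0 / 2)
```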

More generally, the weights of the summation can also be adjusted by replacing the first Hadamard gate on the ancillary qubit with an Ry(2α) rotation with trainable angle α. The corresponding residual operator is then generalized to a single optimization-angle residual operator

$${{{{{{{{\mathcal{R}}}}}}}}}_{1}(\lozenge)=\frac{\cos \alpha {\sigma }_{0}^{\otimes n}+{(-1)}^{{m}_{a}}\sin \alpha \cdot {{{{{{{\mathcal{L}}}}}}}}(\lozenge)}{\sqrt{2}}$$
(3)

Such a construction does not require a post-selection process, but rather reconstructs the target operator from the measurement results. It reduces to \({\mathcal{R}}(\lozenge )\) with α = π/4 and ma = 0. Similarly, a two optimization-angle residual operator \({{\mathcal{R}}}_{2}(\lozenge )\) can be constructed by replacing both Hadamard gates with parameterized rotation gates, as detailed in the Methods section. In principle, the introduction of more trainable parameters in these two generalized residual operators provides additional degrees of freedom for optimization, which can further increase the expressivity of the parameterized quantum circuits.
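
Equation (3) can be verified numerically by swapping the first Hadamard of the sketch above for Ry(2α), which reproduces \({{\mathcal{R}}}_{1}(\lozenge )\) in each measurement branch of the ancilla; this is again our own check with an illustrative choice of \({\mathcal{L}}\).

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def Ry(a):
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

alpha = 0.7
L = np.array([[0, 1], [1, 0]], dtype=complex)
phi0 = np.array([1, 0], dtype=complex)

state = np.kron(Ry(2 * alpha) @ np.array([1, 0]), phi0)       # Ry(2a) on ancilla
state = np.block([[I2, np.zeros((2, 2))],
                  [np.zeros((2, 2)), L]]) @ state              # controlled-L
state = np.kron(H, I2) @ state                                 # second Hadamard
for ma in (0, 1):                                              # both outcomes
    R1 = (np.cos(alpha) * I2 + (-1)**ma * np.sin(alpha) * L) / np.sqrt(2)
    print(np.allclose(state[2 * ma:2 * ma + 2], R1 @ phi0))    # True, True
```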

Therefore, we can conclude that a general residual connection in quantum neural networks can be realized within a complete quantum circuit framework. It is also worth noting that in some special network structures such as the QCNN27, by reusing discarded qubits, we can simulate the residual connections without additional qubits. Moreover, since the expressivity of quantum models is fundamentally limited by the data-encoding strategy, we will prove below that residual connections applied to the data-encoding blocks, regardless of the ansatz used, lead to richer spectra in the Fourier series of the quantum model output, resulting in an expressivity enhancement.

Frequency spectra enhancement

It has been pointed out that the output of a parameterized quantum circuit can be expressed as a finite-term Fourier series of the input features50

$$f(x,\theta )={\sum}_{\omega \in \Omega }{c}_{\omega }(\theta ,O){e}^{i\omega x}$$
(4)

where the frequencies ω of the spectrum \(\Omega =\{{w}_{k}-{w}_{j}\,| \,j,k\in [d]\}\) depend on the d-dimensional generator of the one-layer data-encoding gate \(U(x)={e}^{iHx}\) with eigenequations \(H\left\vert {h}_{j}\right\rangle ={w}_{j}\left\vert {h}_{j}\right\rangle\) for j ∈ [d], with the notation [d] ≔ {1, 2, ⋯  , d}. This means that the accessible frequencies of the quantum model are constructed from the differences between the generator eigenvalues. For example, a frequently used generator is the Pauli matrix H = σ/2 with two eigenvalues w1,2 = ± 1/2, where σ ∈ {σx, σy, σz}; such a one-layer data-encoding block produces the frequency spectrum Ω = {0, ± 1}. Moreover, the expansion coefficients cω(θ, O) are associated with the entire structure of the quantum circuit, including the trainable parameters θ and the observable O.
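
The spectrum is easy to enumerate programmatically; the two-line sketch below computes Ω for the Pauli generator named above.

```python
eigs = [0.5, -0.5]                                        # eigenvalues of H = sigma / 2
print(sorted({wk - wj for wj in eigs for wk in eigs}))    # [-1.0, 0.0, 1.0]
```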

However, for a data-encoding block with residual connection, more frequency components can be involved, realizing an improvement in the circuit approximation ability. Assuming that the initial quantum state \(\left\vert {\phi }_{0}\right\rangle\) of the residual encoding block is related to the optimization parameters θ, the residual outputs can be expressed as

$${f}_{R}(x,\theta )=\left\langle {\phi }_{0}\right\vert {{\mathcal{R}}}^{{\dagger} }(x)O{\mathcal{R}}(x)\left\vert {\phi }_{0}\right\rangle =\frac{1}{4}\left(\left\langle {\phi }_{0}\right\vert {U}^{{\dagger} }(x)OU(x)\left\vert {\phi }_{0}\right\rangle +\left\langle {\phi }_{0}\right\vert O\left\vert {\phi }_{0}\right\rangle +2\,{\rm{Re}}\left(\left\langle {\phi }_{0}\right\vert OU(x)\left\vert {\phi }_{0}\right\rangle \right)\right)$$
(5)

It is clear that the first term produces the same frequency components as the traditional encoding scheme, whereas the second term corresponds to the zero-frequency component, independent of the input feature x. So the key lies in the third term. Because the eigenstates \(\vert {h}_{j}\rangle\) of the generator Hamiltonian form a complete basis, we can expand the initial quantum state \(\left\vert {\phi }_{0}\right\rangle\) and the observable O as \(\vert {\phi }_{0}\rangle ={\sum }_{k}{\phi }_{k}\vert {h}_{k}\rangle\) and \(O={\sum }_{j,k}{o}_{jk}\vert {h}_{j}\rangle \langle {h}_{k}\vert\). Using \(U(x)\vert {h}_{j}\rangle ={e}^{i{w}_{j}x}\vert {h}_{j}\rangle\), we obtain

$$\left\langle {\phi }_{0}\right\vert OU(x)\left\vert {\phi }_{0}\right\rangle =\mathop{\sum}_{j,k}{\phi }_{j}^{* }\,{o}_{jk}\,{\phi }_{k}\left\langle {h}_{k}\right\vert U(x)\left\vert {h}_{k}\right\rangle =\mathop{\sum}_{j,k}({\phi }_{j}^{* }{o}_{jk}{\phi }_{k})\,{e}^{i{w}_{k}x}$$
(6)

This term produces new frequency components for the quantum model, namely the generator eigenfrequencies ± wk for k ∈ [d] themselves, rather than the differences between them. Therefore, the new spectrum of the one-layer data-encoding block with a residual connection is

$${\Omega }_{l = 1}^{R}=\left\{{w}_{k}-{w}_{j},\pm {w}_{k}| j,k\in [d]\right\}$$
(7)

which indicates that the frequency-generation forms of quantum neural networks with residual encoding are more diverse, and the resulting Fourier spectrum in general can also be more abundant. In this case, the toy model exemplified above produces the new spectrum {0, ± 1/2, ± 1}, which includes more frequency components and leads to an enhanced approximation ability for the parameterized quantum circuits.
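
Extending the enumeration above to equation (7), the residual encoding adds the eigenfrequencies ± wk to the usual differences:

```python
eigs = [0.5, -0.5]                                  # Pauli generator sigma / 2
diffs = {wk - wj for wj in eigs for wk in eigs}
plus_minus = {s * wk for wk in eigs for s in (1, -1)}
print(sorted(diffs | plus_minus))                   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```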

A natural issue to be addressed is when the residual encoding strategy behaves better than the traditional method. For a one-layer data-encoding block in quantum neural networks, the condition is that there exists a frequency component wk ∉ Ω for some k ∈ [d], which implies

$$\exists \,k\in [d]:\quad | {w}_{j}-{w}_{l}| \ne | {w}_{k}| \quad \forall \,j,l\in [d]$$
(8)

Such a constraint can be satisfied in many practical cases because we usually use Pauli operators as the generator Hamiltonian.

Furthermore, for a data-encoding strategy repeated l times, either in sequence or in parallel, the traditional scheme leads to the frequency spectrum \({\Omega }_{l}=\{({w}_{{j}_{1}}+\cdots +{w}_{{j}_{l}})-({w}_{{k}_{1}}+\cdots +{w}_{{k}_{l}})\,| \,{j}_{1},\cdots \,,{j}_{l},{k}_{1},\cdots \,,{k}_{l}\in [d]\}\), which has only one frequency combination form, namely the difference between the sums of two sets of l frequencies50. However, for the residual encoding, there are more ways to construct the spectrum, and the combination forms of frequencies become more complex and diversified. Specifically, the frequency spectrum of a two-layer residual encoding is

$${\Omega }_{l=2}^{R}=\left\{({w}_{{j}_{1}}+{w}_{{j}_{2}})-({w}_{{k}_{1}}+{w}_{{k}_{2}}),\ \pm ({w}_{{j}_{1}}+{w}_{{j}_{2}}-{w}_{{k}_{1}}),\ \pm ({w}_{{j}_{1}}+{w}_{{j}_{2}}),\ {w}_{{j}_{1}}-{w}_{{k}_{1}}\,\big| \,{j}_{1},{j}_{2},{k}_{1},{k}_{2}\in [d]\right\}$$
(9)

which contains four kinds of frequency combination forms. More frequency-generation forms in general result in a larger upper limit on the spectrum size. We can show by induction that for an l-layer residual encoding scheme, the number of frequency combination forms is

$${{{{{{{\mathcal{N}}}}}}}}({\Omega }_{l}^{R})=(\lceil l/2\rceil +1)(\lfloor l/2\rfloor +1)\propto {{{{{{{\mathcal{O}}}}}}}}({l}^{2})$$
(10)

where ⌈ ⋅ ⌉ and ⌊ ⋅ ⌋ denote the ceiling and floor functions. This is a squared improvement over the traditional scheme, and the details are shown in the Methods section.
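
Equation (10) is straightforward to transcribe; a small helper, assuming l ≥ 1:

```python
from math import ceil, floor

def n_forms(l):
    # number of frequency combination forms for l-layer residual encoding
    return (ceil(l / 2) + 1) * (floor(l / 2) + 1)

print([n_forms(l) for l in range(1, 7)])   # [2, 4, 6, 9, 12, 16]
```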

In addition to enlarging the accessible frequency spectrum, residual encoding can also improve the flexibility of the corresponding Fourier coefficients; both properties determine the expressivity of a quantum model. The enhancement comes from two aspects: one is the introduction of additional optimization degrees of freedom in the generalized residual operators \({{\mathcal{R}}}_{1,2}(x)\), and the other is the more diverse construction methods of frequencies and the corresponding recombination of Fourier coefficients, meaning that a single frequency component can be generated from the recombination of different terms in the residual outputs. The latter is the reason why the residual operator \({\mathcal{R}}(x)\) can outperform the traditional encoding strategy in expanding the Fourier coefficient space without introducing additional optimization parameters. Furthermore, the frequency spectrum amplification in quantum residual models may be understood from the perspective that classical residual networks behave like ensembles of relatively shallow networks56. That is to say, the quantum residual connection channels can equivalently implement ensembles of small quantum models with different frequencies, leading to a richer spectrum and stronger expressivity. We show the expressivity improvement in detail in the numerical simulation section.

Measurement scheme

To get the expectation values of an observable O for the quantum state \({{{{{{{\mathcal{R}}}}}}}}(x)\left\vert {\phi }_{0}\right\rangle\), which is embedded in the \(\left\vert 0\right\rangle \left\langle 0\right\vert\) subspace of the ancillary qubit, we can introduce another observation operator \(\bar{O}=\left\vert 0\right\rangle \left\langle 0\right\vert \otimes O\) on the system. Then the output observation values can be expressed as

$${\bar{f}}_{R}(x,\theta )=\left\langle {\phi }_{f}\right\vert \bar{O}\left\vert {\phi }_{f}\right\rangle =\left\langle 0\right\vert \left\langle {\phi }_{0}\right\vert {{\mathcal{R}}}^{{\dagger} }(x)\left(\left\vert 0\right\rangle \left\langle 0\right\vert \otimes O\right)\left\vert 0\right\rangle {\mathcal{R}}(x)\left\vert {\phi }_{0}\right\rangle ={f}_{R}(x,\theta )$$
(11)

where \(\left\vert {\phi }_{f}\right\rangle =\left\vert 0\right\rangle {\mathcal{R}}(x)\left\vert {\phi }_{0}\right\rangle +\left\vert \perp \right\rangle\) is the output quantum state of the whole system, and the second term \(\left\vert \perp \right\rangle\) is orthogonal to the first. Furthermore, expanding the measurement operator as \(\bar{O}=({\sigma }_{0}+{\sigma }_{z})/2\otimes O\), we also have

$${\bar{f}}_{R}(x,\theta )=\frac{1}{2}\left(\langle {\sigma }_{0}\otimes O\rangle +\langle {\sigma }_{z}\otimes O\rangle \right)$$
(12)

This indicates that we can obtain the residual output fR(x, θ) by measuring the average expectation of the system output state \(\vert {\phi }_{f}\rangle\) with the two observables {σ0 ⊗ O, σz ⊗ O}, which is experimentally feasible and introduces little resource overhead. For an l-layer residual encoding, we need at most l ancillary qubits, and the corresponding observables are \(\{{({\sigma }_{0}+{\sigma }_{z})}^{\otimes l}\otimes O\}\), whose number of terms grows exponentially with the number of residual-encoding layers. This exponential dependence is intrinsically related to the attenuation of the success probability in quantum algorithms with post-selection. Specifically, suppose that the output state with one residual connection on qubit i is \(\vert {\phi }_{f}^{(i)}\rangle =\vert 0\rangle {\mathcal{R}}(\lozenge )\vert {\phi }_{0}^{(i)}\rangle +\vert \perp \rangle\); then the probability of measuring the ancillary qubit in the \(\left\vert 0\right\rangle\) state is \({P}_{0}^{(i)}=| | {\mathcal{R}}(\lozenge )\vert {\phi }_{0}^{(i)}\rangle | {| }^{2}\), where ∣∣x∣∣ represents the norm of the vector x. The success probability of a quantum algorithm with l residual connection blocks is therefore \({P}_{s}={\prod }_{i=1}^{l}{P}_{0}^{(i)}\), which decays exponentially with l.
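
The identity in equation (12) can be verified with the ancilla construction sketched earlier; the following is our own numeric check for one ancilla plus one system qubit, with an illustrative \({\mathcal{L}}\) and O = σz.

```python
import numpy as np

I2 = np.eye(2)
sz = np.diag([1.0, -1.0]).astype(complex)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
L = np.array([[0, 1], [1, 0]], dtype=complex)        # illustrative system unitary
phi0 = np.array([1, 0], dtype=complex)

# full two-qubit output state |phi_f> of the ancilla construction
ctrl_L = np.block([[I2, np.zeros((2, 2))], [np.zeros((2, 2)), L]])
phi_f = np.kron(H, I2) @ ctrl_L @ np.kron(H, I2) @ np.kron([1, 0], phi0)

def expect(state, Op):
    return np.real(state.conj() @ Op @ state)

# average of the two observables {sigma_0 x O, sigma_z x O}, equation (12)
fbar = 0.5 * (expect(phi_f, np.kron(I2, sz)) + expect(phi_f, np.kron(sz, sz)))
R_phi = (I2 + L) @ phi0 / 2                          # unnormalized residual state
print(np.isclose(fbar, np.real(R_phi.conj() @ sz @ R_phi)))   # True
```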

In practice, we do not need to use residual feature maps in every block; inserting residual connections into some selected data-encoding blocks can already give the networks better expressivity. In addition, the measurement schemes show that our algorithm is compatible with the existing methods for calculating the gradient of the expectation value of a quantum circuit with respect to the optimization parameters57,58,59. Using the parameter-shift rule57, the gradient of the residual output with respect to a parameter θj can be calculated as

$$\frac{\partial {f}_{R}(x,\theta )}{\partial {\theta }_{j}}=\frac{1}{2}\left[{f}_{R}\left(x,{\theta }_{j}+\frac{\pi }{2}\right)-{f}_{R}\left(x,{\theta }_{j}-\frac{\pi }{2}\right)\right]$$
(13)

where fR(x, θj ± π/2) are the expectation values when the target parameter θj is shifted by ± π/2 respectively.
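A generic sketch of equation (13) follows, assuming f_R is any circuit-expectation function of (x, θ) generated by Pauli rotations so that the π/2 shift rule applies; the function name is ours.

```python
import numpy as np

def parameter_shift_grad(f_R, x, theta, j):
    # d f_R / d theta_j via equation (13); theta is a NumPy parameter vector
    shift = np.zeros_like(theta)
    shift[j] = np.pi / 2
    return 0.5 * (f_R(x, theta + shift) - f_R(x, theta - shift))
```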

Furthermore, it should be mentioned that the approximation improvement can be understood from the universal approximation property with polynomial basis functions60, which states that a linear combination of different observables can approximate any continuous function. Based on the above analysis of quantum models with the specific residual encoding structures, we can see that such a combination of measurement results actually leads to a frequency-richness improvement in the Fourier series, which enhances the expressivity of quantum neural networks. Therefore, our work can serve as a specific case bridging polynomial approximation60 and Fourier series approximation50, two perspectives for understanding the universal approximation property of quantum machine learning models.

Numerical demonstration

To demonstrate the improvement of the Fourier frequency spectrum by residual connections, we present a proof-of-principle numerical simulation with Pennylane61, which solves regression tasks of fitting quantum models to target Fourier series. We adopt the traditional qubit encoding strategy to map classical data x into a quantum state with a single-qubit Pauli-rotation operator \(U(x)={R}_{y}(x)={e}^{-ix{\sigma }_{y}/2}\), where the generator Hamiltonian G = − σy/2 has two eigenvalues e1,2 = ± 1/2. The optimization ansatz has two arbitrary single-qubit rotation gates \(U({\theta }_{i})={R}_{z}({\theta }_{i}^{1}){R}_{y}({\theta }_{i}^{2}){R}_{z}({\theta }_{i}^{3})\) for i = 1, 2, placed before and after the data-encoding block, resulting in the quantum model Uθ(x) = U(θ2)U(x)U(θ1). The observable is σz, so the output is \(f(x,\theta )=\left\langle 0\right\vert {U}_{\theta }^{{\dagger} }(x){\sigma }_{z}{U}_{\theta }(x)\left\vert 0\right\rangle\). The quantum models are trained in a supervised learning framework to search for the optimal parameters θ*, which minimize the mean squared error (MSE)

$$\Delta (\theta )=\frac{1}{2D}\mathop{\sum }_{i=1}^{D}{(y({x}_{i})-f({x}_{i},\theta ))}^{2}$$
(14)

where D is the size of the dataset and y( ⋅ ) is the target function. We use the Adam optimizer with at most 200 steps, a learning rate of 0.3 and a batch size of 0.7D in the simulation. A termination condition for optimization convergence, namely that the variance of ten consecutive loss values falls below 10−8, is also used.
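
A sketch of this training setup, assuming model(x, theta) is a circuit-output function such as the model_output defined earlier; the Adam update itself is left to the optimizer of your choice.

```python
import numpy as np

def mse_loss(theta, xs, ys, model):
    # equation (14): Delta(theta) = (1 / 2D) * sum_i (y_i - f(x_i, theta))^2
    preds = np.array([model(x, theta) for x in xs])
    return 0.5 * np.mean((ys - preds) ** 2)

def converged(loss_history, window=10, tol=1e-8):
    # termination rule from the text: variance of ten consecutive losses < 1e-8
    return len(loss_history) >= window and np.var(loss_history[-window:]) < tol
```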

As shown in Fig. 2, this quantum model can learn functions of the form \({y}_{1}(x)={\sum }_{{\omega }_{i}\in {\Omega }_{1}}(a{e}^{i{\omega }_{i}x}+{a}^{* }{e}^{-i{\omega }_{i}x})\) with an MSE value Δ = 6.0 × 10−5, where a is an amplitude parameter and the frequency spectrum is Ω1 = {ω0 = 0, ω1 = 2∣e1,2∣ = 1}, consistent with the results in ref. 50. However, a multi-frequency function with spectrum Ω2 = {ω0 = 0, ω1 = 1, ω2 = 0.5} cannot be well fitted (error Δ = 5.1 × 10−2), due to the missing frequencies of the parameterized quantum circuit caused by the data-encoding strategy. The frequency mismatch can be mitigated by inserting residual connections into the data-encoding block, with an output MSE value Δ = 5.1 × 10−5, because the resulting residual operator \({\mathcal{R}}(x)\) brings richer frequency components that enhance the circuit expressivity. It is worth noting that the residual data-encoding scheme still works well for the spectrum Ω1 besides Ω2, and the optimization process converges quickly.

Fig. 2: Fitting results of quantum models.

The fitting results of quantum models to the target function y1(x) with frequency spectra Ω1 = {0, 1} (a, b) and Ω2 = {0, 1, 0.5} (c, d). The panels (a, c) show the theoretical function values (black dashed lines), and the quantum model outputs with traditional (gray) and residual (red) encoding strategies, respectively. The panels (b, d) show the mean squared error (MSE) during the training processes.

Furthermore, we turn to a more general case of fitting the function \({y}_{2}(x)={\sum }_{{\omega }_{i}\in {\Omega }_{2}}({a}_{{\omega }_{i}}{e}^{i{\omega }_{i}x}+{a}_{{\omega }_{i}}^{* }{e}^{-i{\omega }_{i}x})\), where the amplitudes can differ for each frequency component. Additional degrees of freedom can be obtained from the multiple combination methods of single-frequency components in the residual outputs and from the parameterized gates on the auxiliary qubit in the generalized residual operators \({{\mathcal{R}}}_{1,2}(x)\) in equations (3) and (17). We can conclude from the numerical results in Fig. 3 that the traditional encoding scheme still cannot fit the target function (MSE value Δ = 0.09), while the residual feature map with the \({\mathcal{R}}(x)\) operator works better, with error Δ = 2.1 × 10−3. When we use the generalized residual operators, the fitting results are further improved, converging to smaller MSE values, Δ = 1.1 × 10−4 for \({{\mathcal{R}}}_{1}(x)\) and Δ = 1.7 × 10−4 for \({{\mathcal{R}}}_{2}(x)\), in fewer optimization steps (77 for \({{\mathcal{R}}}_{1}(x)\) and 55 for \({{\mathcal{R}}}_{2}(x)\)). Moreover, the extra combination forms and trainable parameterized quantum gates bring more flexibility to the fitting, expanding the Fourier coefficient space. As shown in Fig. 4, we sample the quantum models 1000 times with the different feature maps and obtain the distributions of the Fourier coefficients. We can see that, under the same ansatz, the residual feature map with the \({{\mathcal{R}}}_{2}(x)\) operator has the widest Fourier coefficient distribution, and all three residual encodings are better than the traditional encoding scheme.

Fig. 3: Fitting results of quantum models.

a The fitting results of quantum models to the target function y2(x) with traditional encoding scheme (gray) and residual feature map with the \({{{{{{{\mathcal{R}}}}}}}}(x)\) (red), \({{{{{{{{\mathcal{R}}}}}}}}}_{1}(x)\) (green) and \({{{{{{{{\mathcal{R}}}}}}}}}_{2}(x)\) (blue) operators, respectively. b The mean squared error (MSE) during the training processes.

Fig. 4: Fourier coefficients and quantum models.

a–c The real and imaginary parts of the Fourier coefficients \({a}_{{\omega }_{i}}\) with ωi ∈ {0, 1, 0.5} sampled from 1000 random quantum models. d Quantum models with one-layer data-encoding structure. The quantum models share the same ansatz but vary the data-encoding strategies by traditional encoding (gray), residual feature map with the \({{{{{{{\mathcal{R}}}}}}}}(x)\) (red), \({{{{{{{{\mathcal{R}}}}}}}}}_{1}(x)\) (green) and \({{{{{{{{\mathcal{R}}}}}}}}}_{2}(x)\) (blue) operators. The distribution of coefficients widens from gray to red to green to blue.

In addition, this enhancement can be quantitatively measured by a commonly used expressibility metric62. We first randomly generate many pairs of parameters Θ1 and Θ2, and calculate the distribution (PF) of the state fidelities \(F=| \left\langle 0\right\vert {U}_{{\Theta }_{1}}^{{\dagger} }(x){U}_{{\Theta }_{2}}(x)\left\vert 0\right\rangle {| }^{2}\), which measure the overlap of the quantum states generated by the quantum models. Then the Kullback-Leibler (KL) divergence63 is used to quantify the circuit expressivity by comparing the sampled fidelity distribution with that of the Haar-distributed state ensemble (PHaar) as

$${D}_{KL}({P}_{F}| | {P}_{{{{{{{{\rm{Haar}}}}}}}}})=\mathop{\sum}_{j}{P}_{F}(j)\log \frac{{P}_{F}(j)}{{P}_{{{{{{{{\rm{Haar}}}}}}}}}(j)}$$
(15)

where the analytical form of the fidelity distribution for the ensemble of Haar random states is \({P}_{{\rm{Haar}}}(F)=(N-1){(1-F)}^{N-2}\) and N is the dimension of the Hilbert space64. A smaller KL divergence corresponds to a more favorable expressibility. We sample each quantum model in Fig. 4 1000 times and use 45 histogram bins to estimate the fidelity distribution, which is then compared with the sampled fidelity ensemble of the Haar random states. The computed KL divergences are \({D}_{KL}^{{\rm{trad}}}=0.0634,{D}_{KL}^{{\mathcal{R}}(x)}=0.0581,{D}_{KL}^{{{\mathcal{R}}}_{1}(x)}=0.0446\) and \({D}_{KL}^{{{\mathcal{R}}}_{2}(x)}=0.0429\), respectively. We can see that the residual operators indeed increase the circuit expressivity relative to the traditional encoding scheme, because they all introduce richer frequency components into the quantum models. However, it is worth mentioning that although the three residual models have the same frequency spectrum, the additional reasons for the expressivity enhancement differ somewhat between the \({\mathcal{R}}(x)\) and \({{\mathcal{R}}}_{1,2}(x)\) operators. The former is due to the diverse construction methods of frequencies in the residual outputs, while the latter also benefits from the additional optimization parameters. We prove in the Methods section that the generalized residual outputs can be seen as weighted versions of the residual outputs with trainable weights. Moreover, it is known that constructing frequencies only from the difference between the sums of the generator's eigenvalues limits the access to higher-order components, resulting in a reduction in coefficient variance50. Therefore, the residual encoding method, which offers more ways to construct frequencies, can broaden the distribution of Fourier coefficients, suggesting an enhanced expressivity of quantum models with residual connections.
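
A sketch of this expressibility estimate: sampled fidelities are binned into a 45-bin histogram and compared against the Haar prediction via equation (15); the fidelity-sampling routine itself depends on the model and is omitted here.

```python
import numpy as np

def kl_expressibility(fidelities, N, bins=45):
    # estimate D_KL(P_F || P_Haar) from sampled fidelities, equation (15)
    edges = np.linspace(0.0, 1.0, bins + 1)
    P_F, _ = np.histogram(fidelities, bins=edges)
    P_F = P_F / P_F.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    P_haar = (N - 1) * (1 - centers) ** (N - 2)     # Haar fidelity density
    P_haar = P_haar / P_haar.sum()                  # discretized to the same bins
    mask = P_F > 0                                  # skip empty bins (0 log 0 = 0)
    return float(np.sum(P_F[mask] * np.log(P_F[mask] / P_haar[mask])))
```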

Moreover, similar to the traditional encoding, we can extend the accessible frequency spectrum by repeating the residual encoding block multiple times, in sequence or in parallel. To investigate the frequency extension by sequential and parallel repetitions of data-encoding, we fit the aforementioned target function y2(x) with a more complex spectrum Ω3 = {ω0 = 0, ω1 = 1, ω2 = 0.5, ω3 = 1.5, ω4 = 2} and amplitudes a0 = 0.1 and a1.5,2 = 5a1,0.5 = 0.15 + 0.15i. Two layers of repeated structures are used: the traditional encoding in sequence, and the residual encoding with \({{\mathcal{R}}}_{2}(x)\) operators in sequence and in parallel, as shown in Fig. 5. The single-qubit observable is O = σz for all cases. All quantum models were trained for at most 200 steps using the Adam optimizer with batch size 16. We can see that both the sequential and parallel repetitions of residual encoding extend the Fourier spectrum and fit the target function well. The MSE values and optimization steps are Δ = 3.3 × 10−4 and 159 steps for the sequential repetition, and Δ = 4.2 × 10−4 and 115 steps for the parallel repetition. It should be clarified that the mixed use of residual and traditional encoding also brings an enhanced expressivity. Therefore, replacing some, but not all, of the encoding blocks in complex quantum models with residual blocks can enrich the expressivity of the whole network.

Fig. 5: Fitting results and quantum models.

a The fitting results of quantum models with two-layer data-encoding for target function y2(x) with frequency spectra Ω3. b The mean squared error (MSE) during the training processes. c Quantum models with two-layer data-encoding structure. The residual operator \({{{{{{{{\mathcal{R}}}}}}}}}_{2}(x)\) is repeated in sequence and in parallel, and the output is the measurement value 〈σz〉 on a qubit.

Application in image classification

In this part, we discuss the performance of the QCNN algorithm with residual encoding for image classification using the real-world MNIST dataset. MNIST includes 60000 (10000) images in the train (test) set across 10 classes of handwritten digits, and each image is 28 × 28 pixels. Here we focus on binary classification with the selected classes 0 and 1, for which the train and test subsets used contain 12665 and 2115 images, respectively. Constrained by current quantum hardware, high-dimensional data usually require classical pre-processing techniques for dimensionality reduction, and we adopt principal component analysis (PCA) to match the input data to the four-qubit data-encoding layer65. For comparison, we use qubit encoding and consider the case where no residual connection is added and the cases where the residual operator \({{\mathcal{R}}}_{2}(x)\) is applied to the i-th qubit, denoted as the traditional and residual-Qi schemes, respectively.

The ansatz of the QCNN algorithm is composed of a series of alternating convolutional and pooling layers27, as shown in Fig. 6. Each convolutional layer includes several single- and two-qubit parameterized quantum gates, keeping a translationally invariant structure. We use Ising interactions between adjacent qubits with one parameter, \(ZZ(\phi )={e}^{-i{\sigma }_{z}\otimes {\sigma }_{z}\phi /2}\), and single-qubit U3 gates with three parameters as

$${U}_{3}(\theta ,\phi ,\delta )=\left[\begin{array}{cc}\cos (\theta /2)&-{e}^{i\delta }\sin (\theta /2)\\ {e}^{i\phi }\sin (\theta /2)&{e}^{i(\phi +\delta )}\cos (\theta /2)\end{array}\right]$$
(16)

The pooling layer is implemented by a parameterized controlled-U3 gate, after which one qubit is traced out, reducing the quantum state from two qubits to a single qubit. We measure the expectation value \({\langle {\sigma }_{z}\rangle }_{i}\) on the output qubit for the i-th input data with label yi = 0/1. The cost function is \(C(\theta )={\sum }_{i=1}^{D}{(| \langle {\sigma }_{z}\rangle {| }_{i}-{y}_{i})}^{2}/2D\) for a dataset of size D, and it is optimized by the Adam optimizer with a learning rate of 0.2. The number of iterations in the training process is 100, and the process is repeated 20 times with random initialization of the optimization parameters to obtain mean values. Once the cost function converges and the optimal parameters \({\theta }^{* }=\arg {\min }_{\theta }C(\theta )\) are obtained, the measurement outputs can be converted into binary values c0/1 via a boundary precision \(\epsilon \in \left(0,0.5\right]\). The classification result is c0/1 = 1 for ∣〈σz〉∣ > 1 − ϵ and c0/1 = 0 for ∣〈σz〉∣ < ϵ, while other values are marked as unclassifiable optimization results. A smaller value of ϵ represents higher optimization accuracy and a higher classification standard.
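
For concreteness, here are NumPy sketches of the two circuit primitives and the boundary-precision rule described above; the function names are ours.

```python
import numpy as np

def ZZ(phi):
    # Ising gate exp(-i sigma_z x sigma_z phi / 2); diagonal in the Z basis
    signs = np.array([1, -1, -1, 1])                 # eigenvalues of sz x sz
    return np.diag(np.exp(-1j * phi / 2 * signs))

def U3(theta, phi, delta):
    # single-qubit gate of equation (16)
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -np.exp(1j * delta) * s],
                     [np.exp(1j * phi) * s, np.exp(1j * (phi + delta)) * c]])

def classify(expval, eps=0.1):
    # boundary-precision rule: near 1 -> class 1, near 0 -> class 0, else unclassifiable
    a = abs(expval)
    return 1 if a > 1 - eps else (0 if a < eps else None)
```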

Fig. 6: A schematic of quantum convolutional neural networks (QCNN) with residual encoding for image classification.

The handwritten digits are encoded as quantum states via quantum feature map, where the green blocks represent qubit encoding schemes and the red blocks are residual encoding with \({{{{{{{{\mathcal{R}}}}}}}}}_{2}({x}_{i})\) operators on the i-th qubit. The multiple convolutional (C) and pooling (P) layers use quantum gates with trainable parameters θ, and the detailed structures are shown below. The measurement outcome of the quantum circuit 〈σz〉 is used to calculate the cost function C(θ) and characterize the binary classification results c0/1. The classical computer updates the optimization parameters of QCNN algorithm based on gradients until the cost function converges.

The optimization results for the cost function and accuracy are shown in Fig. 7 and Table 1. We set ϵ = 0.1 in the simulation, and there are 20 free parameters in the ansatz. We can conclude that the residual encoding schemes reach smaller convergence values of the loss than the traditional encoding method, which means that the models have better approximation ability. Such an enhancement can lead to better expressivity and higher accuracy for quantum models in complex learning tasks. In addition, the residual encoding produces a high classification accuracy, reaching 92.85% and 92.47% on average for the train and test datasets respectively, which is about 7.74% and 7.57% higher than with the traditional encoding strategy. Further, we provide more numerical simulations of larger QCNN models with up to 12 qubits in Fig. 8. We can see that as the number of qubits increases, less dimensionality reduction of the input images is required, and more information can be fed into the quantum networks. The convergence value of the loss function gradually decreases, and the learning accuracy gradually improves. The average classification accuracy on the train and test datasets with the residual data-encoding algorithm improves to about 97.66% in the largest quantum learning model.

Fig. 7: Evolution of cost function and accuracy.

The performance of quantum convolutional neural networks with different data-encoding strategies for image classification. Simulations with the traditional scheme and with residual encoding on qubits Q0 and Q2 on the train and test datasets are shown. Panel a shows the evolution of the cost function with optimization steps, and panel b shows the corresponding accuracy.

Table 1 Average accuracy
Fig. 8: Results of larger quantum learning models with residual encoding.

Panel a shows the evolution of the cost function with optimization steps when the residual connection is applied to qubit Q0 in the data-encoding block, while panel b shows the average classification accuracy from ten repetitions for different qubit numbers.

Conclusion

In summary, we have proposed a complete quantum circuit-based architecture for the implementation of quantum residual neural networks, dubbed QResNets. The classical residual connection channel is quantized by adding an auxiliary qubit to the data-encoding and trainable blocks, and is then generalized with additional parameterized gates. We further prove mathematically that the Fourier spectrum of the quantum model output is enriched when the residual connections are applied to the data-encoding blocks. There is a squared improvement in the number of frequency generation forms of residual encoding over the traditional schemes: the l-layer residual encoding strategy produces \({\mathcal{O}}({l}^{2})\) frequency combination methods, rather than only the difference between sums of generator eigenvalues as in traditional methods. Moreover, the diverse spectrum construction methods in the residual outputs and the additional optimization degrees of freedom in the generalized residual operators make the Fourier coefficients more flexible, favoring access to higher-order components. This indicates that the residual encoding can enrich the spectrum and broaden the Fourier coefficient distribution, that is, it can enhance the expressivity of various parameterized quantum circuits. Numerical simulations of fitting Fourier-series functions and a demonstration of binary classification of handwritten-digit images from the MNIST dataset are conducted to show the algorithm's performance. Compared with the traditional encoding, the accuracy of residual encoding is improved by about seven percent. Our work advances the design of quantum neural networks with specific structures, enables a fully quantum realization of classical residual connections, and also provides a quantum feature map strategy.

Methods

Generalized residual operators

We have discussed the form of the residual operator \({\mathcal{R}}(\lozenge )\) and its corresponding residual output fR(x, θ) above. In this part, we give a detailed introduction to the generalized residual operators \({{\mathcal{R}}}_{1,2}(\lozenge )\) and the corresponding generalized residual outputs \({f}_{{R}_{1,2}}(x,\theta )\), which exhibit stronger expressivity. Going beyond equation (3), where one Hadamard gate is replaced by a parameterized gate, we now assume that both Hadamard gates on the ancillary qubit are replaced by gates Ry(2α) and Ry(2γ) with trainable angles α and γ; the \({{\mathcal{R}}}_{2}(\lozenge )\) operator can then be expressed as

$${{{{{{{{\mathcal{R}}}}}}}}}_{2}(\lozenge)=\cos \alpha \cos \eta {\sigma }_{0}^{\otimes n}+\sin \alpha \sin \eta \cdot {{{{{{{\mathcal{L}}}}}}}}(\lozenge)$$
(17)

with a relabeled angle η = πma/2 − γ. The residual operator \({{{{{{{{\mathcal{R}}}}}}}}}_{1}(\lozenge)\) can be seen as a special case with γ = − π/4 ignoring a global phase factor. When the generalized residual operator \({{{{{{{{\mathcal{R}}}}}}}}}_{1,2}(x)\) is used in the data-encoding block, the residual output is

$${f}_{{R}_{1,2}}(x,\theta ) = \, \left\langle {\phi }_{0}\right\vert {{{{{{{{\mathcal{R}}}}}}}}}_{1,2}^{{{{\dagger}}} }(x)O{{{{{{{{\mathcal{R}}}}}}}}}_{1,2}(x)\left\vert {\phi }_{0}\right\rangle \\ = \,{A}_{1}^{{R}_{1,2}}f(x,\theta )+{A}_{2}^{{R}_{1,2}}\left\langle {\phi }_{0}\right\vert O\left\vert {\phi }_{0}\right\rangle \\ +{A}_{3}^{{R}_{1,2}}{{{{{{{\rm{Re}}}}}}}} \left(\left\langle {\phi }_{0}\right\vert OU(x)\left\vert {\phi }_{0}\right\rangle \right)$$
(18)

where the trainable coefficients for the \({{\mathcal{R}}}_{1}(x)\) operator are \({A}_{1}^{{R}_{1}}(\alpha )={\sin }^{2}\alpha /2,{A}_{2}^{{R}_{1}}(\alpha )={\cos }^{2}\alpha /2\) and \({A}_{3}^{{R}_{1}}(\alpha )={(-1)}^{{m}_{a}}\sin 2\alpha /2\), while those for the \({{\mathcal{R}}}_{2}(x)\) operator are \({A}_{1}^{{R}_{2}}(\alpha ,\eta )={(\sin \alpha \sin \eta )}^{2},{A}_{2}^{{R}_{2}}(\alpha ,\eta )={(\cos \alpha \cos \eta )}^{2}\) and \({A}_{3}^{{R}_{2}}(\alpha ,\eta )=(\sin 2\alpha \sin 2\eta )/2\). Such an extension offers additional degrees of freedom for the optimization process and relaxes the range of the Fourier coefficient of the new frequency component wk in equation (6) to \({A}_{3}^{{R}_{1,2}}{\sum }_{j}{\phi }_{j}^{* }{o}_{jk}{\phi }_{k}\); a similar effect holds for the other frequency components. In fact, the generalized residual outputs \({f}_{{R}_{1,2}}(x,\theta )\) can be seen as weighted versions of the residual outputs fR(x, θ), where the weight of each term is trainable.
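
Equation (18) can be checked numerically; below is our own single-qubit verification with \(U(x)={R}_{y}(x)\), O = σz and an arbitrary normalized \(\left\vert {\phi }_{0}\right\rangle\), all chosen purely for illustration.

```python
import numpy as np

I2 = np.eye(2)
sz = np.diag([1.0, -1.0]).astype(complex)
sy = np.array([[0, -1j], [1j, 0]])

def Ry(a):
    return np.cos(a / 2) * I2 - 1j * np.sin(a / 2) * sy

alpha, eta, x = 0.4, 1.1, 0.8
phi0 = np.array([0.6, 0.8], dtype=complex)           # normalized initial state
U = Ry(x)
R2 = np.cos(alpha) * np.cos(eta) * I2 + np.sin(alpha) * np.sin(eta) * U

lhs = np.real(phi0.conj() @ R2.conj().T @ sz @ R2 @ phi0)
A1 = (np.sin(alpha) * np.sin(eta)) ** 2
A2 = (np.cos(alpha) * np.cos(eta)) ** 2
A3 = np.sin(2 * alpha) * np.sin(2 * eta) / 2
rhs = (A1 * np.real(phi0.conj() @ U.conj().T @ sz @ U @ phi0)
       + A2 * np.real(phi0.conj() @ sz @ phi0)
       + A3 * np.real(phi0.conj() @ sz @ U @ phi0))
print(np.isclose(lhs, rhs))                           # True
```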

Proof of frequency combination forms

As mentioned above, there are four kinds of combination forms for frequency generation with a two-layer residual encoding. When another residual encoding layer is added, the spectrum \({\Omega }_{l=1}^{R}=\{{w}_{k}-{w}_{j},\pm {w}_{k}| j,k\in [d]\}\) is combined with the spectrum \({\Omega }_{l=2}^{R}\). We first consider the difference-of-sums components, which bring new frequency components to the three-layer residual spectrum:

$$\left\{\mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{3}{w}_{{k}_{n}},\ \pm \left(\mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{2}{w}_{{k}_{n}}\right),\ \pm \left(\mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}}-{w}_{{k}_{1}}\right),\ \mathop{\sum }_{m=1}^{2}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{2}{w}_{{k}_{n}}\right\}$$
(19)

with indices j1, j2, j3, k1, k2, k3 ∈ [d]. If we further consider the effect of the eigenvalues \(\pm {w}_{k}\in {\Omega }_{l=1}^{R}\), more frequency components are involved:

$$\left\{\pm \left(\mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{2}{w}_{{k}_{n}}\right),\ \pm \left(\mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}}-{w}_{{k}_{1}}\right),\ \pm \mathop{\sum }_{m=1}^{3}{w}_{{j}_{m}},\ \mathop{\sum }_{m=1}^{2}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{2}{w}_{{k}_{n}},\ \pm \left(\mathop{\sum }_{m=1}^{2}{w}_{{j}_{m}}-{w}_{{k}_{1}}\right)\right\}$$
(20)

We can combine the above cases of frequency generation and denote the combination form \(\pm (\mathop{\sum }_{m=1}^{{l}_{1}\ge 1}{w}_{{j}_{m}}-\mathop{\sum }_{n=1}^{{l}_{2}\ge 1}{w}_{{k}_{n}})\) by \({\mathbb{DS}}({l}_{1},{l}_{2})\), meaning the difference between the sums of two sets with l1 and l2 frequencies. Note that we denote the combination form \(\pm \mathop{\sum }_{m=1}^{l\ge 1}{w}_{{j}_{m}}\) by \({\mathbb{DS}}(l,0)\). We then find that there are six kinds of frequency combination forms for the three-layer residual encoding, summarized as \(\{{\mathbb{DS}}(3,3),{\mathbb{DS}}(3,2),{\mathbb{DS}}(3,1),{\mathbb{DS}}(3,0),{\mathbb{DS}}(2,2),{\mathbb{DS}}(2,1)\}\). Further, for an l-layer residual encoding, the spectrum with its various frequency generation forms can be formally expressed as

$${\Omega }_{l}^{R}=\left\{{\mathbb{DS}}(l,l),{\mathbb{DS}}(l,l-1),\cdots \,,{\mathbb{DS}}(l,1),{\mathbb{DS}}(l,0),\right.\\ {\mathbb{DS}}(l-1,l-1),\cdots \,,{\mathbb{DS}}(l-1,1),\\ \cdots \\ \left.{\mathbb{DS}}(\lceil l/2\rceil ,\lfloor l/2\rfloor )\right\}$$
(21)

where ⌈ ⋅ ⌉ and ⌊ ⋅ ⌋ are the ceiling and floor functions. Based on the number of items in each row of equation (21), we can determine the number of combination forms in the set as

$${\mathcal{N}}\left({\Omega }_{l}^{R}\right)=(l+1)+(l-1)+\cdots +(\lceil l/2\rceil -\lfloor l/2\rfloor +1)\\ =\frac{(l+2)+(\lceil l/2\rceil -\lfloor l/2\rfloor )}{2}\cdot \frac{(l+2)-(\lceil l/2\rceil -\lfloor l/2\rfloor )}{2}\\ =(\lceil l/2\rceil +1)(\lfloor l/2\rfloor +1)$$
(22)

It can be concluded that, compared with the traditional encoding method which generates frequencies only with \({\mathbb{DS}}(l,l)\)50, there is a squared improvement in the number of frequency generation methods for the residual encoding scheme, with \({\mathcal{N}}({\Omega }_{l}^{R})\propto {\mathcal{O}}({l}^{2})\). While different combinations may produce some of the same frequency components, in general more frequency-generation methods mean that the possible upper bound on the size of the Fourier spectrum of the quantum model outputs can be larger, allowing for more complex learning tasks. Moreover, the diverse construction methods for frequencies can also improve the flexibility of the Fourier coefficients, favoring access to higher-order components and further improving the expressivity of quantum models.
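
One way to reproduce this count is a brute-force enumeration, reading each residual layer as contributing its eigenvalue to the ket side (+), the bra side (−), or both, and identifying (a, b) with (b, a) by the ± symmetry of the spectrum; this check is our own.

```python
from itertools import product
from math import ceil, floor

def count_forms(l):
    forms = set()
    for choice in product(["both", "plus", "minus"], repeat=l):
        a = sum(c in ("both", "plus") for c in choice)    # number of + eigenvalues
        b = sum(c in ("both", "minus") for c in choice)   # number of - eigenvalues
        forms.add((max(a, b), min(a, b)))                 # +- symmetry
    return len(forms)

print([count_forms(l) for l in range(1, 6)])                        # [2, 4, 6, 9, 12]
print([(ceil(l/2) + 1) * (floor(l/2) + 1) for l in range(1, 6)])    # matches eq. (22)
```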