1 Introduction

Quantum computers have attracted much attention due to their potential impact on quantum chemistry (Aspuru-Guzik et al. 2005; Cao et al. 2019; McArdle et al. 2020; Armaos et al. 2020), machine learning (Schuld et al. 2015; Biamonte et al. 2017; Schuld and Killoran 2022), cryptography (Shor 1994, 1997; Lenstra 2000), search problems (Grover 1996), and so on. With advancements in quantum technology, commercially available quantum computers have become a reality. In principle, a fault-tolerant quantum computer could be realized with more than 10 million qubits at fidelities around 0.999 (Jones et al. 2012; Devitt et al. 2013; Gidney and Ekerå 2021). However, current devices offer on the order of 500 qubits or fewer, far fewer than required for fault-tolerant quantum computation. A more feasible scenario in the near future is the so-called noisy intermediate-scale quantum (NISQ) regime (Preskill 2018; Bharti et al. 2022).

Numerous quantum algorithms have been designed for execution on NISQ devices. Among these, variational quantum algorithms (VQAs) are considered some of the most promising applications for NISQ devices (Bharti et al. 2022; Endo et al. 2021). Specifically, quantum machine learning has emerged as an appealing use case for VQAs. As a NISQ algorithm, quantum machine learning has been predominantly investigated in the context of qubit-based systems. Recent studies have shown that data reuploading, the process of repeatedly encoding classical data into quantum circuits, is essential for achieving expressive quantum machine learning models within conventional quantum computing frameworks (Pérez-Salinas et al. 2020; Gil Vidal and Theis 2020; Schuld et al. 2021). However, data reuploading often demands substantial quantum resources, which encourages us to seek alternative approaches to expressive quantum machine learning. One option is a photonic device, where Fock states are available and the quantum resources needed for data embedding can differ from those of the conventional qubit-based approach (Killoran et al. 2019; Steinbrecher et al. 2019; Volkoff 2021; Gan et al. 2020; Liu et al. 2023).

On the other hand, the Kerr parametric oscillator (KPO) is one of the candidates for realizing quantum computation (Milburn and Holmes 1991; Wielinga and Milburn 1993; Cochrane et al. 1999). The KPO is a parametric oscillator with a large Kerr nonlinearity, which can be used to generate cat states. A KPO can be realized with superconducting resonators containing Josephson junctions (Bourassa et al. 2012; Meaney et al. 2014). The KPO is a candidate for both gate-type quantum computation (Cochrane et al. 1999; Goto 2016a; Puri et al. 2017) and quantum annealing (Goto 2016b; Puri et al. 2017), and the KPO qubit has been realized experimentally (Grimm et al. 2020). The KPO qubit is known to be highly tolerant to bit-flip errors, a property that can be exploited to reduce the overhead of fault-tolerant quantum computation (Puri et al. 2017; Masuda et al. 2022).

In this paper, we propose using the KPO for supervised machine learning with a variational algorithm. The KPO is a bosonic system, so in principle we can exploit the infinitely large Hilbert space of even a single KPO. Also, unlike the conventional approach of using parametrized gates, we use natural Hamiltonian dynamics, changing the Hamiltonian parameters to implement the variational algorithm. We numerically compare the performance of our KPO-based method with that of the conventional method using qubits.

In our method, we start from a coherent state with amplitude \(\alpha \). Importantly, we numerically find that we can tune the expressibility by changing this amplitude. Since we encode the input classical data via the detuning of the KPO, increasing the amplitude of the coherent state introduces higher-frequency terms. We expect these high-frequency terms to improve the expressibility, and we confirm this point with numerical simulations. On the other hand, as the expressibility increases, overfitting occurs more often, so our method allows us to optimize the expressibility by tuning the amplitude of the coherent state.

This paper is organized as follows. In Section 2, we review the physics of single and multiple KPO systems; the latter is called a KPO network. In Section 3, we review a standard quantum supervised machine learning algorithm as a NISQ algorithm. In Section 4, we propose a supervised machine learning algorithm for KPOs based on these ideas. In Section 5, we describe the numerical simulations performed to validate our proposed method and present the results in detail. Finally, we conclude in Section 6.

2 KPO

The KPO is a bosonic system with a nonlinear effect called the Kerr nonlinearity. Here, we first describe a single KPO and then explain a network of KPOs, which has been used for gate-type quantum computation or quantum annealing.

First, in a frame rotating at half the pump frequency of the parametric drive and in the rotating wave approximation, the Hamiltonian of the single KPO is written as (Goto 2016b, 2019)

$$\begin{aligned} \hat{H}&=\chi \hat{a}^{\dagger 2}\hat{a}^{2} + \Delta \hat{a}^{\dagger } \hat{a}\nonumber \\&\quad - p(\hat{a}^{2} + \hat{a}^{\dagger 2}) + r(\hat{a}+\hat{a}^{\dagger }), \end{aligned}$$
(1)

where \(\chi \), \(\Delta \), p, and r are the Kerr nonlinearity, the detuning, the pump amplitude of the parametric drive, and the strength of the coherent drive, respectively.

We can easily tune \(\Delta \), p, and r during the experiment by changing the parameters of the external driving fields. Although we can tune \(\chi \) by changing the magnetic flux penetrating the superconducting loop of the KPO, the dynamic range is typically small; therefore, we assume that \(\chi \) is fixed at a specific value.

The coherent state is defined by

$$\begin{aligned} \vert {\alpha }\rangle = e^{-\frac{|\alpha |^{2}}{2}}\sum _{k=0}^{\infty } \frac{\alpha ^{k}}{\sqrt{k!}}\vert {k}\rangle , \end{aligned}$$
(2)

where \(\vert {k}\rangle \) are the Fock states. In our method, the system is initially prepared in a coherent state. For a linear resonator, we can prepare a coherent state simply by adding the coherent driving term \(r(\hat{a}+\hat{a}^{\dagger })\). However, due to the term \(\chi \hat{a}^{\dagger 2}\hat{a}^{2}\) in Eq. 1, we cannot prepare a coherent state with the coherent drive alone. Instead, we can prepare the coherent state by using the KPO as follows. By setting \(p=r=0\), the Hamiltonian of Eq. 1 becomes

$$\begin{aligned} \hat{H}&=\chi \hat{a}^{\dagger 2}\hat{a}^{2} + \Delta \hat{a}^{\dagger } \hat{a}. \end{aligned}$$
(3)

If \(\Delta >\chi \) is satisfied, the ground state of this Hamiltonian is the vacuum state \(\vert {0}\rangle \). On the other hand, when \(\Delta \) and r are zero, Eq. 1 can be rewritten as

$$\begin{aligned} \hat{H} =\chi \left( \hat{a}^{\dagger 2}-\frac{p}{\chi }\right) \left( \hat{a}^{2}-\frac{p}{\chi }\right) -\frac{p^{2}}{\chi }, \end{aligned}$$
(4)

and the ground state lies in the degenerate subspace spanned by the two coherent states \(\vert {\sqrt{p/\chi }}\rangle \) and \(\vert {-\sqrt{p/\chi }}\rangle \). By adding a coherent drive as a perturbation, we can lift the degeneracy, and the ground state becomes approximately \(\vert {\sqrt{p/\chi }}\rangle \) for a negative value of r. If we prepare a vacuum state under the Hamiltonian of Eq. 3, the system is in the ground state. By adiabatically changing the Hamiltonian from Eq. 3 to Eq. 4, we obtain the coherent state \(\vert {\sqrt{p/\chi }}\rangle \) due to the adiabatic theorem. This operation is frequently used in quantum annealing with KPOs (Goto 2016a, b, 2019; Puri et al. 2017).
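As a concrete illustration, the following QuTiP sketch prepares \(\vert {\sqrt{p/\chi }}\rangle \) by a piecewise-constant (Trotterized) ramp from Eq. 3 to Eq. 4. The Kerr value follows Section 5, while the final pump amplitude, the perturbative drive, and the ramp time are illustrative assumptions rather than values fixed by the discussion above.

```python
import numpy as np
import qutip as qt

N = 25                 # Fock-space cutoff (the value used in Section 5)
chi = 0.1              # Kerr nonlinearity (Section 5 value); assumed fixed
p_max = 0.4            # final pump amplitude (assumed), so alpha = sqrt(p/chi) = 2
r_pert = -0.01         # weak negative coherent drive that lifts the degeneracy
T, steps = 200.0, 400  # ramp time and Trotter steps (assumed slow enough to be adiabatic)

a = qt.destroy(N)
psi = qt.basis(N, 0)   # vacuum: ground state of Eq. 3 when Delta > chi

for s in range(steps):
    lam = (s + 0.5) / steps            # ramp parameter, 0 -> 1
    Delta = 2 * chi * (1 - lam)        # detuning: 2*chi -> 0
    p = p_max * lam                    # pump: 0 -> p_max
    r = r_pert * lam                   # perturbation: 0 -> r_pert
    H = (chi * a.dag()**2 * a**2 + Delta * a.dag() * a
         - p * (a**2 + a.dag()**2) + r * (a + a.dag()))    # Eq. 1
    psi = (-1j * (T / steps) * H).expm() * psi

target = qt.coherent(N, np.sqrt(p_max / chi))
print("fidelity:", abs(target.overlap(psi))**2)  # approaches 1 for a slow ramp
```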

Next, the Hamiltonian of a system of multiple KPOs, called a KPO network, is written as

$$\begin{aligned} \hat{H} =&\sum _{j=1}^{K} \chi _{j}\hat{a}^{\dagger 2}_{j}\hat{a}_{j}^{2} + \Delta _{j}\hat{a}^{\dagger }_{j}\hat{a}_{j}\nonumber \\&\qquad \quad -p_{j}(\hat{a}_{j}^{2}+\hat{a}^{\dagger 2}_{j})+r_{j}(\hat{a}_{j} + \hat{a}^{\dagger }_{j})\nonumber \\&\ +\sum _{j>j'}^{K} \left( J_{jj'}\hat{a}^{\dagger }_{j}\hat{a}_{j'}+J_{jj'}^{*}\hat{a}^{\dagger }_{j'}\hat{a}_{j}\right) , \end{aligned}$$
(5)

where K denotes the number of KPOs and \(J_{jj'}\) denotes the coupling strength between KPOs. Here, we assume that the values of \(\chi _{j}\) and \(J_{jj'}\) are fixed during the experiment, while we can control the values of \(\Delta _{j}\), \(p_{j}\), and \(r_{j}\).

If \(J_{jj'}\) is zero, we can independently perform the adiabatic state preparation described above and prepare the following state:

$$\begin{aligned} \bigotimes _{j=1}^{K} \vert {\alpha _{j}}\rangle . \end{aligned}$$
(6)

Here, each \(\vert {\alpha _{j}}\rangle \) is an eigenstate of the annihilation operator \(\hat{a}_{j}\) of the j-th KPO with eigenvalue \(\alpha _{j}\).

It is worth mentioning that, even when \(J_{jj'}\) is nonzero, we can prepare the product of coherent states as follows. Let us assume that \(\Delta _j\), \(r_j\), and \(J_{jj'}\) are much smaller than \(p_j\) and \(\chi _j\). In this case, the last term of the Hamiltonian in Eq. 5 can be interpreted as a longitudinal-field Ising Hamiltonian in the coherent-state basis. If \(J_{jj'}\) is negative, we have a ferromagnetic Hamiltonian. Moreover, by setting \(J_{jj'}\) to be much smaller than \(r_j\), the state in Eq. 6 becomes a ground state, so we can prepare this state adiabatically. Also, a coupling scheme for KPOs with high fidelity has already been proposed theoretically (Goto 2019; Masuda et al. 2022; Aoki et al. 2024).

3 Quantum supervised machine learning as a NISQ algorithm

In this section, let us review quantum supervised machine learning as a preparation for introducing our model. In a supervised learning task, a training set \(\{(\varvec{x}_{m}, \varvec{y}_{m})\}_{m=1}^N\) is given. Here, each input datum \(\varvec{x}_{m}\) (output datum \(\varvec{y}_{m}\)) is a \(d_{x}\) (\(d_{y}\)) dimensional array. Suppose that there is a hidden relationship between input data \(\varvec{x}\) and output data \(\varvec{y}\), written as \(\varvec{y} = \tilde{f}(\varvec{x})\) with a function \(\tilde{f}\). The objective of the task is to find the hidden relationship \(\tilde{f}\) from the training data. More specifically, we define a model function f and optimize it by using the training data so that it becomes close to \(\tilde{f}\).

In most quantum machine learning schemes for near-term devices, a parameterized quantum circuit is used to construct the model function. More precisely, we try to minimize a cost function by tuning the parameters. Typically, we choose the mean squared error

$$\begin{aligned} L(\varvec{\theta }) = \frac{1}{N}\sum _{m=1}^{N} \left| f(\varvec{x}_{m};\varvec{\theta })-\varvec{y}_{m}\right| ^{2}, \end{aligned}$$
(7)

for the cost function. Here, N is the number of data pairs, \(f(\varvec{x};\varvec{\theta })\) is an array given as the output of the parameterized quantum circuit, and \(\varvec{\theta }\) is the corresponding parameter vector. Such a quantum machine learning scheme can be summarized as follows.

  1. Prepare an initial state \(\vert {\psi }\rangle \), and apply an input gate \(\hat{U}(\varvec{x})\) to encode the input data \(\{\varvec{x}_i\}\).

  2. Apply a parameterized unitary \(\hat{V}(\varvec{\theta })\) to the state.

  3. Measure the expectation value of an observable \(\hat{M}\), and define the model function as \(f(\varvec{x};\varvec{\theta })= \langle \hat{M}\rangle \).

  4. By repeating the above three steps, minimize the cost function L by tuning the parameter \(\varvec{\theta }\) iteratively.

The function \(f(\varvec{x};\varvec{\theta })\) is then represented as

$$\begin{aligned} f(\varvec{x};\varvec{\theta })=\langle {\psi }\vert \hat{U}^{\dagger }(\varvec{x})\hat{V}^{\dagger }(\varvec{\theta })\hat{M}\hat{V}(\varvec{\theta })\hat{U}(\varvec{x})\vert {\psi }\rangle . \end{aligned}$$
(8)

According to a previous study (Schuld et al. 2021), we cannot expect high expressibility from a parametrized quantum circuit using single-qubit rotations in the NISQ era. In fact, that study shows that, with a single qubit and single-qubit rotations, Eq. 8 can only produce a sinusoidal curve; to obtain other functions as outputs, we need either more qubits or an additional operation called data reuploading. However, neither increasing the number of qubits nor increasing the number of noisy gate operations is desirable for a NISQ algorithm.
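To make this limitation concrete, consider a single qubit with the data encoding \(\hat{U}(x)=e^{-i\pi x \hat{Z}/2}\), an illustrative choice in the spirit of Schuld et al. (2021). Writing the initial state as \(\vert {\psi }\rangle = c_{0}\vert {0}\rangle + c_{1}\vert {1}\rangle \) in the \(\hat{Z}\) eigenbasis, Eq. 8 becomes

$$\begin{aligned} f(x;\varvec{\theta }) = \sum _{k,l=0}^{1} c_{k}^{*}c_{l}\langle {k}\vert \hat{V}^{\dagger }(\varvec{\theta })\hat{M}\hat{V}(\varvec{\theta })\vert {l}\rangle e^{i\pi x(l-k)}, \end{aligned}$$

so the accessible frequencies \(l-k\) are restricted to \(\{-1,0,1\}\): a constant plus a single sinusoid in \(\pi x\). Comparing this with Eq. 19 below, the KPO replaces this two-level sum with an infinite one.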

4 Quantum supervised machine learning with KPO

We now introduce our method of using the KPO for supervised quantum machine learning. We begin by describing a simplified scenario with \(d_x=d_y=1\) using a single KPO. Next, we explain how to use the KPO network for supervised quantum machine learning with \(d_x=d_y=1\). Finally, we describe how to implement supervised quantum machine learning with \(d_x>1\) and/or \(d_y>1\) by using the KPO network.

4.1 \(d_{x} = d_{y} =1\) case

4.1.1 Single KPO

In our method, the initial state is a coherent state. To upload the classical data, we could adopt \(\hat{U}(x) = e^{-i\pi x \hat{n}}\), where \(\hat{n}=\hat{a}^{\dagger }\hat{a}\) is the number operator. However, with the KPO it is difficult to realize in situ tunability of the nonlinearity \(\chi \). Assuming that \(\chi \) is fixed during the experiment, we instead adopt the following operator to upload the classical data:

$$\begin{aligned} \hat{U}(x) = e^{-i\tilde{\chi }\hat{n}^{2}-i\pi x \hat{n}}, \end{aligned}$$
(9)

where

$$\begin{aligned} \tilde{\chi }&=t_{d}\chi ,\end{aligned}$$
(10)
$$\begin{aligned} \pi x&= t_{d}(\Delta -\chi ). \end{aligned}$$
(11)

In an actual experiment, we can easily tune the time duration \(t_{d}\) and the detuning \(\Delta \). Throughout this paper, we fix the value of \(\tilde{\chi }\).
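In other words, Eqs. 10 and 11 give a direct recipe for programming the device: for a given data value x and a fixed pulse duration \(t_{d}\), the detuning to be set is

$$\begin{aligned} \Delta = \chi + \frac{\pi x}{t_{d}}. \end{aligned}$$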

Let us define a set of unitary operators \(\hat{V}_{i}(\Delta _{i}, p_{i}, r_{i})\) as

$$\begin{aligned} \hat{V}_{i}(\Delta _{i}, p_{i}, r_{i}) = e^{-i\tau \hat{H}}, \end{aligned}$$
(12)

where \(\hat{H}\) denotes the KPO Hamiltonian of Eq. 1 with the parameters \((\Delta _{i}, p_{i}, r_{i})\), and \(\tau \) denotes the evolution time under this Hamiltonian. By switching the parameters of the Hamiltonian, we can construct a unitary operator

$$\begin{aligned} \hat{V}(\varvec{\theta })=\prod _{i=1}^{D} \hat{V}_{i}(\Delta _{i}, p_{i}, r_{i}), \end{aligned}$$
(13)

where D is the number of combinations of \((\Delta _{i}, p_{i}, r_{i})\). Here, \(\varvec{\theta }\) corresponds to the set of parameters \(\{\Delta _{i}, p_{i}, r_{i}\}_{i=1}^{D}\). For simplicity, we define

$$\begin{aligned} \theta _{k} := {\left\{ \begin{array}{ll} \Delta _{i} &{} k = 3 i -2,\\ p_{i} &{} k = 3 i - 1,\\ r_{i} &{} k = 3 i, \end{array}\right. } \end{aligned}$$
(14)

with \(i = 1, \dots , D\). We choose \(\hat{M}=\hat{a}+\hat{a}^{\dagger }\) as the observable to be measured. Since a bosonic system has an infinite-dimensional Fock space, even a single KPO may have the ability to approximate the target function, whereas the previous approach required multiple qubits to represent it.
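A minimal QuTiP sketch of this single-KPO model function is shown below. The parameter values follow Section 5 (\(\chi =0.1\), \(t_{d}=\tau =0.7\), \(D=12\), Fock cutoff 25); the initial amplitude \(\alpha =1\) and the ordering convention of the product in Eq. 13 are our assumptions.

```python
import numpy as np
import qutip as qt

N, D = 25, 12                 # Fock cutoff and number of layers (Section 5)
chi, td, tau = 0.1, 0.7, 0.7  # Kerr nonlinearity and durations (Section 5)
a = qt.destroy(N)
n = a.dag() * a
M = a + a.dag()               # observable M = a + a^dagger
psi0 = qt.coherent(N, 1.0)    # initial coherent state; alpha = 1 assumed

def U_encode(x):
    """Data-uploading unitary of Eq. 9, with chi~ = td * chi (Eq. 10)."""
    return (-1j * (td * chi * n**2 + np.pi * x * n)).expm()

def V(theta):
    """Variational unitary of Eq. 13: D layers of e^{-i tau H} (Eq. 12)."""
    V_tot = qt.qeye(N)
    for Delta, p, r in theta.reshape(D, 3):      # theta packs (Delta_i, p_i, r_i), Eq. 14
        H = (chi * a.dag()**2 * a**2 + Delta * n
             - p * (a**2 + a.dag()**2) + r * (a + a.dag()))   # Eq. 1
        V_tot = (-1j * tau * H).expm() * V_tot   # layer ordering assumed
    return V_tot

def f(x, theta):
    """Model function of Eq. 8."""
    return qt.expect(M, V(theta) * U_encode(x) * psi0)
```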

To minimize the cost function, we need to tune the parameter \(\varvec{\theta }\). For this purpose, we adopt a classical algorithm that determines how to update the parameters based on the expectation value of \(\hat{M}\).

Several types of classical algorithms can be used to update \(\varvec{\theta }\). One of them is the gradient descent method, which uses the gradient of the cost function. If the unitary operator \(\hat{V}(\varvec{\theta })\) is constructed from a sequence of parameterized gates, the so-called parameter shift rule (Mitarai et al. 2018; Wierichs et al. 2022) can be used to determine the gradient. However, since we use Hamiltonian dynamics to realize the unitary operator \(\hat{V}(\varvec{\theta })\), it is not straightforward to apply the parameter shift rule. We could instead use numerical differentiation, changing \(\varvec{\theta }\) in small increments and detecting the resulting small changes in the output \(f(x;\varvec{\theta })\). However, detecting such small changes requires a large number of measurements.

If a sufficient number of shots is not available, we can adopt a gradient-free optimizer such as the Nelder-Mead or Powell method. Throughout this paper, we use the Nelder-Mead method (Nelder and Mead 1965) for our simulations.
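Continuing the sketch above, a minimal Nelder-Mead training loop then looks as follows. The training-set construction matches the setup later used in Section 5 (N = 100 inputs in [-1, 1], Gaussian target), while the initial-guess scale is an assumption.

```python
from scipy.optimize import minimize

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 100)          # N = 100 inputs in [-1, 1] (Section 5)
ys = np.exp(-36 * xs**2)              # Gaussian target f~(x) = e^{-36 x^2}

def cost(theta):
    """Mean squared error of Eq. 7."""
    return np.mean([(f(x, theta) - y)**2 for x, y in zip(xs, ys)])

theta0 = rng.uniform(-1, 1, 3 * D)    # 36 parameters; initial scale assumed
res = minimize(cost, theta0, method="Nelder-Mead",
               options={"maxiter": 7200})   # SciPy default 200 * 36 (see Section 5)
print(res.fun, res.nit)               # final cost and iteration count
```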

Our method using a single KPO needs to access highly excited states in the Fock space, which may cause experimental difficulties. This problem may be circumvented by using a KPO network.

4.1.2 KPO network

Next, we consider the case of a KPO network. We prepare the product of coherent states (6) as the initial state. To upload the classical data, we apply the following operator to the j-th KPO:

$$\begin{aligned} \hat{U}_{j}(x) = e^{-i\tilde{\chi }_{j}\hat{n}_{j}^{2}-i\pi x \hat{n}_{j}}, \end{aligned}$$
(15)

where \(\tilde{\chi }_j=t_{d}\chi _j\) and \(\pi x= t_{d}(\Delta _j-\chi _j)\), in analogy with Eqs. 10 and 11. We define a unitary operator with 3K parameters,

$$\begin{aligned} \hat{V}(\mathbf {\Delta }, \textbf{p}, \textbf{r}) = e^{-i t_{d} \hat{H}}, \end{aligned}$$
(16)

where \(\hat{H}\) is given by Eq. 5. Here, \(\mathbf {\Delta }=(\Delta _1, \Delta _2, \cdots , \Delta _K)\), \(\textbf{p}=(p_1, p_2,\cdots ,p_{K})\) and \(\textbf{r}=(r_1,r_2,\cdots , r_K)\) are K dimensional arrays.

If we need more than 3K adjustable parameters, we can consider different combinations of \(\mathbf {\Delta }\), \(\textbf{p}\), and \(\textbf{r}\). Let us define a set of such combinations as \(\{\mathbf {\Delta }_{i}, \textbf{p}_{i}, \textbf{r}_{i}\}_{i=1}^{D}\), where D is the number of combinations. We can thus generate D different unitary operators based on Eq. 16. Implementing these sequentially, the total unitary operator is given as

$$\begin{aligned} \hat{V}(\varvec{\theta }) = \prod _{i=1}^{D}\hat{V}_{i}(\mathbf {\Delta }_{i},\textbf{p}_{i},\textbf{r}_{i}). \end{aligned}$$
(17)

Here, \(\varvec{\theta }\) corresponds to the following set of parameters:

$$\begin{aligned}&(\mathbf {\Delta }_{1},\textbf{p}_{1},\textbf{r}_{1},...,\mathbf {\Delta }_{D},\textbf{p}_{D},\textbf{r}_{D})\nonumber \\&\quad = (\Delta _{11},\Delta _{12},...,\Delta _{1K},p_{11},p_{12},...,p_{1K},...,r_{DK}). \nonumber \end{aligned}$$

After applying \(\hat{V}(\varvec{\theta })\) given by Eq. 17, we measure an observable \(\hat{M}\); for example, we can choose \(\hat{M}=\hat{a}_{1}+\hat{a}^{\dagger }_{1}\).
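A sketch of the corresponding two-KPO model function is given below, using the network values quoted later in Section 5 (\(\chi _1=\chi _2=1\), \(J_{12}=-0.1\), \(t_d=\tau =1\), \(D=6\), cutoff 10 per mode). Encoding the data on every mode and the initial amplitudes \(\alpha _j=1\) are our assumptions.

```python
import numpy as np
import qutip as qt

K, N, D = 2, 10, 6            # two KPOs, Fock cutoff 10 each, D = 6 (Section 5)
chi, J, td = 1.0, -0.1, 1.0   # chi_1 = chi_2 = 1, J_12 = -0.1, t_d = tau = 1
a = [qt.tensor(qt.destroy(N), qt.qeye(N)),
     qt.tensor(qt.qeye(N), qt.destroy(N))]
num = [aj.dag() * aj for aj in a]
M = a[0] + a[0].dag()         # local observable suggested in the text
psi0 = qt.tensor(qt.coherent(N, 1.0), qt.coherent(N, 1.0))  # Eq. 6; alpha_j = 1 assumed

def U_encode(x):
    """Eq. 15 applied to each mode (an assumption for the d_x = 1 case)."""
    H = (td * chi * num[0]**2 + np.pi * x * num[0]
         + td * chi * num[1]**2 + np.pi * x * num[1])
    return (-1j * H).expm()

def H_net(Delta, p, r):
    """KPO-network Hamiltonian of Eq. 5 for K = 2 with real J."""
    H = J * (a[0].dag() * a[1] + a[1].dag() * a[0])
    for j in range(K):
        H += (chi * a[j].dag()**2 * a[j]**2 + Delta[j] * num[j]
              - p[j] * (a[j]**2 + a[j].dag()**2) + r[j] * (a[j] + a[j].dag()))
    return H

def f(x, theta):
    """Model function: D layers of Eq. 16 composed as in Eq. 17, then <M>."""
    psi = U_encode(x) * psi0
    for Delta, p, r in theta.reshape(D, 3, K):   # 3K parameters per layer, 36 total
        psi = (-1j * td * H_net(Delta, p, r)).expm() * psi
    return qt.expect(M, psi)
```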

4.2 \(d_x>1\) and/or \(d_y>1\) case

We describe our method to implement supervised quantum machine learning with \(d_x>1\) and/or \(d_y>1\) by using the KPO network. Let us assume \(K\ge d_x\). For \(j=1,2,\cdots , d_x\), we define

$$\begin{aligned} \hat{U}_{j}(x_{j}) = e^{-i\tilde{\chi }_{j}\hat{n}_{j}^{2}-i\pi x_{j} \hat{n}_{j}}. \end{aligned}$$
(18)

To upload the classical data, we use a unitary operator of \(\prod _{j=1}^{d_x}\hat{U}_{j}(x_{j})\).

Subsequently, we apply \(\hat{V}(\varvec{\theta })\) in Eq. 17 and measure a set of observables \(\{\hat{M}_{k}\}_{k=1}^{d_{y}}\). The expectation value of \(\hat{M}_{k}\) corresponds to the k-th component of \(\varvec{y}\). By repeating these steps, we update the parameter \(\varvec{\theta }\) to minimize the cost function. In principle, we could use a single KPO with \(d_x>1\) and \(d_y>1\); we discuss such an example in Appendix C.

Fig. 1 Demonstration of quantum machine learning to represent functions. Blue dots indicate the teacher data. KPO (conventional) indicates the output of our (the conventional) method after optimization. We fit a \(e^{-36x^{2}}\), b |x|, c a square wave, and d \(0.4\sin (4\pi x)+0.5\sin (6\pi x)\)

Fig. 2 Absolute value of the Fourier transform of each function against the frequency. The blue dotted line denotes the function to be fitted. The orange (green) line denotes the output of our (the conventional) method after optimization. The functions are a \(e^{-36x^{2}}\), b |x|, c a square wave, and d \(0.4\sin (4\pi x)+0.5\sin (6\pi x)\)

4.3 Potential advantages of using KPOs

Even if we can use only a single KPO, the function obtained as Eq. 8 is expected to exhibit large expressibility. Similar to the previous study (Schuld et al. 2021), we construct the Fourier spectrum of Eq. 8 for a single KPO with \(d_x=d_y=1\).

When \(\chi \) is negligibly small, we obtain

$$\begin{aligned} f(x;\varvec{\theta })&= \langle {\alpha |e^{i\pi x \hat{n}}\hat{V}^{\dagger }(\varvec{\theta })\hat{M}\hat{V}(\varvec{\theta })e^{-i\pi x \hat{n}}|\alpha }\rangle \nonumber \\&=e^{-|\alpha |^{2}}\sum _{k, l=0}^{\infty } \langle {k|\hat{V}^{\dagger }(\varvec{\theta })\hat{M}\hat{V}(\varvec{\theta })|l}\rangle \frac{\alpha ^{l}\alpha ^{*k}}{\sqrt{k! l!}}e^{i\pi x (k-l)}, \end{aligned}$$
(19)

which is a Fourier series. Importantly, this form contains high-frequency terms, and the number of terms is infinite; this should improve the expressibility. A similar discussion was given by Gan et al. in the context of a multi-mode photonic device, which supports our claim (Gan et al. 2020).

If we can implement appropriate \(\hat{V}(\varvec{\theta })\) and \(\hat{M}\), we could represent any function that is representable by a Fourier series. Moreover, previous research shows that the Kerr nonlinearity can enhance the performance of a specific scheme of quantum machine learning (Liu et al. 2023), so our method of utilizing the Kerr nonlinearity might improve the expressibility.

On the other hand, if we use ordinary qubits, the number of high-frequency terms is limited by the finite number of qubits, which can limit the expressibility, as suggested in Schuld et al. (2021). To improve the expressibility, we could increase the number of qubits (Schuld et al. 2021) or the circuit depth. However, increasing either is difficult on a NISQ device.

5 Simulations and results

To evaluate the performance of our proposed method, we perform numerical simulations for \(d_x=d_y=1\) and compare the results of our method with those of the conventional one (Mitarai et al. 2018). Specifically, we fit \(\tilde{f}(x)=e^{-36x^{2}}\) (Gaussian), |x|, and \(0.4\sin (4\pi x)+0.5\sin (6\pi x)\). Also, we fit the square wave defined as

$$\begin{aligned} \tilde{f}(x) = {\left\{ \begin{array}{ll} 1 &{} (|x|<0.4) \\ 0 &{} (|x|\ge 0.4). \end{array}\right. } \end{aligned}$$
(20)

We create the training set as follows, with \(N=100\). First, we randomly choose values between \(-1\) and 1 and adopt them as \(x_m\). Next, for each \(x_m\), we calculate \(\tilde{f}(x_m)\) by using the given function \(\tilde{f}\) and assign this value to \(y_m\).

For our method using a single KPO, we choose \(\chi =0.1\), \(t_{d}=\tau =0.7\), \(\hat{M} = \hat{a} + \hat{a}^{\dagger }\), and \(D=12\). Also, we set the cutoff of the Hilbert-space dimension to 25.

For the conventional method (Mitarai et al. 2018), we set the depth \(D=2\), the number of qubits \(K=6\), the time step \(\tau =10\), and \(\hat{M}=2Z^{(1)}\). The precise setup of the conventional method is given in Appendix B. Here, for a fair comparison, we set the number of parameters \(\theta \) to 36, equal to that of our method.

We show the fitting results in Fig. 1. Our method approximates all functions better than the conventional method. To compare the expressibility more clearly, we define the Fourier transform

$$\begin{aligned} \hat{F}(\nu )= \frac{1}{\sqrt{2\pi }}\int _{-1}^{1} dx F(x)e^{-2\pi i\nu x}, \end{aligned}$$
(21)

for any function F(x), and we plot the absolute value of \(\hat{F}(\nu )\) in Fig. 2. As can easily be seen in (b) and (d), the results of our method contain more Fourier components than those of the conventional method.
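Numerically, Eq. 21 can be evaluated on a grid. A short sketch, reusing the model function f and the optimized parameters res.x from the training sketch in Section 4.1.1, reads:

```python
xs_grid = np.linspace(-1, 1, 401)
dx = xs_grid[1] - xs_grid[0]
fs = np.array([f(x, res.x) for x in xs_grid])   # trained model output

def fourier(nu):
    """Riemann-sum approximation of Eq. 21."""
    return np.sum(fs * np.exp(-2j * np.pi * nu * xs_grid)) * dx / np.sqrt(2 * np.pi)

spectrum = np.abs([fourier(nu) for nu in range(11)])  # compare with Fig. 2
```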

Also, in Table 1, we show the values of the cost function after optimization by our method and compare them with those of the conventional method.

Next, let us discuss the case of the KPO network for \(d_x=d_y=1\). Here, we use \(\chi _{1} = \chi _{2} = 1\), \(J_{12} = -0.1\), \(K=2\), and \(t_{d} = \tau = 1\). Also, by choosing \(D=6\), we set the total number of parameters to 36, equal to that of the single KPO.

We could choose \(\hat{M}= (\hat{a}_{1} + \hat{a}^{\dagger }_{1})\otimes (\hat{a}_{2} +\hat{a}^{\dagger }_{2}) \) for our numerical simulations. However, it is not straightforward to measure such a nonlocal observable with KPOs. Instead, we consider two observables \(\hat{M}_{1} = \hat{a}_{1} + \hat{a}^{\dagger }_{1}\) and \(\hat{M}_{2} = \hat{a}_{2} + \hat{a}^{\dagger }_{2}\), and we represent the function as \(f(x;\varvec{\theta })=\langle {\hat{M}_{1}}\rangle \langle {\hat{M}_{2}}\rangle \). We fit the Gaussian and the square wave used in the single-KPO case. Finally, we set the Hilbert-space cutoff dimension of each KPO to 10.

Table 1 Finally obtained values of the cost function
Fig. 3 Demonstration results of our quantum machine learning for \(e^{-36x^{2}}\) (a) and the square wave (b) for the cases of one and two KPOs. Left: the teacher data and the training results. Right: the average photon number \(\langle \hat{a}^{\dagger }\hat{a}\rangle \), which depends on the variable x

We plot the results in Fig. 3 and compare the performance of our method using the KPO network with that using the single KPO. The cost function after optimization for one KPO (two KPOs) is \(1.016\times 10^{-4}\) (\(9.711\times 10^{-5}\)) for the Gaussian \(e^{-36x^{2}}\) and \(1.344\times 10^{-2}\) (\(2.119\times 10^{-2}\)) for the square wave. The performance of the KPO network is similar to that of the single KPO. However, the single KPO requires access to more highly excited states than the KPO network does; therefore, the KPO network lets us avoid this experimental difficulty.

Let us explain the runtime of our scheme. During our simulations, we employed a maximum iteration count of 7200, the default setting provided by scipy.optimize.minimize (Virtanen et al. 2020) when dealing with 36 variables. The optimization terminates when the cost function meets the predefined tolerance (default, \(10^{-4}\)) or when the maximum number of iterations is reached. In either case, the optimizer returns the parameter set that minimizes the cost function; the numbers of iterations are listed in Table 2.

Table 2 Numbers of iterations
Fig. 4 Results of our quantum machine learning for the Gaussian \(e^{-36x^{2}}\) with \(N=10, 30, 300, 1000\) training data. When N is small, overfitting occurs

Table 3 Variation of the number of iterations with the number of training data
Fig. 5 Results of our quantum machine learning for the square wave with \(N=10, 30, 300, 1000\) training data. When N is small, overfitting occurs

Fig. 6 Demonstration results of our quantum machine learning for \(e^{-36x^{2}}\) (a) and the square wave (b) for \(\alpha =1, 3, 5\). Left: the teacher data and the training results. Right: the Fourier spectrum of the training results

In the KPO cases, the number of iterations is equal to or less than that in the conventional cases. The most time-consuming part of the practical runtime of a superconducting circuit is the execution time of two-qubit gates. Importantly, the coupling strength between KPOs, as demonstrated in previous work (Yamaji et al. 2022), is approximately \(10\,\textrm{MHz}\), similar to that of superconducting transmon qubits (Stehlik et al. 2021). Consequently, the runtime of our method using KPOs is comparable to that of the conventional approach using transmon qubits.

We show how our fitting results depend on the number of training data N in Figs. 4 and 5, and the corresponding iteration counts in Table 3. For small N, our method appears susceptible to overfitting due to its inherently high expressiveness. Fortunately, we can regulate this expressiveness, and thereby reduce the impact of overfitting, by adjusting the photon number of the initial coherent state, as we show in Section 5.1.

5.1 \(\alpha \) and expressive power

From Eq. 19, we find a tendency that, as we increase (decrease) \(\alpha \), more (fewer) high-frequency terms contribute. Therefore, we expect to be able to control the expressibility by tuning the amplitude of the initial coherent state.

We confirmed this point by numerical simulations. In Fig. 6, we show numerical simulations for \(\alpha = 1, 3,\) and 5 with supervised data generated from two functions, a Gaussian and a square wave. Only for \(\alpha = 5\) do we change the cutoff dimension of the Hilbert space from 25 to 100, because the average photon number is 25.

In machine learning, there is a trade-off between expressive power and overfitting: as we increase the expressibility, the problem of overfitting becomes more severe. In our method, we can tune the parameter \(\alpha \) to choose the best operating point for the fitting.

To illustrate this concept, we performed numerical simulations in which we varied the photon number of the initial coherent state. As mentioned above, in Fig. 4, overfitting occurs for a small number of training data N. We apply our expressibility-tuning method to this case. The results in Fig. 7 highlight that reducing the photon number of the initial coherent state effectively mitigates the impact of overfitting.

Fig. 7 Results of our quantum machine learning for the Gaussian (left) and the square wave (right) with \(N=10\), where we vary the photon number of the initial coherent state. Reducing the photon number successfully mitigates the impact of overfitting

6 Conclusions and discussion

In conclusion, we have proposed using the KPO for quantum supervised machine learning with variational quantum circuits. We numerically showed that, even with a single KPO, the expressibility of our method is higher than that of the conventional method with six qubits. In our method, we can tune the amplitude of the initial coherent state, and we numerically showed that the expressibility increases with the amplitude.

In this paper, we provided a proof of concept using a regression problem as an example. Owing to its expressive nature, our method may also offer advantages for other machine learning problems, including classification, generation, reinforcement learning, and sequential learning. Furthermore, the quantum kernel method (Havlíček et al. 2019) could be another promising application of our approach, as our methodology for encoding data into quantum states introduces new types of quantum kernels. Exploring these applications is a promising direction for future research.

In the NISQ era, it is crucial to implement algorithms with fewer resources, and our results on using the KPO will contribute to reducing resource requirements. The KPO network may also be used as a variant of a continuous-variable neural network (Killoran et al. 2019). There are many potential applications of the continuous degrees of freedom of the KPO, and we hope that our research will help to expand its range of applications.