1 Introduction

A machine learning algorithm uses training data as input, learns the rules and properties inherent in the data, and solves problems such as regression and classification for unknown data. These algorithms are currently used in various fields, such as recommendation systems for e-commerce sites (Sheikh et al. 2019) and reaction prediction for chemical substances (Stocker et al. 2020), and represent one of the key technologies supporting artificial intelligence.

Domingos, a machine learning researcher, categorized machine learning researchers into five major tribes: symbolists, connectionists, evolutionists, Bayesians, and analogists (Domingos 2015). Each tribe has a different view of learning and a different core algorithm. Connectionists model learning on the operating principles of the brain and learn by adjusting the strength of connections between neurons. Analogists see recognizing similarities between situations as the key to learning and thus to reasoning about new situations. Their core algorithm is the support vector machine (SVM), a machine learning algorithm proposed in Cortes and Vapnik (1995). It is a general-purpose algorithm that can be applied to regression problems, classification problems, and so on, and it finds the optimal solution by maximizing the margin.

The use of SVM in combination with the kernel method, which uses a kernel function to represent the similarity between the data, enables data to be mapped into a feature space for training and prediction. The use of a kernel trick enables various feature mapping functions to be virtually used without explicitly computing high-dimensional feature mappings. This is a major advantage of the kernel method, and the flexibility to choose a feature mapping function for the target problem makes it possible to use the appropriate feature space for the problem. This improves the classification accuracy of the learned model. Another advantage of the kernel method is that it has clear theoretical properties, such as the fact that learning with SVM is often formulated as a convex optimization problem (Bishop 2006), which guarantees that a global optimal solution can be found.

There are, however, disadvantages of the kernel method, such as its computational complexity. When calculating the inner product of the data using the feature mapping function, the amount of calculation is substantially higher when using large-scale datasets, which may become a bottleneck in the training process. Connectionist approaches using deep neural networks are thus often used for large-scale datasets, but they also have disadvantages, such as model complexity (which makes it impossible to guarantee the optimality of the learning results), low interpretability, and a lack of theoretical understanding of generalization performance. For these situations, recently, some research has focused on the dynamics of neural networks in learning (Jacot et al. 2018).

The use of various machine learning models has been increasing along with the growth of big data. However, the increasing amount of data that is handled increases the time it takes to train machine learning models. Therefore, it is important to develop models that are scalable with the amount of training data.

Quantum computing is used in various fields, including quantum mechanics (Nielsen and Chuang 2010). In quantum computing, the key to gaining an advantage over classical computing is to make effective use of two unique properties: superposition and entanglement. Superposition is the ability to maintain multiple states simultaneously, which is generally not easy to do with classical computing. Superposition makes it possible to compute in parallel and to use amplitudes, or degrees of overlap, in calculations. Entanglement enables a dependency to be introduced between the states of each qubit, making it possible to create more complex superposition states.

Various research has been conducted on high-performance algorithms that use these properties, including Shor's algorithm (Shor 1994), used for prime factorization, and Grover's algorithm (Grover 1996), used for the fast search of unordered data. Many other algorithms have been proposed that can reduce computational complexity compared to the currently used algorithms. Some groups (Arute et al. 2019; Zhong et al. 2020) have recently published research results showing that quantum computing is faster than classical computing on certain tasks. There are also quantum-inspired methods, which are classical methods based on the ideas behind existing quantum methods. Liu et al. (2021) tackles the computational task of Arute et al. (2019) with classical methods and shows that they can complete the same task almost as fast as the quantum approach.

In quantum machine learning, exponential speedup based on the Harrow-Hassidim-Lloyd (HHL) algorithm has attracted much attention, and various models have been developed. Machine learning algorithms that aim at quadratic speedup using Grover's algorithm (Grover 1996) have also been developed. A recent trend is not only to use fast quantum algorithms to accelerate machine learning subroutines but also to theoretically identify machine learning tasks or domains that are relatively easy for quantum computing but difficult for classical computing. As a result of these algorithmic developments, expectations for quantum computing have been rising in recent years.

Hardware for quantum computers has also been the subject of intense research. A variety of materials for quantum computers are being considered, including superconductors, ion traps, and light. Each material has certain advantages and disadvantages in terms of performance, such as ease of integration, the number of clocks, and the fidelity of gate operations, and it remains to be seen which material will bring the greatest advantage.

Despite the rising expectations, the problem settings in which quantum computing can demonstrate its superiority over classical computing are very limited at present because of the noise factor and limited hardware scalability. This makes finding useful applications a major issue. The use of quantum computing in machine learning should speed up the linear algebra operations used in the computation and is thus a candidate application. The quantum SVM (Rebentrost et al. 2014) is an algorithm for quantum machine learning and, under certain conditions, it can exponentially reduce computational complexity compared to existing SVMs. The implementation of quantum SVM is studied in Li et al. (2015) and Yang et al. (2019).

We have devised a quantum SVM with the Kronecker kernel for the pairwise classification problem, which is the problem of determining whether two data points satisfy a certain relationship. This problem has been studied with a number of classical computing methods. The Kronecker kernel, one of the kernels used in pairwise classification, is difficult to apply to large-scale data in classical computing because the Kronecker product of the kernel matrices is huge. In quantum computing, on the other hand, the Kronecker product of matrices can be expressed simply by using the tensor product of quantum states, so the computational complexity is smaller than that in classical computing. This makes it easier to scale up to large-scale data. We thus hypothesize that the pairwise problem is a candidate problem setting in which quantum computing may have an advantage.

2 Background

2.1 Methods for the simple classification problem

Before discussing the pairwise classification problem, we first discuss a simple version of the problem to make the subsequent explanation easier to understand. The classification problem is to predict the label assignment of the test data using training data for which the label assignment is already known. Here, we focus on the binary classification problem, the simplest classification problem. Let \(d \in \mathbb {N}\) be the dimension of the feature vector, the set \(\mathcal {X} \subset \mathbb {R}^d\) be a dataset, vector \(\varvec{x} \in \mathcal {X}\) be a data point, and sets \(C_+,C_- \subset \mathbb {R}^d\) be two classes that partition \(\mathbb {R}^d\); i.e., \(C_+ \cup C_- = \mathbb {R}^d\) and \(C_+ \cap C_- = \emptyset\). Function \(f:\mathbb {R}^d \rightarrow \mathbb {R}\) is then defined as

$$\begin{aligned} f(\varvec{x}) = \left\{ \begin{array}{ll} +1 &{} (\varvec{x} \in C_+) \\ -1 &{} (\varvec{x} \in C_-). \end{array} \right. \end{aligned}$$

In the binary classification problem, the values of \(\varvec{x}\) and \(f(\varvec{x})\) are given for some \(\varvec{x}\) as a training dataset but not for a test dataset. The goal of the binary classification problem is, for a new data point \(\tilde{\varvec{x}} \in \mathbb {R}^d\) in the test dataset, to predict the value of \(f(\tilde{\varvec{x}})\).

2.1.1 Kernel method

One approach to solving classification problems is to use the kernel method (Schölkopf et al. 2002), which enables classification in a high-dimensional space by replacing the inner product of data points in the model formula with a kernel function. Different kernel functions yield different classification boundaries.

Let k be a function of two variables over \(\mathbb {R}^d\) (this explanation also applies when k is a two-variable function over an arbitrary set X; for simplicity, we describe here the case \(X = \mathbb {R}^d\)). Function k is a kernel function over \(\mathbb {R}^d\) if it satisfies two conditions:

  (i)

    symmetry, i.e.,

    $$\begin{aligned} k(\varvec{x},\varvec{y}) = k(\varvec{y},\varvec{x}) \end{aligned}$$

    holds for all \(\varvec{x},\varvec{y} \in \mathbb {R}^d\).

  (ii)

    positive semidefiniteness, i.e.,

    $$\begin{aligned} \sum \limits _{i,j=1}^n c_i c_j k(\varvec{x}_i,\varvec{x}_j) \ge 0 \end{aligned}$$

    holds for any \(n \in \mathbb {N}\) and all \(\{\varvec{x}_1,\ldots ,\varvec{x}_n\} \subset \mathbb {R}^d\), \(\{c_1,\ldots ,c_n\} \subset \mathbb {R}\).

Here we define \(\langle \cdot |\cdot \rangle\) as the standard inner product in \(\mathbb {R}^d\); i.e., for \(\varvec{x},\varvec{y} \in \mathbb {R}^d\) such that \(\varvec{x} = [x_1,\ldots ,x_d]^T\), \(\varvec{y} = [y_1,\ldots ,y_d]^T\),

$$\begin{aligned} \langle \varvec{x} |\varvec{y} \rangle = \sum \limits _{i=1}^d x_i y_i. \end{aligned}$$

Then, for any mapping \(\Phi :\mathbb {R}^d \rightarrow {\mathcal {H}}\) (Hilbert space),

$$\begin{aligned} k(\varvec{x},\varvec{y}) = \langle \Phi (\varvec{x}) |\Phi (\varvec{y}) \rangle \end{aligned}$$

is a kernel function. From this, we can evaluate inner products of elements of \(\mathbb {R}^d\) mapped into \({\mathcal {H}}\) by using k. An example is the linear kernel \(k(\varvec{x},\varvec{y})=\langle \varvec{x} |\varvec{y} \rangle\), the simplest type of kernel. There are many other kinds of kernel functions, such as the polynomial kernel \(k(\varvec{x},\varvec{y})=(\langle \varvec{x} |\varvec{y} \rangle + b)^{c}\) \((b \in \mathbb {R}, c \in \mathbb {N})\), which is an extension of the linear kernel, and the Gaussian kernel \(k(\varvec{x},\varvec{y})=\exp (-\gamma \Vert \varvec{x} - \varvec{y}\Vert ^2 )\), where \(\gamma \in \mathbb {R}\) is a hyperparameter.

This means that the kernel best suited to solving the problem can be selected. The mapping \(\Phi\) is called a feature mapping, and evaluating inner products in the mapped space through k without explicitly computing \(\Phi\) is called the kernel trick; the SVM model is based on this trick.

We also define the \(n \times n\) matrix \(K_{\mathcal {X}}\) for the dataset \(\mathcal {X} = \{\varvec{x}_1,\ldots ,\varvec{x}_n \}\) so that the (i, j) element of \({K_{\mathcal {X}}}\) is \(k(\varvec{x}_i, \varvec{x}_j) = \langle \Phi (\varvec{x}_i) |\Phi (\varvec{x}_j) \rangle\). We call \(K_{\mathcal {X}}\) the kernel matrix over \(\mathcal {X}\).
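
As an illustration (ours, not taken from any library and not part of the original derivation), the following Python sketch computes the kernel functions above and assembles the kernel matrix \(K_{\mathcal {X}}\) for a small toy dataset; the function names and data are illustrative only.

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = <x|y>
    return float(np.dot(x, y))

def polynomial_kernel(x, y, b=1.0, c=2):
    # k(x, y) = (<x|y> + b)^c
    return float((np.dot(x, y) + b) ** c)

def gaussian_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def kernel_matrix(X, k):
    # The (i, j) element of K_X is k(x_i, x_j)
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

# Small example: three 2-dimensional data points
X = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(kernel_matrix(X, linear_kernel))   # symmetric positive semidefinite 3x3 matrix
```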

2.1.2 Least squares support vector machine

The least squares support vector machine (LS-SVM) is a classical classification model derived from the SVM, proposed in Suykens and Vandewalle (1999). In this section, we consider an LS-SVM model for the binary classification problem. For a training data point \(\varvec{x}\in \mathbb {R}^d\), we assume that the correct label \(t\in \{+1,-1\}\) is given. In this case, the linear model

$$\begin{aligned} y(\varvec{x})=\varvec{w}^T \phi (\varvec{x})+b \end{aligned}$$
(1)

can be used to estimate the label for the data point \(\varvec{x}\). Here, \(b\in \mathbb {R}\) is the bias parameter and \(\varvec{w} = [w_1,\ldots ,w_{\dim \mathcal {H}}] \in {\mathcal {H}}\) \((\dim \mathcal {H}\) is the dimension of Hilbert space \(\mathcal {H}\)) is the weight vector. The function \(\phi : \mathbb {R}^d \rightarrow {\mathcal {H}}\), the feature mapping function, is used to move the raw data into another (usually high-dimensional) space, where it may become linearly separable. The sign of \(y(\varvec{x})\) calculated using these values is then used as the prediction of the classification class. That is, prediction label \(\tilde{y}(\varvec{x})\) for data point \(\varvec{x}\) is given by

$$\begin{aligned} \tilde{y}(\varvec{x}) = \left\{ \begin{array}{ll} +1 &{} (y(\varvec{x}) \ge 0) \\ -1 &{} (y(\varvec{x}) < 0 ). \end{array}\right. \end{aligned}$$

In the LS-SVM problem setting, the training problem is formulated as the mathematical optimization problem

$$\begin{aligned} \underset{\varvec{w},b,\varvec{e}}{\min } L(\varvec{w},b,\varvec{e}) = \frac{1}{2}\varvec{w}^T \varvec{w} + \frac{\gamma }{2} \sum \limits _{i=1}^n {e_i}^2 \end{aligned}$$

subject to the constraints

$$\begin{aligned} y_i\left\{ \varvec{w}^T \phi (\varvec{x}_i) + b \right\} = 1-e_i. \end{aligned}$$

Given \(y_i \in \{+1,-1\}\), so that \({y_i}^2 = 1\), we have

$$\begin{aligned} {e_i}^2 &= \left[ 1-\left( y_i\left\{ \varvec{w}^T \phi (\varvec{x}_i) + b\right\} \right) \right] ^2 \\ &= \left\{ {y_i}^2-\left( y_i\left\{ \varvec{w}^T \phi (\varvec{x}_i) + b\right\} \right) \right\} ^2 \\ &= \left( y_i - \left\{ \varvec{w}^T \phi (\varvec{x}_i) + b\right\} \right) ^2 \end{aligned}$$

holds, and we can thus set \(e_i = y_i - \left\{ \varvec{w}^T \phi (\varvec{x}_i) + b\right\}\). Solving this problem using the Lagrange multiplier method,

$$\begin{aligned} \varvec{w} = \sum \limits _{i=1}^n a_i \phi (\varvec{x}_i) \end{aligned}$$

holds by the condition that \(\frac{\partial L'}{\partial \varvec{w}} = 0\) for

$$\begin{aligned} L'(\varvec{w},b,\varvec{e},\varvec{a}) = \frac{1}{2}\varvec{w}^T \varvec{w} + \frac{\gamma }{2} \sum \limits _{i=1}^n {e_i}^2 - \sum \limits _{i=1}^n a_i \left( e_i - y_i + \left\{ \varvec{w}^T \phi (\varvec{x}_i) + b\right\} \right) . \end{aligned}$$

Then, prediction model (1) can be rewritten as

$$\begin{aligned} y(\varvec{x})=\sum \limits _{i=1}^n a_i{\phi (\varvec{x}_i)}^T{\phi (\varvec{x})}+b=\sum \limits _{i=1}^n a_i \langle {\phi (\varvec{x}_i)} |{\phi (\varvec{x})} \rangle +b, \end{aligned}$$
(2)

where \(a_1,\ldots ,a_n \in \mathbb {R}\) and \(b \in \mathbb {R}\) are parameters that need to be obtained in order to perform the classification. In the LS-SVM model, the parameters are obtained by reducing the problem to a system of linear equations and solving it. Specifically, we solve

$$\begin{aligned} \left[ \begin{array}{cc} 0 &{} \varvec{1}^T \\ \varvec{1} &{} K_{\mathcal {X}}+\gamma ^{-1}I_n \\ \end{array}\right] \left[ \begin{array}{c} b\\ \varvec{a}\\ \end{array} \right] = \left[ \begin{array}{r} 0\\ \varvec{t}\\ \end{array}\right] , \end{aligned}$$
(3)

where \(\varvec{1} = [1,\ldots ,1]^T\), \(\varvec{a}=[a_1,\ldots ,a_n]^T\), \(\varvec{t}=[t_1,\ldots ,t_n]^T\), and \(K_{\mathcal X}\) is the kernel matrix. Here, \(\gamma\) is a positive hyperparameter, and \(I_n\) is the \(n \times n\) identity matrix. We define the matrix \(F_{\mathcal {X}}\) as

$$\begin{aligned} F_{\mathcal {X}} = \left[ \begin{array}{cc} 0 &{} \varvec{1}^T \\ \varvec{1} &{} K_{\mathcal {X}}+\gamma ^{-1}I_n \\ \end{array}\right] . \end{aligned}$$

Then Eq. (3) can be expressed as

$$\begin{aligned} F_{\mathcal {X}} \left[ \begin{array}{c} b\\ \varvec{a}\\ \end{array}\right] = \left[ \begin{array}{r} 0\\ \varvec{t}\\ \end{array} \right] . \end{aligned}$$

We can now solve Eq. (3) by inverse matrix calculation; then

$$\begin{aligned} \left[ \begin{array}{c} b\\ \varvec{a}\\ \end{array}\right] = (F_{\mathcal {X}})^{-1} \left[ \begin{array}{r} 0\\ \varvec{t}\\ \end{array}\right] \end{aligned}$$
(4)

follows (we can choose the hyperparameter \(\gamma\) so that \(F_{\mathcal {X}}\) is invertible). Doing this gives the values of the parameters \(\varvec{a}\) and b, and the prediction can be made using formula (2).
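
The training and prediction steps described by Eqs. (2)-(4) can be summarized in a minimal NumPy sketch (ours; the toy data and function names are illustrative, and in practice a dedicated SVM library would be used).

```python
import numpy as np

def train_lssvm(X, t, k, gamma=1.0):
    """Solve Eq. (3) for the bias b and coefficients a (cf. Eq. (4))."""
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # kernel matrix K_X
    F = np.zeros((n + 1, n + 1))
    F[0, 1:] = 1.0                                         # first row: [0, 1^T]
    F[1:, 0] = 1.0                                         # first column: [0, 1]^T
    F[1:, 1:] = K + np.eye(n) / gamma                      # K_X + gamma^{-1} I_n
    sol = np.linalg.solve(F, np.concatenate(([0.0], t)))   # [b, a]^T = F_X^{-1} [0, t]^T
    return sol[0], sol[1:]

def predict_lssvm(x_new, X, b, a, k):
    """Evaluate Eq. (2) and return the sign as the predicted label."""
    y = sum(a_i * k(x_i, x_new) for a_i, x_i in zip(a, X)) + b
    return +1 if y >= 0 else -1

# Toy binary classification data (labels t in {+1, -1})
X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
t = np.array([+1, +1, -1, -1])
k = lambda u, v: float(np.dot(u, v))                       # linear kernel
b, a = train_lssvm(X, t, k, gamma=10.0)
print(predict_lssvm(np.array([0.9, 1.1]), X, b, a, k))     # a point near the positive class
```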

2.1.3 Quantum support vector machine

The quantum SVM, proposed in Rebentrost et al. (2014), is a classification model that uses quantum computing. The basic idea is the same as that of the LS-SVM classification model. Its use of quantum computing speeds up the calculation of the inverse matrix, a computationally expensive part of the classical LS-SVM algorithm.

The whole model of quantum SVM is shown in Fig. 1. This model is composed of four registers, and qubits in the first and fourth registers are used as ancilla qubits. Here, we note that the figures of quantum circuits are drawn using Quantikz (Kay 2019) in this paper. The behavior of the quantum SVM can be summarized as follows:

  1.

    Prepare a quantum state from the training data. The circuit is shown in Fig. 2. At this stage, the training data are pre-normalized and encoded in the third register. When a training data point is \(\varvec{x} = [x_1,\ldots ,x_n]^T\) and \(C=\sum _{i=1}^{n} |{x_i}|^2\), it is encoded as \(|\varvec{x} \rangle = \frac{1}{\sqrt{C}} \sum _{i=1}^n x_i |i \rangle\). After encoding, the parameters for prediction are obtained by using the training data encoded in the quantum state and applying the HHL algorithm (Harrow et al. 2009). This algorithm is a powerful quantum algorithm that enables the inverse matrix to be calculated more quickly than classical methods. It solves linear equations of the form \(A |\varvec{y} \rangle = |\varvec{b} \rangle\) in a quantum state, and the result is the quantum state corresponding to \(|\varvec{y} \rangle = A^{-1} |\varvec{b} \rangle\). Using the HHL algorithm and the quantum state corresponding to \([0 \, \varvec{t}]^T\), the state corresponding to \([b \, \varvec{a}]^T\) is obtained as a quantum state \(|\varvec{a} \rangle\) (see Eq. 4). Lastly, amplitude encoding of the training data is applied in the second register.

  2.

    Prepare a quantum state from the test data. The circuit is shown in Fig. 3. Amplitude encoding of the test data is applied in the second register, and Hadamard gates are applied in the third register to create a superposition state. An X gate is applied in the first register so that the state matches the \(|1 \rangle\) flag indicating that the HHL algorithm succeeded, which allows the inner product with the training state to be taken.

  3.

    Determine the sign of the inner product between the quantum state \(|\tilde{\varvec{x}} \rangle\) corresponding to data point \(\tilde{\varvec{x}}\) of the test data and the state \(A^{-1}|b\rangle = |\varvec{a} \rangle\) obtained in step 1. A swap test is performed to estimate the inner product of the parameters and the new data point. Since the sign of the inner product is what matters in classification with the SVM, a variant of the swap test is used to obtain the sign of the inner product. We call this the signed swap test; the method was proposed in Zhao et al. (2019).

Fig. 1

Outline of quantum SVM model comprising three main parts: train, test, and signed swap test (prediction). In the train part, the quantum state is calculated using training data, and training data is encoded as a quantum state for prediction. In the test part, test data is encoded as a quantum state. In the signed swap test part, measurement is repeated many times and the inner product of the states is estimated from the result of measurement

Fig. 2

Outline of the train part in the existing quantum SVM model. First, amplitude encoding (AE) encodes the training data labels as amplitudes in the third register, and the state becomes \(|0 \rangle |0 \rangle |\varvec{t} \rangle\). Applying the HHL algorithm with \(K_{\mathcal {X}}\) then gives the state \(|1 \rangle |0 \rangle |\varvec{a} \rangle\); the \(|1 \rangle\) in the first register indicates success of the HHL algorithm. Lastly, amplitude encoding in the second register gives the state \(|1 \rangle |\varvec{x} \rangle |\varvec{a} \rangle\)

Fig. 3

Outline of the test part in the existing quantum SVM model. An X gate is applied to the first register to match the \(|1 \rangle\) in the first register of the train part. Amplitude encoding in the second register encodes the test data. A Hadamard gate on the third register creates the superposition state used to evaluate the inner product

The signed swap test is performed as follows. We assume that \(|\psi _1 \rangle\) and \(|\psi _2 \rangle\) below are real-valued vectors, and the goal is to obtain the (signed) value of \(\langle \psi _1 |\psi _2 \rangle\).

  1.

    Prepare the state \(\frac{1}{\sqrt{2}} (|0 \rangle |\psi _1 \rangle + |1 \rangle |\psi _2 \rangle )\).

  2.

    Apply a Hadamard gate to the first qubit so that the state changes to \(|\psi ' \rangle = \frac{1}{2} ((|0 \rangle + |1 \rangle )|\psi _1 \rangle + (|0 \rangle - |1 \rangle ) |\psi _2 \rangle ) = \frac{1}{2} (|0 \rangle (|\psi _1 \rangle + |\psi _2 \rangle ) + |1 \rangle (|\psi _1 \rangle - |\psi _2 \rangle ))\).

  3.

    Measure the first qubit. The probability of obtaining 0 from the first qubit is \(p_0 = |\langle 0 |\psi ' \rangle |^2 = \frac{1}{4}(\langle \psi _1 |+ \langle \psi _2 |)(|\psi _1 \rangle + |\psi _2 \rangle ) = \frac{1}{2}+\frac{1}{2}\langle \psi _1 |\psi _2 \rangle\). This means that \(\langle \psi _1 |\psi _2 \rangle = 2 p_0 - 1\). Therefore, we can estimate the value of the inner product of \(|\psi _1 \rangle\) and \(|\psi _2 \rangle\) by repeatedly measuring the first qubit (on freshly prepared copies of the state) and estimating the probability of obtaining 0.

The final prediction class is determined on the basis of the sign of the inner product obtained in this way. This means, for example, that we can predict that the label is \(+1\) when the value of \(y(\varvec{x}) \ge 0\) and \(-1\) otherwise.
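
The measurement statistics behind the signed swap test can be checked with a small statevector calculation. The sketch below (ours, illustrative only) assumes the state \(\frac{1}{\sqrt{2}} (|0 \rangle |\psi _1 \rangle + |1 \rangle |\psi _2 \rangle )\) has already been prepared and reproduces only the arithmetic of steps 2 and 3.

```python
import numpy as np

def signed_swap_test_p0(psi1, psi2):
    """Return the probability of measuring 0 on the ancilla qubit.

    Assumes psi1 and psi2 are real-valued unit vectors and that the state
    (1/sqrt(2)) (|0>|psi1> + |1>|psi2>) has already been prepared.
    """
    dim = len(psi1)
    # Joint state on (ancilla ⊗ data register)
    state = (np.kron([1.0, 0.0], psi1) + np.kron([0.0, 1.0], psi2)) / np.sqrt(2)
    # Hadamard on the ancilla qubit
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    state = np.kron(H, np.eye(dim)) @ state
    # Probability of outcome 0 on the ancilla = squared norm of the |0> block
    return float(np.sum(state[:dim] ** 2))

psi1 = np.array([1.0, 0.0])
psi2 = np.array([0.6, 0.8])
p0 = signed_swap_test_p0(psi1, psi2)
print(2 * p0 - 1)   # ≈ <psi1|psi2> = 0.6, with its sign
```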

2.2 Pairwise classification problem

Pairwise classification, investigated in Oyama and Manning (2004), is the problem of determining whether two data points given as input satisfy a certain relationship. This formulation is used in many problem settings, such as link prediction (Yokoi et al. 2017) and chemical interaction prediction (Ben-Hur et al. 2005). It is used, for example, to determine whether user \(\varvec{x}\) likes item \(\varvec{z}\) and whether chemical compounds \(\varvec{x}\) and \(\varvec{z}\) react chemically.

Here, we define the pairwise classification problem. Let \(d_1,d_2 \in \mathbb {N}\), \(\mathcal {X} \subset \mathbb {R}^{d_1}\) and \(\mathcal {Z} \subset \mathbb {R}^{d_2}\) be datasets, \(\varvec{x} \in \mathcal {X}\) and \(\varvec{z} \in \mathcal {Z}\) be \(d_1\)-dimensional and \(d_2\)-dimensional data points, and \(\varvec{R}\) be some relation over \(\mathcal {X} \times \mathcal {Z}\). We define \((\varvec{x},\varvec{z}) \in \mathcal {X} \times \mathcal {Z}\) as a pair of data points and define the function f as

$$\begin{aligned} f(\varvec{x},\varvec{z}) = \left\{ \begin{array}{ll} +1 &{} (\varvec{x} \, \text {and} \, \varvec{z} \, \text {satisfy relation} \, \varvec{R}) \\ -1 &{} (\text {otherwise}) \end{array}\right. . \end{aligned}$$

In the pairwise classification problem, values of \(f(\varvec{x},\varvec{z})\) are given for the training datasets but not for the test datasets. The goal is, for a new data point \((\tilde{\varvec{x}}, \tilde{\varvec{z}}) \in \mathbb {R}^{d_1} \times \mathbb {R}^{d_2}\), to predict the value of \(f(\tilde{\varvec{x}}, \tilde{\varvec{z}})\). In the training data, information about the class to which each data point belongs is not given. Instead, information is given about whether two data points satisfy a certain relationship, which requires a model suitable for pairwise classification.
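
As a concrete example (the relation below is an arbitrary illustrative one, not one of the datasets used later), a training set for pairwise classification can be laid out as pairs from \(\mathcal {X} \times \mathcal {Z}\) together with the value of f for each pair.

```python
import numpy as np

# Toy pairwise training data: the (arbitrary, illustrative) relation R is
# "the two points share the same sign of their first coordinate".
X = np.array([[1.0, 0.2], [-0.5, 1.0]])
Z = np.array([[0.8, -0.3], [-1.0, 0.1]])

def f(x, z):
    return +1 if np.sign(x[0]) == np.sign(z[0]) else -1

pairs = [((x, z), f(x, z)) for x in X for z in Z]   # all n * n labelled pairs
for (x, z), label in pairs:
    print(x, z, label)
```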

2.3 Kronecker kernel

The Kronecker kernel is used to solve problems such as pairwise classification. When two pairs of data points \(((\varvec{x}_{1},\varvec{z}_{1}),(\varvec{x}_{2},\varvec{z}_{2}))\) are extracted from the product set \(\mathcal {X} \times \mathcal {Z}\) (\(\varvec{x}_1, \varvec{x}_2 \in \mathcal {X}\), \(\varvec{z}_1, \varvec{z}_2 \in \mathcal {Z}\), \((\varvec{x}_{1},\varvec{z}_{1}),(\varvec{x}_{2},\varvec{z}_{2}) \in \mathcal {X} \times \mathcal {Z}\)), the Kronecker kernel function \(k_{\otimes }\) is defined as

$$\begin{aligned} k_{\otimes }((\varvec{x}_{1},\varvec{z}_{1}),(\varvec{x}_{2},\varvec{z}_{2})) = k_{\mathcal {X}}(\varvec{x}_1, \varvec{x}_2) k_{\mathcal {Z}}(\varvec{z}_1, \varvec{z}_2). \end{aligned}$$

This function satisfies the two conditions discussed in Section 2.1.1. On the basis of this representation, the Kronecker kernel matrix can be expressed as

$$\begin{aligned} K_{\mathcal {X} \otimes \mathcal {Z}} = K_{\mathcal {X}} \otimes K_{\mathcal {Z}}, \end{aligned}$$

where \(\otimes\) means the Kronecker product. The Kronecker product of matrices A and B is an \(r_A r_B \times c_A c_B\) matrix defined as

$$\begin{aligned} (A \otimes B)_{(i-1) r_B+k,\,(j-1) c_B+l} = A_{i,j} B_{k,l}, \end{aligned}$$

where A is an \(r_A \times c_A\) matrix and B is an \(r_B \times c_B\) matrix. A standard kernel matrix stores the inner products between data points, while the Kronecker kernel matrix stores the inner products between data pairs. From the above definition, the inner product of \(Pair1 = (\varvec{x}_{1},\varvec{z}_{1})\) and \(Pair2 = (\varvec{x}_{2},\varvec{z}_{2})\) can be expressed as

$$\begin{aligned} \langle Pair1 |Pair2 \rangle = \langle \varvec{x}_1 |\varvec{x}_2 \rangle \langle \varvec{z}_1 |\varvec{z}_2 \rangle . \end{aligned}$$

This expression of the inner product of data pairs intuitively tells us that, when \(\varvec{x}_1\) and \(\varvec{x}_2\) are similar and \(\varvec{z}_1\) and \(\varvec{z}_2\) are similar, then Pair1 and Pair2 can be considered similar. In the later discussion, to simplify the problem, we assume that \(\mathcal {X} = \mathcal {Z}\). In the general case \(\mathcal {X} \ne \mathcal {Z}\), our model can be applied with little change. In the following, we let the dimension of both \(\varvec{x} \in \mathcal {X}\) and \(\varvec{z} \in \mathcal {Z}\) be d and the number of elements in each of \(\mathcal {X}\) and \(\mathcal {Z}\) be n.
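
As a concrete illustration (ours; it is not the implementation used in Section 4), the following NumPy sketch builds the Kronecker kernel matrix from two base kernel matrices and checks the factorization of the pair inner product; it also makes the \(O(n^4)\)-entry size of the explicit classical representation visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))          # dataset X (here X = Z, as assumed in the text)
Z = X

K_X = X @ X.T                        # linear base kernel matrices
K_Z = Z @ Z.T
K_pair = np.kron(K_X, K_Z)           # Kronecker kernel matrix, (n^2) x (n^2)

# Pair inner product factorizes: <Pair1|Pair2> = <x1|x2><z1|z2>
x1, z1 = X[0], Z[1]
x2, z2 = X[2], Z[3]
lhs = np.dot(np.kron(x1, z1), np.kron(x2, z2))
rhs = np.dot(x1, x2) * np.dot(z1, z2)
assert np.isclose(lhs, rhs)

print(K_pair.shape)                  # (16, 16): n^2 rows and columns, i.e. O(n^4) entries
```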

2.4 Computational complexity for solving pairwise classification problem with Kronecker kernel

Here, we analyze the computational complexity of the inverse calculation for a kernel matrix, an \(n \times n\) positive semidefinite matrix. To simplify the problem, we discuss the case in which the model is used without an offset term. In classical computing, Gaussian elimination is one of the general methods used to calculate the inverse of a regular matrix, and its complexity is \(O(n^3)\). As a classical approximation method, the conjugate gradient method (Hestenes and Stiefel 1952) is used, whose complexity is \(O(ns \kappa \log (1 / \epsilon ))\) (Dervovic et al. 2018). The conjugate gradient method can be applied when the matrix is real symmetric and s-sparse, meaning that the number of non-zero elements in each row is at most s. The value \(\kappa\) is called the condition number and is defined as the absolute value of the ratio of the largest singular value to the smallest singular value; \(\epsilon\) is the error between the obtained values and the exact values.

In quantum computing, we can apply the HHL algorithm, whose calculation is the bottleneck in the quantum SVM, with \(O(s\kappa \, poly(\log (s\kappa / \epsilon )))\) time complexity (Childs et al. 2017). We achieve this computational complexity by assuming the use of quantum random access memory (QRAM) (Giovannetti et al. 2008), with which we can encode the training and test data in \(O(\log n)\) time. Therefore, when the kernel matrix is sparse, especially when \(s=O(\log n)\), quantum computing can perform the inverse matrix computation exponentially faster in n than classical computing.

For the pairwise classification problem with the Kronecker kernel, the pairwise kernel matrix has \(O(n^2)\) rows and \(O(n^2)\) columns, which results in a large computational complexity. In this case, the computational complexity of the SVM increases to \(O(n^6)\). With the conjugate gradient method, the complexity is \(O(n^2 s^2 \kappa ^2 \log (1 / \epsilon ))\). As for space complexity, if the elements of the Kronecker kernel matrix are stored explicitly, at least \(O(n^4)\) memory is needed. It is thus difficult to prepare memory large enough to store the Kronecker kernel matrix for a large dataset. Applying these algorithms to large datasets is thus difficult due to both time and space complexity.
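
A rough back-of-the-envelope estimate of this space requirement (illustrative numbers only, assuming 8 bytes per matrix entry) makes the point concrete.

```python
# Memory needed to store the dense (n^2 x n^2) Kronecker kernel matrix
# explicitly in double precision: 8 * n^4 bytes.
for n in (100, 1_000, 10_000):
    bytes_needed = 8 * n ** 4
    print(f"n = {n}: {bytes_needed / 1e9:.3g} GB")
# n = 100:   0.8 GB
# n = 1000:  8e+03 GB (8 TB)
# n = 10000: 8e+07 GB (80 PB)
```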

3 Proposed model

We have devised a quantum SVM model with a Kronecker kernel for solving pairwise classification problems. The outline of our model is the same as that of the existing model (Fig. 1), and our model also comprises three parts: train, test, and signed swap test.

First, the initial state \(|0 \rangle |0 \ldots 0 \rangle |0 \ldots 0 \rangle |0 \rangle\) is prepared using \(O(\log n)\) qubits to store the information. Next, a Hadamard gate is applied to the fourth register, and the state becomes \(\frac{1}{\sqrt{2}} |0 \rangle |0 \ldots 0 \rangle |0 \ldots 0 \rangle (|0 \rangle + |1 \rangle )\). The fourth register is used to estimate the sign of the inner product.

The circuit in the training part (Fig. 4) uses the three registers \(|0 \rangle |0 \ldots 0 \rangle |0 \ldots 0 \rangle\), and here we focus only on these three registers. First, amplitude encoding of the training data labels is applied, and the state becomes \(|0 \rangle |0 \ldots 0 \rangle |\varvec{t} \rangle\). Next, the HHL algorithm is applied. In the phase estimation part of the HHL algorithm, controlled-\(e^{i \theta F_{\mathcal {X} \otimes \mathcal {Z}}}\) gates are applied; this calculation is done in the same way as for the controlled-\(e^{i \theta F_{\mathcal {X}}}\) gates in the original quantum SVM model. Here, \(F_{\mathcal {X} \otimes \mathcal {Z}}\) is defined as

$$\begin{aligned} F_{\mathcal {X} \otimes \mathcal {Z}} = \left[ \begin{array}{cc} 0 &{} \varvec{1}^T \\ \varvec{1} &{} K_{\mathcal {X} \otimes \mathcal {Z}}+\gamma ^{-1}I_{n^2} \\ \end{array} \right] = \left[ \begin{array}{cc} 0 &{} \varvec{1}^T \\ \varvec{1} &{} K_{\mathcal {X}} \otimes K_{\mathcal {Z}}+\gamma ^{-1}I_{n^2} \\ \end{array} \right] . \end{aligned}$$
Fig. 4

Outline of the train part in the proposed quantum SVM model. First, the training data labels are encoded as amplitudes in the third register, and the state becomes \(|0 \rangle |0 \rangle |\varvec{t} \rangle\). Applying the HHL algorithm with \(K_{\mathcal {X} \otimes \mathcal {Z}}\) then gives the state \(|1 \rangle |0 \rangle |\varvec{a} \rangle\); the \(|1 \rangle\) in the first register indicates success of the HHL algorithm. Lastly, encoding the training data as amplitudes in the second register gives the state \(|1 \rangle |\varvec{x}, \varvec{z} \rangle |\varvec{a} \rangle\)

By applying the HHL algorithm to the state \(|0 \rangle |0 \ldots 0 \rangle |\varvec{t} \rangle\), the state becomes \(|1\rangle |0\ldots 0 \rangle |\varvec{a}\rangle\), where \(|\varvec{a} \rangle = b |1 \rangle + \sum _{i=1}^{n^2} a_i |i+1 \rangle\). The training data are then encoded as a quantum state so that the state is \(|1 \rangle |\varvec{x}, \varvec{z} \rangle |\varvec{a} \rangle\), where \(|\varvec{x}, \varvec{z} \rangle = |\varvec{x} \rangle |\varvec{z} \rangle\). We define this state as \(|\psi _{train} \rangle\); at this point, the overall quantum state is \(\frac{1}{\sqrt{2}} (|\psi _{train} \rangle |0 \rangle + |0 \rangle |0 \ldots 0 \rangle |0 \ldots 0 \rangle |1 \rangle )\).
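
The reason the tensor product of the two data registers suffices is that the amplitude vector of \(|\varvec{x}, \varvec{z} \rangle = |\varvec{x} \rangle |\varvec{z} \rangle\) is exactly the Kronecker product of the normalized data vectors, as the following small check (ours, illustrative only) shows; no Kronecker product ever needs to be stored explicitly.

```python
import numpy as np

def amplitude_encode(v):
    """Normalized amplitude vector |v> = (1/sqrt(C)) sum_i v_i |i>."""
    return v / np.linalg.norm(v)

x = np.array([3.0, 4.0])             # d = 2: one qubit per data register
z = np.array([1.0, 1.0])

pair_state = np.kron(amplitude_encode(x), amplitude_encode(z))   # amplitudes of |x>|z>
print(pair_state)                                                # 2-qubit pair state
print(np.isclose(np.linalg.norm(pair_state), 1.0))               # still a unit vector
```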

The circuit in the test part (Fig. 5) also uses the three registers \(|0 \rangle |0 \ldots 0 \rangle |0 \ldots 0 \rangle\). The test data are encoded as a quantum state and superposed equally. The state is then \(\frac{1}{\sqrt{n^2+1}} \sum _{i=1}^{n^2+1} |1 \rangle |\tilde{\varvec{x}}, \tilde{\varvec{z}} \rangle |{i} \rangle\). We define this state as \(|\psi _{test} \rangle\); at this point, the overall quantum state is \(\frac{1}{\sqrt{2}} (|\psi _{train} \rangle |0 \rangle + |\psi _{test} \rangle |1 \rangle )\). After this, the inner product \(\langle \psi _{train} |\psi _{test} \rangle\) is calculated. The evaluation is done using the signed swap test described in Section 2.1.3, and finally the prediction is obtained.

Fig. 5

Outline of the test part in the proposed quantum SVM model. An X gate is applied to the first register to match the \(|1 \rangle\) in the first register of the train part. Amplitude encoding in the second register encodes the test data. A Hadamard gate on the third register creates the superposition state used to evaluate the inner product

The complexity of the proposed model is dominated by its bottleneck, the HHL algorithm, so changing the implementation of the HHL algorithm affects the total complexity of our model. The time complexity of the fastest HHL algorithm is \(O(s \kappa \, poly(\log (s \kappa / \epsilon )))\). Therefore, if the kernel matrix has \(O(n^2)\) rows and \(O(n^2)\) columns, the complexity of the fastest HHL algorithm is \(O(s^2 \kappa ^2 \, poly(\log (s \kappa / \epsilon )))\). This is better than the \(O(n^2 s^2 \kappa ^2 \log (1 / \epsilon ))\) of the conjugate gradient method, the fastest classical matrix inversion algorithm. In particular, in the case \(s, \kappa , 1/\epsilon = O(poly (\log n))\), i.e., when \(s, \kappa\), and \(1/\epsilon\) grow only polylogarithmically in n, the proposed algorithm achieves exponential acceleration compared to classical algorithms (Dervovic et al. 2018). This implies the possibility of a quantum advantage in time complexity for datasets whose kernel matrix is sparse. In this case, the time complexity of the proposed quantum model for pairwise classification is \(O(poly (\log n))\), against \(O(n^2 \, poly (\log n))\) for the existing classical model. This is a larger reduction than in the case of binary classification, where \(O(n \, poly (\log n))\) in the classical model is reduced to \(O(poly (\log n))\) in the quantum model.

We use \(O(\log n)\) qubits to store the training and test data, a constant number of qubits for the binary expression of the eigenvalues in the HHL algorithm, and some qubits as ancilla qubits. Therefore, if only \(O(\log n)\) qubits are needed for the binary expression of the eigenvalues of \(K_{\mathcal {X}} \otimes K_{\mathcal {Z}}\), the space complexity of our model is \(O(\log n)\), smaller than the \(O(n^4)\) of the classical methods. In terms of space complexity, our model can thus be applied to large datasets.

4 Experiment

4.1 Setup

In an experiment, we performed pairwise classification using a quantum SVM with a Kronecker kernel and a classical LS-SVM with a Kronecker kernel, both on the same dataset. For the Kronecker kernel, we use two different base kernels: the linear kernel \(k(\varvec{x},\varvec{z}) = \langle \varvec{x}|\varvec{z}\rangle\) and the quadratic kernel (a polynomial kernel with degree 2 and no bias term) \(k(\varvec{x},\varvec{z}) = \langle \varvec{x}|\varvec{z}\rangle ^2\). The results were used to determine whether the quantum SVM with a Kronecker kernel has the same classification rate as the classical LS-SVM. We use Gaussian elimination to calculate the inverse matrix in the classical LS-SVM model.

In terms of hardware, the classical and quantum methods use the same computer, a WA9A-G200/WT. For the execution of quantum circuits, we use the simulator (qasm_simulator) in Qiskit, a quantum computing framework provided by IBM. Note that we did not use an actual quantum computer but rather a simulator because of the size of the circuit needed in the experiment. An actual quantum computer without an error correction system would not be able to provide accurate calculation results because of its high error rate, which causes unintended changes in the quantum state and measurement results. Measurement results are probabilistic, so each prediction result is determined from the results of 8192 runs.

In the experiment, we used the basic OCR image dataset, in which each example contains an image of a digit, used in a previous study (Li et al. 2015). Each image in the dataset has size \(128 \times 128\), and each pixel takes one of two values, 0 or 1, corresponding to white and black. Since this dataset is too large to process directly on a simulator, we used features calculated from the data as input instead of the original image data. Through feature extraction, we used two normalized values instead of the raw image input. The extracted feature vector, not the raw image input, was also used for the classical model.

First, we calculated the horizontal ratio (HR) and vertical ratio (VR) of pixels. HR is the ratio of the number of black pixels in the upper half of the image to the number of black pixels in the lower half of the image, and VR is the ratio of the number of black pixels in the left half of the image to the number of black pixels in the right half of the image. Then, we calculate the elements of the feature vector, \(v'_1\) and \(v'_2\), by the linear mapping \(v'_1 = HR \times 1.3 - 0.62\) and \(v'_2 = VR \times 0.95 - 0.42\). Next, \(v'_1\) and \(v'_2\) are scaled to \(v_1\) and \(v_2\) so that \((v_1)^2 + (v_2)^2 = 1\) holds. This normalized feature vector \((v_1, v_2)\) is used as the input for the training and test data. These definitions of the feature vector follow the definitions used in Li et al. (2015) and Yang et al. (2019).
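
For reference, the feature extraction can be written as a short Python sketch (ours, illustrative only; it assumes the image is given as a \(128 \times 128\) binary NumPy array with 1 denoting a black pixel and that each half of the image contains at least one black pixel).

```python
import numpy as np

def extract_feature(img):
    """Map a 128 x 128 binary image (1 = black) to a normalized 2D feature vector."""
    h, w = img.shape
    upper, lower = img[: h // 2].sum(), img[h // 2 :].sum()
    left, right = img[:, : w // 2].sum(), img[:, w // 2 :].sum()
    hr = upper / lower                            # horizontal ratio (HR)
    vr = left / right                             # vertical ratio (VR)
    v1, v2 = hr * 1.3 - 0.62, vr * 0.95 - 0.42    # linear mapping
    v = np.array([v1, v2])
    return v / np.linalg.norm(v)                  # scale so that v1^2 + v2^2 = 1
```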

We use only the images of 6 and 9. Two images (an image of 6 and an image of 9) in the training dataset were used, so four pairs comprised the training dataset. Twenty images (ten images of 6 and ten images of 9) were prepared for the test dataset, so we tested 400 pairs in total. In this experiment, we define the pairwise relationship between two digits as whether the two numbers are the same or not. For example, we assign the value \(+1\) for the relationship between two images of 6 and the value \(-1\) for the relationship between an image of 6 and an image of 9.

In the experiment, we use one classical model and three quantum models, which vary in circuit size, and set \(\gamma = 2^{-5},2^{-4},\ldots ,2^9\) to investigate the relationship with this hyperparameter.

4.2 Implementation

Our implementation follows Li et al. (2015) in many respects but has some changes, so we mainly note the differences between the models. The general circuit of our model is shown in Fig. 6, which contains four registers.

Fig. 6

Outline of the implementation of our proposed model. In our experiment, the second register has m qubits, and the third register has 2 qubits. First, a Hadamard gate is applied to the fourth register to create a superposition. Then, the unitary circuits \(U_{train}\) and \(U_{test}\) are applied, controlled by the state of the fourth register. Next, an X gate is applied to the qubit in the first register for the evaluation of the inner product. Lastly, the Hadamard gate is applied to the fourth register again, and the states of the first and fourth registers are measured

The model uses \(m+4\) qubits in total, where m is the number of qubits used for binary encoding of eigenvalues in quantum phase estimation (QPE). In the implementation, we omit the bias term b for simplicity of implementation. The detail of \(U_{train}\) is shown in Fig. 7.

Fig. 7

Outline of the train part of the implementation of our proposed model. The training data labels are encoded at the beginning, the HHL algorithm is applied, and then the training data are encoded in the second register. In the training data encoding part, the data are encoded based on the index of the third register. We remark that no measurement operation is applied in the train part, to avoid collapse of the superposition needed for the signed swap test

In the implementation model, the \(P(\theta )\) and \(Ry(\theta )\) gates are defined as

$$\begin{aligned} P(\theta ) =\left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} e^{i \theta } \\ \end{array} \right] , Ry(\theta ) = \left[ \begin{array}{cc} \cos \frac{\theta }{2} &{} -\sin \frac{\theta }{2} \\ \sin \frac{\theta }{2} &{} \cos \frac{\theta }{2} \\ \end{array} \right] , \end{aligned}$$

and the \(A^{\dagger }\) gate is defined as the Hermitian conjugate of the A gate; since A is unitary, \(AA^{\dagger } = A^{\dagger }A = I\). First, the training data labels are encoded into the amplitudes of the third register. After that, the state of the third register is \(\frac{1}{2} \sum _{i=0}^{3} t_i |i \rangle\). Then, QPE is applied to write the eigenvalues of \(F_{\mathcal {X} \otimes \mathcal {Z}}\) into the second register in the form of a binary expression. Next, using the state of the second register, the inverses of the eigenvalues are written into the amplitudes using the Ry (rotation around the y-axis) gate. After that, uncomputation of the QPE is performed, and lastly the training data are encoded into the qubits in the second register. At this point, we encode the data themselves, not the data labels. In the implementation, we separate the controlled-\(e^{i \theta F_{\mathcal {X} \otimes \mathcal {Z}}}\) gates into two parts, controlled-\(e^{i \theta \gamma ^{-1}I}\) gates and controlled-\(e^{i \theta K_{\mathcal {X} \otimes \mathcal {Z}}}\) gates, and we use the phase gate to express the controlled-\(e^{i \theta \gamma ^{-1}I}\) gates. The detail of \(U_{test}\) is shown in Fig. 8.
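
For reference, the following sketch (ours, illustrative only; it is not the circuit construction used in the experiment) gives the matrices of the \(P(\theta )\) and \(Ry(\theta )\) gates and uses scipy.linalg.expm as a classical stand-in to show what the controlled-\(e^{i \theta K_{\mathcal {X} \otimes \mathcal {Z}}}\) evolution is supposed to implement.

```python
import numpy as np
from scipy.linalg import expm

def P(theta):
    # Phase gate, used here to express controlled-exp(i * theta * gamma^{-1} I)
    return np.array([[1.0, 0.0], [0.0, np.exp(1j * theta)]])

def Ry(theta):
    # Rotation around the y-axis, used to write the inverse eigenvalue into an amplitude
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# Classical reference for the unitary that QPE needs: exp(i * theta * K)
K = np.array([[1.0, 0.5], [0.5, 1.0]])          # toy Hermitian kernel matrix
theta = 0.3
U = expm(1j * theta * K)
print(np.allclose(U @ U.conj().T, np.eye(2)))   # unitary, as required for a gate
```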

Fig. 8

Outline of the test part of the implementation of our proposed model. The test data are encoded in the second register (the first register in this figure), and a superposition is created in the third register (the second in this figure)

After applying \(U_{train}\) and \(U_{test}\), the signed swap test is needed to estimate the inner product, but preparing the quantum state \(\frac{1}{\sqrt{2}}(|0 \rangle |\psi _{train} \rangle + |1 \rangle |\psi _{test} \rangle )\) is difficult because the measurement performed in the HHL algorithm breaks the superposition. However, in this problem setting, the inner product can in fact be evaluated from the measurement results in the same way as if the superposition \(\frac{1}{\sqrt{2}}(|0 \rangle |\psi _{train} \rangle + |1 \rangle |\psi _{test} \rangle )\) had been obtained (Appendix 1). The quadratic kernel is implemented by replacing \(|\varvec{x} \rangle\) with \(|\varvec{x} \rangle |\varvec{x} \rangle\) and \(|\varvec{z} \rangle\) with \(|\varvec{z} \rangle |\varvec{z} \rangle\) in the linear kernel. We conducted experiments with the quadratic kernel only for \(m=4\) and 6 because at least 4 qubits are needed to encode the data for the quadratic kernel.

5 Results

Results are shown in Figs. 9 and 10. For the proposed model, the experimental results show that the accuracy is sensitive to the value of \(\gamma\). In the case of \(\gamma = 1\), the proposed model achieves an accuracy of 1.0 for all m values. On the other hand, if the value of \(\gamma\) is small or large, the accuracy of the proposed model decreases, while the accuracy of the classical model remains high. When the quadratic kernel is used, the accuracy of the proposed model becomes more stable; in particular, the accuracy for small and large values of \(\gamma\) is significantly improved. For the classical model, the accuracy is not sensitive to \(\gamma\) with either the linear kernel or the quadratic kernel. The accuracy of the classical model is 1.0 in all cases, which means that it achieves perfect classification on this dataset. In addition, the accuracy of the classical model does not change with the value of \(\gamma\), which differs from the tendency of the proposed quantum model.

We suppose this phenomenon is caused by the limitations on the representation of the eigenvalues in the quantum SVM. Eigenvalues are computed precisely in the classical model, but not necessarily in the quantum SVM. In the quantum phase estimation subroutine of the quantum SVM, all eigenvalues of the kernel matrix, which is a Hermitian matrix, are assumed to lie in the range [0, 1) in order to be computed precisely. In fact, when the value of \(\gamma\) is too small, the eigenvalues of \(K_{\mathcal {X} \otimes \mathcal {Z}} + \gamma ^{-1} I\) become so large (close to \(\gamma ^{-1}\)) that they exceed the range [0, 1), which causes inaccurate calculation results. When the value of \(\gamma\) is too large, the eigenvalues of \(K_{\mathcal {X} \otimes \mathcal {Z}} + \gamma ^{-1} I\) are close to those of \(K_{\mathcal {X} \otimes \mathcal {Z}}\), which are approximately (0.561, 0.188, 0.188, 0.063). In the quantum SVM, the eigenvalues of \(K_{\mathcal {X} \otimes \mathcal {Z}}/2^m\) are binary encoded using m qubits; for example, 0.625 is expressed as 0.1010 when \(m = 4\). This encoding causes a loss of significant digits because the eigenvalues are expressed in increments of \(1/2^m\), whereas the maximum eigenvalue of \(K_{\mathcal {X} \otimes \mathcal {Z}}/2^m\) is only about \(0.561/2^m\). This explains why the accuracy is insensitive to \(\gamma\) in the classical SVM but sensitive in the quantum SVM.

The execution time per datum is seen to be proportional to the hyperparameter m, which seems natural because the depth of the quantum circuit in the proposed model increases in proportion to m. The execution time does not vary with the value of \(\gamma\), and the execution time of the classical model is more than a thousand times shorter than that of the proposed model executed on the quantum simulator. We suppose that the advantage of the quantum method is retained theoretically. The main purpose of the experiment with small data and the quantum simulator is to confirm implementability; the advantage of the quantum method will become apparent when we apply the method to large data on a real quantum device in the future.
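
To make the precision argument above concrete, a few lines of arithmetic (ours, illustrative only, following the encoding as described above) show the effect of rounding a value to the nearest m-bit binary fraction, as QPE effectively does.

```python
def to_binary_fraction(x, m):
    """Round x in [0, 1) to the nearest m-bit binary fraction j / 2^m."""
    j = round(x * 2 ** m)
    return j / 2 ** m, format(j, f"0{m}b")

print(to_binary_fraction(0.625, 4))           # (0.625, '1010'): exactly representable
print(to_binary_fraction(0.561 / 2 ** 4, 4))  # coarse: nearest representable value is 1/16
```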

Fig. 9

The accuracy with the classical model and the proposed model for different values of \(\gamma\). The accuracy of the proposed model is sensitive to the value of \(\gamma\) while that of the classical model is insensitive

Fig. 10

Execution time with the classical model and the proposed model for different values of \(\gamma\). The execution time is insensitive to the value of \(\gamma\). The execution time of the proposed model is proportional to the value of m, which controls the size of the quantum circuit in the model

6 Discussion

We investigated the theoretical speedup of pairwise learning obtained using the Kronecker kernel and showed that the accuracy of a model with a Kronecker kernel can approach that of the classical one. However, it is still necessary to investigate whether an actual quantum device has performance comparable to that of conventional devices. From the results and findings obtained through this study, we present some discussion related to the performance of quantum machine learning.

One such discussion concerns the kernel expression. Although making controlled-\(e^{i \theta F_{\mathcal {X} \otimes \mathcal {Z}}}\) gates is difficult in general, there is an algorithm that approximates such a gate, known as density matrix exponentiation, used in Lloyd et al. (2014). However, applying this algorithm to our model seems difficult because density matrix exponentiation needs to take traces many times, and the superposition then collapses under measurement. One way to solve this problem would be an algorithm that creates the quantum state \(\frac{1}{\sqrt{2}} (|0 \rangle |\psi _1 \rangle + |1 \rangle |\psi _2 \rangle )\) from the quantum state \(|\psi _1 \rangle |\psi _2 \rangle\); using such an algorithm, we could make the state \(\frac{1}{\sqrt{2}} (|0 \rangle |\psi _{train} \rangle + |1 \rangle |\psi _{test} \rangle )\) from the state \(|\psi _{train} \rangle |\psi _{test} \rangle\), which can be prepared easily. However, we have not found such an algorithm as far as we have investigated. Therefore, finding such an algorithm (or proving that no such algorithm exists) would improve our understanding of quantum machine learning.

We should also focus on the complexity of adding the bias term b to the learning model. The matrix \(F_{\mathcal {X} \otimes \mathcal {Z}}\) is an \((n^2+1) \times (n^2+1)\) matrix, and preparing such a matrix as a quantum state (in the form of a density matrix) is complicated compared to the case without a bias term. Finding an effective implementation for preparing such a matrix would improve the usability of the quantum SVM model.

We should also consider the performance of other kinds of pairwise kernels, such as the Cartesian kernel proposed in Kashima et al. (2010). Identifying which types of models benefit from quantum computation is very important as a step toward developing quantum machine learning. We assumed that the Kronecker product can be computed faster in quantum computing, but more detailed research is needed on the kinds of machine learning models and problems to which this approach can be applied. In addition, the size of the problems that can be handled by quantum computers is currently quite limited, so evaluating learning performance experimentally requires defining appropriate problems.

7 Conclusion

We presented a model for applying a quantum support vector machine with a Kronecker kernel to pairwise classification. We theoretically demonstrated that the proposed model is faster and more scalable than the conventional models. In particular, when the kernel matrix is guaranteed to be very sparse, the proposed model can be exponentially faster than the existing models. Under this specific condition, our model works in \(O(poly (\log n))\) time for training data of size n, while the existing classical model works in \(O(n^2 \, poly (\log n))\) time. This reduction has a larger impact than the reduction from \(O(n \, poly (\log n))\) to \(O(poly (\log n))\) achieved by the standard quantum SVM. In addition, we experimentally demonstrated that the accuracy of a model with a Kronecker kernel can approach that of the classical one. As for space complexity, the proposed model requires a very small amount of space compared to the existing models. This connection of pairwise learning with quantum machine learning is a first step toward developing a quantum pairwise learning domain.