1 Introduction

With low-depth quantum circuits becoming a reality, interest in devising applications for these physical units has much increased. One fast-developing direction, which already forms a sub-discipline of Quantum Machine Learning (Dunjko and Briegel 2018; Biamonte et al. 2017), is devising methods for addressing problems of classical machine learning (ML) with variational quantum circuits (VQCs) (Cerezo et al. 2021). These quantum circuits have adjustable angles in their gates which can be trained in a fashion analogous to neural networks (Schuld et al. 2020; Farhi and Neven 2018; Pérez-Salinas et al. 2020; Havlíček et al. 2019). On a formal level, though, the mathematical analogy of VQCs with neural networks is far from straightforward, mainly due to the reversibility of VQCs, and the problem of the quantum neuron is usually approached with more intricate models (Schuld et al. 2014; Tacchino et al. 2019; Wan et al. 2017; Cao et al. 2017; Verdon et al. 2017; Torrontegui and García-Ripoll 2019). In addition to neural networks, VQCs show similarities with classical kernel machines (Schuld and Killoran 2019; Schuld 2021; Havlíček et al. 2019) by generating a feature map of classical data in the Hilbert space. In general, the interpretation and most profitable use of VQCs in ML tasks remains an open topic of discussion, including the accurate evaluation of their capacity and their potential advantages compared to classical models.

This work aims to contribute to the question of whether quantum circuits are suitable for solving ML tasks and how the increasing dimension of the Hilbert space can be exploited for this purpose. There are two paths to follow: one is to employ n entangled qubits, achieving an exponential increase of the space; the other, less investigated, path is to employ qudits. For a single qudit, the dimension of the Hilbert space increases linearly with d and without requiring entangling operations, which remain demanding on a practical level. Our quantum toy model consists of a single qudit operated by a low-depth quantum circuit (which we call a single layer). With these limited resources, we are able to show that, with a proper encoding and adjustment of d with respect to the dimension of the input, one may achieve twice the lossless memory (LM) dimension (Friedland and Krell 2018) of conventional single-layer neural networks (NN) possessing the same number of trainable parameters. This effect cannot be achieved in the absence of parameters in the encoding phase, which control the feature map on the Bloch hyper-sphere.

Going one step further, while keeping the input dimension fixed, the capacity of the quantum system (in the sense of the LM dimension) can be further increased either by re-uploading the data (Pérez-Salinas et al. 2020; Wach et al. 2023), thereby introducing more depth into the quantum circuit, or, alternatively, as we propose in this work, by using higher-dimensional quantum systems, i.e., by increasing d. We obtain preliminary evidence that the two methods give comparable results, and therefore the selection should be made depending on the available resources.

The structure of the manuscript is as follows. We start by introducing qudits, first as physical entities in the lab and then as mathematical objects, i.e., as vectors on the Bloch hyper-sphere. Based on this representation, we develop a general scheme for mapping the data onto its surface and rotating them. We then evaluate different encoding-rotation models according to their LM dimension and draw conclusions on the optimal methodology. We illustrate the efficiency of the qubit and qutrit models by applying them to standard classification problems involving both synthetic and real-world data.

2 Qudits

A qudit is the state of a d-level quantum system, just as a qubit describes a quantum 2-level system. The quantum degrees of freedom and physical systems habitually used as qubits are, with appropriate modifications, suitable to accommodate qudits in most cases. The theoretical interest in qudits (Wang et al. 2020) has driven a considerable number of experimental results on the generation and manipulation of single and entangled qudits. While neutral multilevel Rydberg atoms (Deller et al. 2023; Weggemans et al. 2022) or molecular magnets seem the most natural candidates for a qudit’s realization, the majority of qudit demonstrations in the context of quantum information have been achieved with photons (Lapkiewicz et al. 2011; Erhard et al. 2018; Luo et al. 2019; Kues et al. 2017; Imany et al. 2019). There the qudit’s state is realized by a single photon superposed over d modes, which can be spatial, time, frequency, or orbital-angular-momentum modes. With integrated photonics, on-chip generation of two entangled qudits has been reported for \(d=10\) (Kues et al. 2017) and \(d=32\) (Imany et al. 2019), a good promise for the scalability of photonic platforms with d. Superconducting circuits are yet another platform, accommodating transmon qutrits (Fedorov et al. 2012; Kunzhe et al. 2018; Blok et al. 2021), and recently a circuit of five qutrits able to perform teleportation has been demonstrated (Blok et al. 2021). As the authors of this work state (Blok et al. 2021), these circuits could possibly be extended from qutrits to qudits by leveraging the intrinsic anharmonicity of transmons. Finally, trapped chains of \(^{40}Ca^+\) ions, with their natural multilevel structure, have been tested (Ringbauer et al. 2022) as a qudit platform with d up to 7.

Circuits made of qubits require entangling operations in order to exploit the exponentially large space. Two-qubit operations define the complexity of a circuit and experimentally form its most challenging part, demanding either a controllable interaction between qubits or, in the case of quantum optical circuits, nonlinearities whose effect is weak. In contrast, in this work we build our model on single qudits, which only require coherent manipulation, achievable over its full range in most of the aforementioned qudit platforms. For transmon qutrits, single-qutrit gates forming a universal set have been demonstrated (Blok et al. 2021) with high fidelity, and in addition the readout of the states can be performed with high accuracy. Similarly, for ion qudits (Ringbauer et al. 2022), full single-qudit control has been achieved for \(d=3\) and 5. Concerning Rydberg atoms, there is extensive experimental knowledge on manipulating their states via laser fields; see for instance (Weggemans et al. 2022) for specific prescriptions leading to full control of a fermionic strontium \(^{87}Sr\) qudit with \(d=10\). Even though the highest d at the moment is achieved in integrated photonics platforms, there the full set of operations that our model demands is less obvious. However, recent results (Kues et al. 2017) demonstrating gates on frequency modes, achieved using programmable filters and an electro-optic phase modulator, are highly promising and open new pathways for the applications of qudits.

2.1 The Bloch hyper-sphere of a qudit

As a mathematical entity, a qudit state “lives” in the d-dimensional Hilbert space spanned by the eigenstates of the Hamiltonian of the system. Let us denote by \(\left\{ \left| k\right\rangle \right\} _{k=0}^{d-1}\) such a set of normalized eigenstates. Then one can express a generic qudit state

$$\begin{aligned} \left| \psi \right\rangle =\sum _{k=0}^{d-1} c_k \left| k\right\rangle \end{aligned}$$
(1)

by d complex amplitudes \(c_k\) over this basis, constrained by the normalization condition \( \sum _{k=0}^{d-1} \left| c_k\right| ^2 =1\).

We assume the full su(d) algebra for the system, spanned by \(d^2-1\) generators \(\left\{ \hat{g}_i\right\} \) that can be chosen to be orthogonal with respect to the Hilbert-Schmidt product, such that \(Tr\left( \hat{g}_i^\dagger \hat{g}_j\right) = G \delta _{i,j}\) with G a positive constant. For \(d=2\) these generators can be identified with the Pauli operators (\(G=2\)), while for \(d=3\) and \(G=2\) with the Gell-Mann operators (see the Appendix). Extending the set \(\left\{ \hat{g}_i\right\} \) by the element \(g_0= \sqrt{\frac{G}{d}} \hat{1}\), the generators of the algebra form a basis in the Hilbert-Schmidt space of Hermitian operators, so that any observable \(\hat{H}\) of the qudit can be written as

$$\begin{aligned} \hat{H}= \sum _{m=0}^{d^2-1} h_m \hat{g}_m= h_0 \hat{g}_0+\phi \sum _{m=1}^{d^2-1} n_m \hat{g}_m = h_0 \hat{g}_0+ \phi ~\textbf{n}\cdot \mathbf {\hat{g}} \end{aligned}$$
(2)

with \(h_m = Tr\left( \hat{g}_m^\dagger \hat{H}\right) /G \in \Re \), \(\textbf{n}= \left\{ n_1, n_2, \ldots , n_{d^2-1}\right\} \) a normalized real vector, and \(\phi \) an angle.
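As a concrete check of the decomposition (2), the following sketch (Python with NumPy; the observable H and all variable names are illustrative choices of ours) extracts the coefficients \(h_m\), the angle \(\phi \), and the unit vector \(\textbf{n}\) for the qubit case \(d=2\), \(G=2\):

```python
# Numerical check of Eq. (2) for d = 2 (Pauli basis, G = 2).
import numpy as np

g = [np.array([[0, 1], [1, 0]], dtype=complex),     # g_1 = sigma_x
     np.array([[0, -1j], [1j, 0]], dtype=complex),  # g_2 = sigma_y
     np.array([[1, 0], [0, -1]], dtype=complex)]    # g_3 = sigma_z
g0 = np.eye(2, dtype=complex)                       # g_0 = sqrt(G/d)*1 = 1 here
G = 2.0

H = np.array([[0.3, 0.5 - 0.2j], [0.5 + 0.2j, -0.1]])        # any Hermitian matrix
h = [np.trace(gm.conj().T @ H).real / G for gm in [g0] + g]  # h_m = Tr(g_m^† H)/G

phi = np.linalg.norm(h[1:])                         # angle phi
n = np.array(h[1:]) / phi                           # normalized real vector n
H_rec = h[0] * g0 + phi * sum(nm * gm for nm, gm in zip(n, g))
assert np.allclose(H, H_rec)                        # Eq. (2) reproduces H
```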

The density operator \(\hat{\rho }\) of a pure state \(\left| \psi \right\rangle \), \(\hat{\rho }=\left| \psi \right\rangle \left\langle \psi \right| \), being a positive semi-definite Hermitian matrix with \(Tr(\hat{\rho })=1\), can also be decomposed on the basis of the generators as

$$\begin{aligned} \hat{\rho }=\frac{1}{\sqrt{Gd}}\hat{g}_0 + \sum _{m=1}^{d^2-1} r_m \hat{g}_m \end{aligned}$$
(3)

with \(r_m = Tr\left( \hat{g}_m^\dagger \hat{\rho }\right) /G\) and \(\textbf{r}= \left\{ r_1, r_2, \ldots , r_{d^2-1}\right\} \) proportional, by a 1/G factor, to the unit-length Bloch vector living on the \((d^2-2)\)-dimensional surface of the so-called Bloch hyper-sphere. For completeness, we note here that pure states occupy only a sub-manifold of this surface, of dimension \(2(d-1)\), while the rest of the surface corresponds to non-positive Hermitian matrices. Mixed states correspond to vectors inside the Bloch hyper-sphere.

Furthermore, since any unitary operation \(\hat{U}\) is generated by a Hermitian matrix \(\hat{H}\) as \(\hat{U}=e^{i \hat{H}}\), in view of the decomposition (2) one can rewrite \(\hat{U}=e^{i \phi ~\textbf{n}\cdot \mathbf {\hat{g}}}\) up to a phase factor. The latter expression leads (with some extra work) to the interpretation of a unitary operation acting on a pure state, \( \hat{U}\left| \psi \right\rangle \) (or \( \hat{U}\hat{\rho }\hat{U}^{\dagger }\)), as a rotation of the Bloch vector around the \(\textbf{n}\)-axis by an angle proportional to \(\phi \). One can also see that the most general unitary operation \(U_{\textbf{n}}\left( \phi \right) \) is parameterized by \(d^2-1\) real parameters.

Measurable quantities of a qudit are described by Hermitian operators which, again in view of the decomposition presented in Eq. (2), define a direction on the Bloch hyper-sphere. In addition, the d eigenvectors of an observable, corresponding to d different real measurement outcomes, i.e., eigenvalues, are mutually orthogonal and offer a separation of the Bloch hyper-surface into d adjacent segments of equal area, in the absence of degeneracies.

3 Employing qudits for supervised classification tasks

Let us consider classical data consisting of n k-dimensional feature vectors \(\left\{ \textbf{x}\right\} \), i.e., \(\textbf{x}=\left\{ x_1, x_2,\ldots x_k\right\} \). Every data point belongs to one of M classes. A random subset of the data composed of l elements (\(l<n\)), \(\left\{ \textbf{x}\right\} _l\), is picked as the training set.

3.1 Quantum resources

For this problem, the required resource is a single qudit with \(d^2-1\ge k\), the difference \(d^2-1-k\) increasing with the complexity of the task. One should be able to perform the full SU(d) group of operations on the qudit and, in addition, to measure a single observable \(\hat{O}\). For simplicity, we assume its spectrum to be non-degenerate, yielding d distinct measurement outcomes. Since the classification is based on mean values of measurement outcomes, one should be able to repeat the experiment under identical conditions multiple times.

Table 1 Qubit models

3.2 Encoding classical data

Let us now introduce \(P=S+W\) adjustable weights that we separate into two groups: \(\textbf{s}=\left\{ s_1,\ldots s_{S}\right\} \) and \(\textbf{w}=\left\{ w_1,\ldots w_{W}\right\} \), with \(S, W \le d^2-1\).

In the first part, the encoding phase, the classical data, i.e., the elements of the vector \(\textbf{x}_i\), together with the adjustable weights \(\textbf{s}\), are “uploaded” onto the qudit, which is initially in its ground state:

$$\begin{aligned} \left| \psi _{\textbf{x}, \textbf{s} }\right\rangle = \exp \left[ i x_1\sum _{j\in A_1} s_j \hat{g}_j+ \ldots + i x_k\sum _{j\in A_k} s_j \hat{g}_j \right] \left| 0\right\rangle ~~, \end{aligned}$$
(4)

where the index sets \(A_j\) define a grouping of the generators, with \(A_j\cap A_k=\emptyset \) for \(j\ne k\) being a suggested condition. With \(\left| 0\right\rangle \) we denote the ground state of the qudit. Overall, the angles and axes of rotation of the initial vector \(\left| 0\right\rangle \) are related to both the classical data and the adjustable weights \(\textbf{s}\) in an intricate way, and the result of such an encoding is a map from the Cartesian space in which the inputs are initially described (the k-dimensional real vector \(\textbf{x}\)) onto the surface of the Bloch hyper-sphere embedded in the \((d^2-1)\)-dimensional space. Given the requirement \(k\le d^2-1\), we actually map the data onto a higher-dimensional feature space characterized by the kernel

$$\begin{aligned} \left| \left\langle \psi _{\textbf{x}_2, \textbf{s} } \right. \left| \psi _{\textbf{x}_1, \textbf{s} }\right\rangle \right| ^2~~. \end{aligned}$$
(5)

In the Appendix, we provide an explicit expression for the simplest kernel employed in this work, namely that of qubit model A (see Table 1). Contrary to the usual rotation encoding, which consists of successive rotations around orthogonal directions of the Bloch sphere and results in cosine kernels, the “combined” encoding of Eq. (4) results in more intricate kernels. Naturally, the complexity of these kernels increases with d and k.
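To make the feature map concrete, here is a minimal sketch of the fidelity kernel of Eq. (5) for a single qubit. The exact generator grouping of model A is given in Table 1 and the Appendix; the encoding below, \(\exp [i s(x_1\hat{\sigma }_x + x_2\hat{\sigma }_y)]\left| 0\right\rangle \), is an assumed illustrative instance of Eq. (4), not necessarily model A itself:

```python
# Minimal sketch of the kernel of Eq. (5) for an assumed qubit encoding.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)

def encode(x, s):
    """Map a 2D input x onto the Bloch sphere, Eq. (4)."""
    return expm(1j * s * (x[0] * sx + x[1] * sy)) @ ket0

def kernel(x1, x2, s):
    """|<psi_{x2,s}|psi_{x1,s}>|^2, Eq. (5)."""
    return abs(np.vdot(encode(x2, s), encode(x1, s))) ** 2
```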

3.3 Rotating and measuring

After mapping the data onto the hyper-sphere, they are to be separated into M groups. A projective measurement of the observable \(\hat{O}\) provides, with a probability that depends on the state of Eq. (4), an outcome among the d values of its spectrum \(\left\{ o_1,\ldots ,o_d\right\} \) (arranged in increasing order).

We make the usual assumption that the whole procedure can be repeated many times in an identical way and use the mean value \(\left\langle \hat{O}\right\rangle \), which lies in the interval \(\left[ o_1,o_d\right] \), to classify the data by dividing this interval (equally or unequally) into M segments, i.e., \(\left[ o_1,y_1\right] ,~\left[ y_1,y_2\right] \ldots \left[ y_{M-1},o_d\right] \). To obtain optimum results, though, one should be able to rotate \(\hat{O}\) in order to “match” its orientation with that of the mapped data on the hyper-surface. Alternatively, one can keep \(\hat{O}\) intact and rotate \(\left| \psi _{\textbf{x}_l, \textbf{s} }\right\rangle \). So, in this stage, one applies arbitrary rotations to the state vector carrying the classical information, yielding

$$\begin{aligned} \left| \psi _{\textbf{x}_l, \textbf{s},\textbf{w} }\right\rangle = \exp \left[ i \sum _{j= 1}^{d^2-1} w_j \hat{g}_j\right] \left| \psi _{\textbf{x}_l, \textbf{s} }\right\rangle , \end{aligned}$$
(6)

and measures \(\hat{O}\). Let us note that it is not always profitable, in terms of capacity, to keep all the weights \(w_j\) in Eq. (6); some should be ignored or set to zero so that \(W\approx S\).

The whole “encode-rotate-measure” scheme is repeated many times under identical conditions until the mean value of the measurement

$$\begin{aligned} \left\langle \hat{O} \right\rangle _{\textbf{x}_l, \textbf{s},\textbf{w}}=\left\langle \psi _{\textbf{x}_l, \textbf{s},\textbf{w} } \right| \hat{O}\left| \psi _{\textbf{x}_l, \textbf{s},\textbf{w} }\right\rangle \end{aligned}$$
(7)

is obtained, which classifies the data point \(\textbf{x}_l\) according to the chosen segmentation \(\left\{ y_i\right\} \) of the interval \(\left[ o_1,o_d\right] \) of mean values. The values \(y_i\) can also be made adjustable, in the same way that the threshold values of perceptrons in neural networks can be variable and optimizable.

One may summarize the total scheme as \(\left| 0\right\rangle \xrightarrow {\text {encode}\,(\textbf{x},\textbf{s})} \left| \psi _{\textbf{x},\textbf{s}}\right\rangle \xrightarrow {\text {rotate}\,(\textbf{w})} \left| \psi _{\textbf{x},\textbf{s},\textbf{w}}\right\rangle \xrightarrow {\text {measure}} \left\langle \hat{O}\right\rangle \xrightarrow {\text {segments}\,\{y_i\}} \text {class}\).

Finally, while the full scheme could be written as

$$\begin{aligned} \left| \psi _{\textbf{x}, \textbf{s},\textbf{w} }\right\rangle = \exp \left[ i \hat{H}\right] \left| 0\right\rangle , \end{aligned}$$
(8)

it is important to note that \(\hat{H}\) is highly nonlinear in the input \(\textbf{x}\) due to the Baker-Campbell-Hausdorff (BCH) formula. Our scheme, in contrast, relies only on the simple linear encoding of Eq. (4).
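A quick numerical illustration of this nonlinearity (the generators and weight values below are our own illustrative choices): if \(\hat{H}(x)\) were affine in a scalar input x, the effective generators would satisfy \(\hat{H}(x_1)+\hat{H}(x_2)-\hat{H}(0)=\hat{H}(x_1+x_2)\), which the BCH corrections break.

```python
# Sketch: the effective generator of Eq. (8) is nonlinear in the input.
import numpy as np
from scipy.linalg import expm, logm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def U(x, s=0.7, w=0.9):                  # rotate(w) after the linear encode(x, s)
    return expm(1j * w * sz) @ expm(1j * s * x * sx)

def H_eff(x):
    return -1j * logm(U(x))              # H(x) such that U(x) = exp(i H(x))

print(np.allclose(H_eff(0.5) + H_eff(0.3) - H_eff(0.0), H_eff(0.8)))  # False
```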

3.4 Training

To perform the training, we define a loss function that penalizes misclassified data of the training set

$$\begin{aligned} E=\sum _{i \in T}\left( \left\langle \hat{O}\right\rangle _i-Y_i\right) ^2~~, \end{aligned}$$
(9)

while correctly classified data do not contribute to its value. Here, T is the set of misclassified data points of the training set, and \(Y_i\) is the upper or lower edge of the spectral segment that characterizes the correct class for the ith point. In Sect. 6.3, for convenience, we use the cross-entropy loss function.

The optimization of the parameters implies a minimization of E, which is achieved (in all analyses apart from the examples in Sect. 6.3) by gradient descent. The landscape of E, though, contains a number of local minima, and, when starting from a random initial point in the space of parameters \((\textbf{s},\textbf{w})\), the procedure might get trapped in one of them. To improve the minimization, we use a sample of 50 initial points and pick the best result among all runs of gradient descent; a minimal sketch is given below. When dealing with real-world data using a qutrit (in Sect. 6.3) and comparing its outcome with the one obtained with classical models, a more advanced stochastic gradient descent is applied.
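The sketch below illustrates the multi-start procedure for a two-weight qubit scheme (one encoding weight s, one z-rotation weight w); this model is our own simplification in the spirit of model A, and gradients are taken by central finite differences rather than analytically:

```python
# Multi-start gradient descent for an illustrative two-weight qubit scheme.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)

def mean_O(x, p):                        # <sigma_z> of Eq. (7) for the point x
    s, w = p
    psi = expm(1j * w * sz) @ expm(1j * s * (x[0] * sx + x[1] * sy)) @ ket0
    return np.vdot(psi, sz @ psi).real

def loss(p, X, labels):                  # Eq. (9); labels in {-1, +1}, threshold 0
    e = 0.0
    for x, y in zip(X, labels):
        m = mean_O(x, p)
        if np.sign(m) != y:              # only misclassified points contribute
            e += m ** 2                  # Y_i = 0, the edge of the correct segment
    return e

def train(X, labels, starts=50, steps=200, lr=0.2, eps=1e-4):
    best_e, best_p = np.inf, None
    for _ in range(starts):              # restarts to escape local minima
        p = np.random.uniform(-np.pi, np.pi, size=2)
        for _ in range(steps):           # plain gradient descent
            grad = np.array([(loss(p + eps * d, X, labels)
                              - loss(p - eps * d, X, labels)) / (2 * eps)
                             for d in np.eye(2)])
            p = p - lr * grad
        e = loss(p, X, labels)
        if e < best_e:
            best_e, best_p = e, p
    return best_e, best_p
```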

4 Lossless memory dimension of different encoding-rotation models

In this section, we compare different encoding models using a measure of capacity with a clear theoretical meaning that is also suitable for numerical evaluation. Our aim is not to compare accurately with the capacity of classical neural networks (Wright and McMahon 2020), but to identify the optimal way of introducing the trainable parameters in the encoding and rotating stages, Eqs. (4) and (6), of the proposed scheme. Due to limited computational capacity, our numerical tests are not “exhaustive” but indicative.

We employ a recently suggested measure (Friedland and Krell 2018), constructed for evaluating the informational/memory capacity of multi-layered classical neural networks: the so-called LM dimension. This is a generalization of the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971), based on the work of MacKay (Mackay 2003), which embeds memory capacity into the Shannon communication model. The definition of the LM dimension (Friedland and Krell 2018) is the following:

  • The LM dimension \(D_{LM}\) is the maximum integer such that, for any dataset with cardinality \(n \le D_{LM} \) and points in random position, all possible labelings of this dataset can be represented with a function in the hypothesis space.

  • A set of points \(\left\{ x_n \right\} \) in K-dimensional space is in random position if and only if, from any subset of size \(< n\), it is not possible to infer anything about the positions of the remaining points.

For this measure, the authors showed analytically that the upper limit of the LM dimension scales linearly with the number of parameters of a classical neural network, with a proportionality factor of unity. In practice, a training method cannot be perfect; therefore, this linear dependence persists with a lower proportionality factor. For more details and the informational meaning of this measure, we refer the interested reader to the original work of Friedland and Krell (2018).

Table 2 Qutrit models

For quantum models, where analytical calculations are not available, we proceed with the numerical evaluation of the LM dimension, which we denote as \(\tilde{D}_{LM}\). Naturally, \(\tilde{D}_{LM}\) lower-bounds \(D_{LM}\), as can be understood from the procedure that we follow for each encoding-rotation model under test:

  • We set the input dimension k of the model. According to our general model for a qudit, we have \(k\le d^2-1\). We generate a set of points \(\left\{ x_n \right\} \) in random position, which we call a random pattern, by selecting each of the k coordinates from a uniform distribution on the interval \(\left[ -0.5, 0.5\right] \). We start with \(n\le P\), where \(P=S+W\) is the total number of parameters.

  • According to the definition of the LM dimension, we treat only binary classification tasks, and we randomly attribute the vectors of the random pattern to two groups (labels). For a given random pattern, one should test all \(2^n\) different labelings; not having the computational capacity for this, when \(n>6\) we perform our estimate by taking a sample of 50 different random labelings.

  • If the training of the parameters via gradient descent, with 50 different starting points, does not lead to classification of the vectors of the random pattern into the two groups with a \(100\%\) success ratio, we repeat with other random patterns \(\left\{ x_n \right\} \) until we find a pattern that is successfully classified for all possible labelings. However, we do not exceed 10 different random patterns under test.

  • The number n is increased step-wise up to the point where the classification is no longer successful for any tested random pattern. The empirical LM dimension \(\tilde{D}_{LM}\) is the highest n for which the classification is achieved for at least one random pattern (all possible labelings); the procedure is sketched in code after this list.
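The loop below sketches this procedure. The routine `perfectly_fits(X, labels)` is a placeholder of ours for the full multi-start training of Sect. 3.4, returning True only when a \(100\%\) training success ratio is reached:

```python
# Sketch of the empirical LM-dimension estimation loop described above.
import itertools
import numpy as np

def estimate_lm_dimension(perfectly_fits, k, n_max,
                          n_patterns=10, n_labelings=50, rng=None):
    rng = rng or np.random.default_rng(0)
    d_lm = 0
    for n in range(1, n_max + 1):                    # step-wise increase of n
        shattered = False
        for _ in range(n_patterns):                  # at most 10 random patterns
            X = rng.uniform(-0.5, 0.5, size=(n, k))  # points in random position
            if n <= 6:                               # all 2^n labelings ...
                labelings = itertools.product([-1, 1], repeat=n)
            else:                                    # ... else a sample of 50
                labelings = (rng.choice([-1, 1], size=n)
                             for _ in range(n_labelings))
            if all(perfectly_fits(X, np.asarray(lab)) for lab in labelings):
                shattered = True                     # one shattered pattern suffices
                break
        if not shattered:
            return d_lm                              # highest n that was shattered
        d_lm = n
    return d_lm
```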

In Table 1, we present the results of the empirical estimation of the LM dimension for a qubit. The generators \(\hat{g}_i\) of the algebra of a qubit system are identified with the Pauli operators: \( \hat{g}_1=\hat{\sigma }_x, ~\hat{g}_2=\hat{\sigma }_y,~\hat{g}_3=\hat{\sigma }_z ~~.\) For all qubit schemes under study, the classification is performed by measuring the operator \(\hat{g}_3\), with eigenvalues \(\left\{ -1,1\right\} \) and corresponding eigenvectors \(\left\{ \left| 1\right\rangle ,\left| 0\right\rangle \right\} \). The two groups of data are separated according to \(\left\langle \hat{g}_3\right\rangle \gtrless 0\). For comparison with our single-layer model, we have also included models (E-G, Table 1) which implement re-uploading of the input data (Pérez-Salinas et al. 2020; Wach et al. 2023).

We proceed with the estimation of \(\tilde{D}_{LM}\) for a qutrit, with results presented in Table 2. The generators \(\hat{g}_i\) of the SU(3) group can be chosen to be the Gell-Mann operators \(\hat{\lambda }_i\), \(i=1,\ldots ,8\), which are provided in matrix form in the Appendix. According to Sect. 3, during the encoding phase the classical data are mapped onto the Bloch hyper-sphere of a qutrit, embedded in an 8-dimensional space which cannot be visualized. To obtain a partial visualization, as, for example, in Sect. 6.1.1, we use the Bloch ball representation offered by the su(2) subalgebra of su(3), spanned by the generators \(\left\{ \hat{L}_x= \hat{\lambda }_1+\hat{\lambda }_6,~\hat{L}_y=\hat{\lambda }_2+\hat{\lambda }_7,~\hat{L}_z=\hat{\lambda }_3+\sqrt{3}\hat{\lambda }_8\right\} \). For all schemes, we choose to measure the operator \( \hat{L}_z=\hat{\lambda }_3+\sqrt{3}\hat{\lambda }_8\), which is diagonal in the computational basis with the uniform spectrum \(\left\{ -2,0,2\right\} \). For binary classification, as in Table 2, we separate the two groups according to the sign of \(\left\langle \hat{L}_z \right\rangle \).
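For reference, a short sketch constructing the Gell-Mann matrices (in the standard ordering, which we assume matches the Appendix) and the measured operator \(\hat{L}_z\):

```python
# Gell-Mann matrices lam[1..8] and L_z = lambda_3 + sqrt(3)*lambda_8.
import numpy as np

lam = np.zeros((9, 3, 3), dtype=complex)      # lam[0] unused, lam[1..8] Gell-Mann
lam[1][0, 1] = lam[1][1, 0] = 1
lam[2][0, 1], lam[2][1, 0] = -1j, 1j
lam[3][0, 0], lam[3][1, 1] = 1, -1
lam[4][0, 2] = lam[4][2, 0] = 1
lam[5][0, 2], lam[5][2, 0] = -1j, 1j
lam[6][1, 2] = lam[6][2, 1] = 1
lam[7][1, 2], lam[7][2, 1] = -1j, 1j
lam[8] = np.diag([1, 1, -2]) / np.sqrt(3)

Lz = lam[3] + np.sqrt(3) * lam[8]             # diag(2, 0, -2)
print(np.linalg.eigvalsh(Lz))                 # [-2.  0.  2.]
```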

With regard to efficiency, the single-layer schemes achieving \(\tilde{D}_{LM}=2 P\) can be considered the most successful ones, i.e., qubit models A and H and qutrit model D2. From qubit models C and D and qutrit model D1, we may conclude that neither the absence of parameters in the encoding phase nor an excessive number of them is recommended. We also observe that the most successful single-layer models are those with \(k \approx d^2-1\).

For classical neural networks with a fixed input dimension k, one can linearly increase the LM dimension with the number of parameters by adding hidden layers (Friedland and Krell 2018). For the model presented here, this becomes possible by using a qudit system with \(k<d^2-1\). The scaling \(\tilde{D}_{LM}=2 P\) is not maintained; one rather achieves \(\tilde{D}_{LM}\approx P\), as shown for \(k=2\) with qutrit model B. An alternative way to increase the LM dimension is to use re-uploading, see qubit models E-G, but there the scaling \(\tilde{D}_{LM}=2 P\) is also not achieved, but rather \(\tilde{D}_{LM}= P+L\), with L a constant. This analysis confirms the findings in Wright and McMahon (2020) and underlines the need for more research on identifying quantum models which exceed the classical limits.

Finally, for single-layer models with \(k\approx d^2-1\), we see that \(\tilde{D}_{LM}\) is higher than that of the classical neural network. It would be interesting to see whether more exotic classical perceptron models, such as product units (Durbin and Rumelhart 1989; Dellen et al. 2019) or complex-valued perceptrons (Dellen and Jaekel 2021), exhibit a similar augmentation of the LM dimension. In addition, we underline the fact that the LM dimension only captures a specific aspect of the model. For a complete evaluation of a quantum model for supervised learning tasks, other aspects (Abbas et al. 2021) would have to be taken into account, e.g., the difficulty of training (the barren plateaus problem) and the presence of noise in the implementation.

The conclusions of the numerical studies on the LM dimension are illustrated with examples in the next sections. Since the LM dimension only concerns the capacity for binary classification tasks, we address classification problems with \(M>2\) classes as well.

5 Classification problems treated with a qubit

We start the illustration of the suggested method by addressing two typical classification problems with a qubit (\(d=2\)). Even though the power of a qubit has been extensively studied in the literature, this is the first example showing that a qubit can be logically complete, i.e., able to implement all binary logical functions. This is achieved with model A, Table 1, which contains two real parameters. This outcome does not come as a surprise, since model A has LM dimension \(\tilde{D}_{LM}=4\) for \(k=2\); in other words, it can shatter four 2-dimensional vectors in random position in all possible ways.

5.1 Binary logical functions

Let us consider four datasets on a plane (\(k=2\)), as shown in Fig. 1(a). The logical functions for these noisy data correspond to different attributions of each dataset to one of two groups, A and B. For instance, the XOR function requires a classification of the datasets as in Fig. 1(a).

To implement classification according to the logical functions, we first map the data onto the 2-dimensional surface of the Bloch sphere. Even if the feature space has the same dimension as the initial space, the change in topology proves to be helpful. Numerical tests show that all logical functions can be implemented this way with 2 real weights (\(S=1\) and \(W=1\)). In more detail, we use the encoding and rotation of qubit model A, see Table 1, and the classification is conducted using the sign of \(\left\langle \hat{g}_3 \right\rangle \).

We successfully solved the classification problems for all logical functions (AND, OR, XOR); however, in Fig. 1(b) we present only the results for XOR, which is the most challenging task, since it is not a linearly separable problem. The total number of data points is 2000 and we use \(4\%\) of them for the training. A \(100\%\) success ratio of classification was readily achieved.

Fig. 1

(a) Data to be classified according to the XOR logical function, into groups A and B. (b) The classified data mapped on the surface of the Bloch sphere (projection on the x-z plane) after training of the 2 weights has been performed

It is important to note here that all binary logical functions can also be solved with 2 real parameters by the complex perceptron model presented in Dellen and Jaekel (2021). We proceed with an example that, to our knowledge, is not solvable with any single-layer classical perceptron model.

5.2 Classification for circular boundaries

We proceed with a more complex classification problem and show that it can still be tackled with a single qubit. For this purpose, we employ model B, Table 1, because it achieves a higher LM dimension than model A.

The problem consists of classifying the data (1000 2-dimensional vectors) in Fig. 2(a) into two groups. In Fig. 2(b), we present the classification achieved on the Bloch sphere after the weights \(s_1, w_1, w_2, w_3\) have been optimized. A classification ratio of \(100\%\) is achieved using \(10\%\) of the total dataset for training.

Fig. 2

(a) The initial data (1000 points) to be classified into groups A and B. (b) The data are mapped on the Bloch surface and perfectly classified

Following the same encoding-rotation scenario (model B), we are able to treat elliptical data (not presented here), but with a lower final classification ratio (\(\approx 90\%\)).

6 Examples solved with a qutrit

Even though we have been able to solve a couple of basic classification problems with one qubit, it is obvious that one needs a higher-dimensional space to resolve more complicated problems, since one qubit can accommodate at most 5 parameters/weights according to our single-layer model. As shown in Sect. 4, qutrits may accommodate more parameters and therefore achieve a higher LM dimension. In addition, tests have shown that qutrit models perform better than qubit models for classification tasks into \(M>2\) groups. This is not obvious from studying the LM dimension alone.

6.1 Noisy XOR

We first investigate the binary classification task presented in Fig. 3(a), for which all qubit models exhibited low performance but for which qutrit model B, see Table 2, gives adequate results. More specifically, we use \(1\%\) of the total data (2000 points) for training and achieve a classification success ratio of \(96\%\).

Fig. 3

Qutrit model: (a) classification into 2 groups employing 8 weights, (b) classification into 3 groups employing 9 weights

6.1.1 Classification into three groups and a geometric picture

We increase the difficulty of the previous problem by demanding classification into 3 groups of data and reducing the margins between the sets, as shown in Fig. 3(b). We use a comparable number of weights (9), but now the encoding-rotation model is:

  • Encoding via

    $$\begin{aligned} \left| \psi _{\textbf{x}, \textbf{s} }\right\rangle = \exp \left[ i x_1 ( s_1 \hat{g}_3 +s_2 \hat{g}_5+s_3 \hat{g}_7) +i x_2( s_4 \hat{g}_4 +s_5 \hat{g}_6+s_6 \hat{g}_8)\right] \left| 0\right\rangle ~~. \end{aligned}$$
    (10)
  • Rotation via

    $$\begin{aligned} \left| \psi _{\textbf{x}_l, \textbf{s},\textbf{w} }\right\rangle = \exp \left[ i \sum _{j=1}^3 w_j \hat{L}_j\right] \left| \psi _{\textbf{x}_l, \textbf{s} }\right\rangle \end{aligned}$$
    (11)

    where \(\hat{L}_1=\hat{L}_x,~\hat{L}_2=\hat{L}_y,~\hat{L}_3=\hat{L}_z\).

  • Measurement of \(\hat{L}_z\) and classification by comparing the value of \(\left\langle \hat{L}_z\right\rangle \) with the segments \(A:\left[ -2, -2+4/3\right] \), \(B:\left[ -2+4/3,2-4/3\right] \), \(C:\left[ 2-4/3,2\right] \); a code sketch of this scheme follows the list.
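Putting the three steps together, a minimal sketch of the decision function (reusing the Gell-Mann array `lam` and `Lz` from the sketch in Sect. 4; the thresholds implement the equal segments A, B, C of \([-2,2]\) given above):

```python
# Sketch of the three-class qutrit scheme, Eqs. (10)-(11).
import numpy as np
from scipy.linalg import expm

ket0 = np.array([1, 0, 0], dtype=complex)
Lx, Ly = lam[1] + lam[6], lam[2] + lam[7]

def classify(x, s, w):
    enc = expm(1j * (x[0] * (s[0] * lam[3] + s[1] * lam[5] + s[2] * lam[7]) +
                     x[1] * (s[3] * lam[4] + s[4] * lam[6] + s[5] * lam[8]))) @ ket0  # Eq. (10)
    psi = expm(1j * (w[0] * Lx + w[1] * Ly + w[2] * Lz)) @ enc                        # Eq. (11)
    m = np.vdot(psi, Lz @ psi).real                                                   # <L_z>
    return 'A' if m < -2 / 3 else ('B' if m <= 2 / 3 else 'C')
```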

Using \(4\%\) of the total data (2000 points) for training, a classification success ratio of \(87\%\) is achieved for the rest of the data. In Fig. 4, we depict the mapping (with optimized parameters) of the data onto the SU(2) Bloch ball generated by the \(\left\{ \hat{L}_x,\hat{L}_y,\hat{L}_z\right\} \) operators. The classification “intervals” for \(\left\langle \hat{L}_z\right\rangle \) are also presented in the picture as horizontal lines. This “local” picture offered by the subgroup is equivalent to the picture one would obtain by inspecting the local density matrix of an entangled system. One can thus claim, borrowing terms from the notion of generalized entanglement (Barnum et al. 2004), that the self-entanglement of a qutrit plays the same role in the classification procedure as physical entanglement between subsystems, i.e., it extends the mapping from the surface to the interior of the Bloch hyper-sphere of a subsystem. The generation of self-entanglement in a qudit does require the ability to fully operate the system, but in practice this is less demanding than an entangling interaction between subsystems.

Fig. 4

The classification of the data of Fig. 3(b) into 3 groups, as perceived in the SU(2) Bloch sphere representation provided by the operators \(\left\{ \hat{L}_x,\hat{L}_y,\hat{L}_z\right\} \)

6.2 Classifying moon sets with a qutrit

Finally, using qutrit model C of Table 2, we attempt a common classification task, that of moon sets. By optimizing the 8 parameters of the model, we achieve a classification ratio of \(90\%\) using \(10\%\) of the 800 total data points for training. In Fig. 5, we present \(\left\langle \hat{L}_z\right\rangle \) for the optimized set of parameters, together with the datasets.

Fig. 5

Classification of moon sets with the qutrit model C using 8 weights. A contour plot of \(\left\langle \hat{L}_z\right\rangle \) is depicted together with the moon data after optimization has been performed

6.3 Real-world data multi-class classification

We now turn to multi-class classification tasks using real-world data and more advanced training methods. We use datasets from the UCI Machine Learning Repository, a widely used and publicly available repository (Dua and Graff 2017) maintained by the University of California, Irvine. Our aim is to explore the feasibility of using a single qutrit to accurately distinguish between three classes in datasets with more than two dimensions, such as the Iris and Wine datasets. Our results illustrate that supervised learning in the context of less structured data is achievable.

Table 3 Real-world data treated with a qutrit

The Iris dataset consists of 150 samples of iris flowers, with measurements of four features: sepal length, sepal width, petal length, and petal width. Each sample is labeled with one of three possible iris species. The Wine Cultivars dataset consists of measurements of thirteen chemical constituents found in three different wine cultivars. The objective is to classify the cultivar of the wine based on the chemical composition measurements.

Our aim is to use the same encoding and number of parameters for both datasets. Thus, for the Wine Cultivars dataset, which possesses 13 different features, we employ Principal Component Analysis (PCA) (Hotelling 1933) in order to reduce the number of features to four (Table 3).

The encoding and rotating scheme that we follow is

  • Encoding via

    $$\begin{aligned} \left| \psi _{\textbf{x}_l, \textbf{s} }\right\rangle = \exp \left[ i s \sum _{j=1}^4 x_j ~ \hat{g}_{j} \right] \left| 0\right\rangle ~~. \end{aligned}$$
    (12)
  • Rotation via

    $$\begin{aligned} \left| \psi _{\textbf{x}_l, \textbf{s},\textbf{w} }\right\rangle = \exp \left[ i \sum _{j=5}^8 ~ w_{j-4} ~ \hat{g}_j\right] \left| \psi _{\textbf{x}_l, \textbf{s} }\right\rangle \end{aligned}$$
    (13)

    where the five parameters \((s, w_1, w_2, w_3, w_4)\), i.e., the encoding weight of Eq. (12) together with the four rotation weights, are those to be optimized.

We ensured an equal representation of each class. For reproducibility, we used the same seed to split the data into train and test sets. To avoid overfitting, early stopping was employed, and different gradient-based methods were trialed to combat the barren plateau problem before settling on stochastic gradient descent (SGD) (Schuld et al. 2019) using the parameter-shift rule. This method reduces the number of measurements needed during implementation compared to the standard method, making it more efficient and practical for quantum machine learning. SGD is a variant of gradient descent that randomly selects a subset of data points, called a mini-batch, to calculate the gradient of the cost function at each iteration.
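A hedged sketch of this training loop is given below. The routine `batch_loss(params, batch)` is a placeholder of ours, not a library API; the shift rule shown is exact for gates \(e^{-i\theta G/2}\) whose generator G has eigenvalues \(\pm 1\) (e.g., Pauli rotations), while generators with other spectra, such as general Gell-Mann combinations, require generalized shift rules:

```python
# Sketch of SGD with the parameter-shift rule (Schuld et al. 2019).
import numpy as np

def parameter_shift_grad(f, params, shift=np.pi / 2):
    """Shift-rule gradient of f(params).

    Exact when f is a circuit expectation value and each parameter enters
    through a gate exp(-i*theta*G/2) with G^2 = identity; for a composite loss
    such as cross-entropy, the rule is applied to each expectation value and
    combined via the chain rule.
    """
    grads = np.zeros(len(params))
    for j in range(len(params)):
        plus, minus = params.copy(), params.copy()
        plus[j] += shift
        minus[j] -= shift
        grads[j] = 0.5 * (f(plus) - f(minus))
    return grads

def sgd_step(params, batch_loss, n_train, batch_size=16, lr=0.1, rng=None):
    """One SGD iteration: gradient on a random mini-batch of training indices."""
    rng = rng or np.random.default_rng()
    batch = rng.choice(n_train, size=batch_size, replace=False)
    return params - lr * parameter_shift_grad(lambda p: batch_loss(p, batch), params)
```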

Table 4 Comparative numerical studies for classification of Iris and Wine data

Since these are multi-class problems, the categorical cross-entropy was used as the cost function, which combines the softmax activation and the negative log-likelihood loss as follows:

$$\begin{aligned} \text {{loss}} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{C} t_{ij} \log (p_{ij})~. \end{aligned}$$
(14)

Here N is the number of samples, C is the number of classes, \(t_{ij}\) is the true (one-hot) label for sample i and class j, and \(p_{ij}\) is the predicted probability.
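In code, Eq. (14) applied to raw model scores reads as follows (a direct transcription; the softmax step and the small constant guarding \(\log (0)\) are standard numerical additions of ours):

```python
# NumPy transcription of Eq. (14).
import numpy as np

def categorical_cross_entropy(logits, t):
    """logits: (N, C) raw model scores; t: (N, C) one-hot true labels."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilized softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # p_ij
    return -np.mean(np.sum(t * np.log(p + 1e-12), axis=1))  # Eq. (14)
```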

Using this approach, we achieved competitive scores with a single qutrit, as can be seen in Table 4. In these benchmarks, we compare the single-qutrit model against a classical machine learning model using Support Vector Machines (SVM) and against a Variational Quantum Classifier (VQC) model with entangled qubits in Qiskit. The VQC tests were conducted using four qubits and the popular ZZ feature map with twelve parameters, utilizing the Limited-memory Broyden-Fletcher-Goldfarb-Shanno Bound (L-BFGS-B) optimizer (Liu and Nocedal 1989) to reduce sensitivity to local minima and the barren plateau issue.

These results showcase that even a single-qudit classifier is capable of multi-class classification of multi-dimensional real-world data. Although the four-qubit model outperformed the single-qutrit model on the Iris dataset, on the Wine dataset the single-qutrit model produced better results with its five parameters than the VQC with the twelve used by the ZZ feature map. Increasing the number of encoding layers could further enhance the classifier’s performance, but since the aim of our study is to demonstrate that a single-layer qudit classifier can accurately distinguish between multiple classes, this is not investigated further here.

7 Discussion

Qudits are extensions of qubits to higher dimensions, which can enhance performance in quantum computing (Gokhale et al. 2019; Pavlidis and Floratos 2021; Gedik et al. 2015; Kunzhe et al. 2018; Wang et al. 2020), gate decomposition (Fedorov et al. 2012), error correction (Rosenblum 2018), communication (Cozzolino et al. 2019; Sheridan and Scarani 2010; Amblard and Arnault 2015) and variational algorithms (Deller et al. 2023; Weggemans et al. 2022). They are experimentally realizable on different physical platforms, and recent proposals also use them in quantum machine learning (Wach et al. 2023; Useche et al. 2022). In this work, we have described a model for data classification using a single qudit. The parametrization is introduced according to geometric intuition, partially for controlling the mapping onto the Bloch hyper-surface and partially for adjusting the projective measurement to the dataset’s orientation on the Bloch hyper-sphere. Entangling or adding more layers could certainly enhance the quantum classifier, similar to how classical neural networks yield better results with increased depth. Nonetheless, given the expense and error-prone nature of entangling operations in near-term quantum hardware, our results indicate that even a low-depth single-qudit classifier holds promise for quantum machine learning, provided it is thoughtfully employed with a balanced distribution of parameters between the encoding and rotating steps.

The simple model that we present shares obvious similarities with, and borrows ideas from, previous works (Pérez-Salinas et al. 2020; Farhi and Neven 2018; Havlíček et al. 2019; Wach et al. 2023). Being, though, only midway through the exploration of the potential role of quantum systems in ML tasks, this geometrically dressed, entanglement-free proposal makes its own contribution, connecting current efforts with the geometry of Hilbert-Schmidt space and underlining the practical equivalence of self-entanglement (Barnum et al. 2004) with physical entanglement. It remains an open question for future work to investigate and compare the capacity of the quantum model with more intricate single-layer classical perceptron models, but also to investigate whether quantum multi-layer structures exist which can maintain the advantage in LM dimension over classical NN.