1 Introduction

Quantum computing is a rapidly evolving field that promises to revolutionize various domains, and finance is no exception. There is a variety of computationally hard financial problems for which quantum algorithms can potentially offer advantages (Herman et al. 2022; Egger et al. 2020; McKinsey & Company 2021; Bouland et al. 2020), for example in combinatorial optimization (Leclerc et al. 2022; Rebentrost and Lloyd 2018), convex optimization (Kerenidis et al. 2019; Rebentrost et al. 2022), Monte Carlo simulations (Doriguello et al. 2022; Suzuki et al. 2020; Giurgica-Tiron et al. 2022), and machine learning (Pistoia et al. 2021; Emmanoulopoulos and Dimoska 2022; Alcazar et al. 2020; Nguyen and Chen 2022).

In this work, we explore the potential of quantum machine learning methods in improving the performance of forecasting in finance, specifically focusing on two use cases within the business of Itaú Unibanco, the largest bank in Latin America.

In the first use case, we aim to improve the performance of Random Forest methods for churn prediction. We introduce quantum algorithms for Determinantal Point Processes (DPP) sampling (Kerenidis and Prakash 2022), and develop a method of DPP sampling to enhance Random Forest models. We evaluate our model on the churn dataset using classical DPP sampling algorithms and perform experiments on a scaled-down version of the dataset using quantum algorithms. Our results demonstrate that, in the classical setting, the proposed algorithms outperform the baseline Random Forest in precision, efficiency, and bottom line, and also offer a precise understanding of how quantum computing can impact this kind of problem in the future. The quantum algorithm run on an IBM quantum processor gives results similar to those of the classical DPP for small batch dimensions, but its performance degrades as the dimensions grow because of hardware noise.

In the second use case, we aim to explore the performance of neural network models for credit risk assessment by incorporating ideas from quantum compound neural networks (Landman et al. 2022). We start by using quantum orthogonal neural networks (Landman et al. 2022), which add the property of orthogonality for the trained model weights to avoid redundancy in the learned features (Arjovsky et al. 2016). These orthogonal layers, which can be trained efficiently on a classical computer, are the simplest case of what we call compound neural networks, which explore an exponential space in a structured way. For our use case, we design compound neural network architectures that are appropriate for financial data. We evaluate their performance on a real-world dataset and show that the quantum compound neural network models both have far fewer parameters and achieve better accuracy and generalization than classical fully connected neural networks.

This paper is organized as follows: In Sections 2–4, we focus on the churn prediction use case and present the DPP-based quantum machine learning methods. In Sections 5–7, we present quantum neural network models for risk assessment. Finally, in Section 8, we conclude the paper and discuss potential future research directions.

2 DPP-enhanced Random Forest models for churn prediction

2.1 DPP-Random Forest model

The Random Forest algorithm was introduced in 2001 by Breiman (2001) and has easily become one of the most popular supervised machine learning algorithms in use. It consists of an ensemble of decision trees, each trained on a uniform subsample of rows and columns from the dataset.

In this section, we propose an extension of the Random Forest, called the DPP-Random Forest (DPP-RF), which utilizes Determinantal Point Processes (DPPs) instead of uniform sampling to subsample rows and columns for individual decision trees. In the original RF algorithm, subsampling makes the model more robust to variance in the training data; however, the use of uniform sampling runs the risk of improperly representing the dataset and missing under-sampled areas. DPP sampling better preserves the diversity of the dataset, and corrects for sampling bias (Kulesza and Taskar 2012). We first introduce the theory and techniques of DPP sampling, then present the algorithm.

2.2 Determinantal point processes

We will now introduce the Determinantal Point Process (DPP), which lies at the core of the methodology behind our solution to the churn problem. DPPs are a class of probabilistic models that can be used to sample diverse subsets of items from a larger set. They were first formalized by Macchi in 1975 as a way to model fermions in quantum mechanics (Macchi 1975). More recently, these models are showing increasing promise in the context of machine learning (Kulesza and Taskar 2012), where they can be used for a variety of tasks, such as building unbiased estimators for linear regression (Derezinski and Mahoney 2021), performing Monte Carlo estimation (Bardenet and Hardy 2020), and promoting diversity in model outputs (Elfeki et al. 2019).

2.2.1 Definitions

A point process P on a set Y is a probability measure over the subsets of Y. Sampling from a point process on Y will produce some subset \(S \subseteq Y\) with probability P(S). A repulsive point process is a point process in which points that are more similar to each other are less likely to be selected together.

A determinantal point process (DPP) is a particular case of a repulsive point process, in which the selection probability of a subset of items \(T \subseteq Y\) is given by a determinant. Given a real, symmetric \(n \times n\) matrix K indexed by the elements of Y:

$$ P\{T \subseteq S \} = \det (K_{T,T}) \ , $$

where \(K_{T,T}\) denotes the \(|T | \times |T |\) submatrix indexed by the set T and n is the cardinality of Y. In other words, the marginal distribution \(P\{T \subseteq S\}\) is defined by the subdeterminants of K.

The above is the most general definition, but in machine learning, we typically focus on a slightly more restrictive class of DPPs called L-ensembles. In L-ensembles, the whole distribution, not just the marginals, is given by the subdeterminant of a real, symmetric \(n \times n\) matrix L.

$$ P\{S\} \propto \det (L_{S,S}) \ . $$

Just like K, L is indexed by the elements of Y. Because of some convenient properties of the determinant (Kulesza and Taskar 2012), we can explicitly write down the distribution of an L-ensemble:

$$ P\{S\} = \frac{\det (L_{S,S})}{\det (L+I)} \ . $$
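As a small numerical illustration of the two definitions above, the following NumPy sketch enumerates every subset of a five-element ground set and checks that the L-ensemble probabilities sum to one. It also compares the inclusion probability of a single item with the marginal kernel, assuming the standard relation \(K = L(L+I)^{-1}\) from Kulesza and Taskar (2012), which is not spelled out in the text. The random kernel and the helper name `l_ensemble_prob` are illustrative choices only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
L = A @ A.T  # a random symmetric positive semidefinite kernel (illustrative)


def l_ensemble_prob(L, S):
    """P{S} = det(L_{S,S}) / det(L + I); the empty submatrix has determinant 1."""
    S = list(S)
    numerator = np.linalg.det(L[np.ix_(S, S)]) if S else 1.0
    return numerator / np.linalg.det(L + np.eye(len(L)))


# Enumerate all 2^n subsets of the ground set {0, ..., n-1}.
subsets = [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]
total = sum(l_ensemble_prob(L, S) for S in subsets)

# Marginal kernel (assumed relation K = L (L + I)^{-1}) and the probability
# that item 0 belongs to the sample, computed by brute force.
K = L @ np.linalg.inv(L + np.eye(n))
marginal_0 = sum(l_ensemble_prob(L, S) for S in subsets if 0 in S)

print(f"sum of probabilities: {total:.6f}")
print(f"P(0 in S) = {marginal_0:.6f}, K[0, 0] = {K[0, 0]:.6f}")
```

Brute-force enumeration is only feasible for very small ground sets; it is used here purely to make the normalization visible.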

In machine learning literature, DPPs are typically defined over a set of points \(\textbf{X}\), with each item \(\textbf{x}_i\) a row in the data matrix \(\textbf{X}\). If we preprocess \(\textbf{X}\) such that its columns are orthonormal and choose \(\textbf{L}\) to be the inner-product similarity matrix, i.e., \(L = \textbf{X}\textbf{X}^T\), then the distribution becomes even simpler to write down. Instead of explicitly computing the \(\textbf{L}\) matrix, we can write the distribution in terms of the data matrix \(\textbf{X}\) itself, courtesy of the Cauchy-Binet formula,

$$\begin{aligned} P\{S\} = \frac{\det (\textbf{X}_S)^2}{\det (\textbf{X}\textbf{X}^T + \textbf{I})} \ . \end{aligned}$$

Moreover, the distribution will almost surely produce samples with size d, the rank of the orthogonalized data matrix X. This kind of DPP is denoted d-DPP. We will focus here on an application of sampling from a d-DPP from a data matrix X.
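The fixed-size case can be checked numerically in the same spirit. The sketch below builds a random matrix with orthonormal columns and evaluates \(\det(\textbf{X}_S)^2\) for every subset of size d. Restricted to subsets of exactly d rows, these weights already sum to one by the Cauchy-Binet formula, and each weight equals \(\det(L_{S,S})\) with \(L = \textbf{X}\textbf{X}^T\). The sizes \(n=6\), \(d=3\) and the random data are assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
# Reduced QR gives an n x d matrix with orthonormal columns.
X, _ = np.linalg.qr(rng.normal(size=(n, d)))

# d-DPP weight of every size-d subset of rows.
probs = {S: np.linalg.det(X[list(S), :]) ** 2
         for S in itertools.combinations(range(n), d)}
total = sum(probs.values())

# The same weights written as principal minors of L = X X^T.
L = X @ X.T
max_gap = max(abs(p - np.linalg.det(L[np.ix_(S, S)])) for S, p in probs.items())

print(f"number of subsets: {len(probs)}, total probability: {total:.6f}")
print(f"largest |det(X_S)^2 - det(L_SS)|: {max_gap:.2e}")
```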

2.2.2 Unbiased least squares regression

One unique feature of the DPP compared to i.i.d. sampling techniques is that it can lead to provably unbiased estimators for least squares linear regression (Derezinski 2018; Dereziński et al. 2018). Given an \(n\times d\) data matrix \(\textbf{X}\) and a target vector \(\textbf{y} \in \mathbb {R}^n\), where \(n \gg d\), we wish to approximate the least squares solution \(\textbf{w}^{*} = \textrm{argmin}_\textbf{w} ||\textbf{X}\textbf{w} - \textbf{y}||\). The vector \(\textbf{w}^{*}\) represents the best-fit parameters of a linear model that predicts \(\textbf{y}\).

Surprisingly, if we sample d points S from DPP(\(\textbf{X}\textbf{X}^T\)) and solve the reduced system of equations \(\textbf{y}_S = \textbf{X}_S\textbf{w}\), we get an unbiased estimate of \(\textbf{w}^*\). Formally, if \(S \sim d\text {-DPP}_\textbf{L}(\textbf{X}\textbf{X}^{T})\),

$$\begin{aligned} \mathop {\mathbb {E}}[\textbf{X}_S^{-1} \textbf{y}_S] = \textrm{argmin}_\textbf{w} ||\textbf{X}\textbf{w} - \textbf{y}|| = \textbf{w}^{*} \ . \end{aligned}$$

This allows us to create an “ensemble” of unbiased linear regressors, each trained on a DPP sample. In some regard, this was the inspiration for trying an ensemble of decision trees trained on DPP samples, as detailed in Section 2.1.
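The unbiasedness property can be verified exactly on a toy problem by replacing the expectation with a weighted sum over all size-d subsets. The sketch below does this for a random \(7 \times 3\) regression problem, with subset probabilities proportional to \(\det(\textbf{X}_S)^2\). The data and sizes are illustrative assumptions, and the enumeration stands in for actual DPP sampling.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d = 7, 3
X = rng.normal(size=(n, d))   # toy data matrix
y = rng.normal(size=n)        # toy targets

# Full least squares solution w*.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

# d-DPP probabilities: proportional to det(X_S)^2 over all size-d row subsets.
subsets = list(itertools.combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
weights /= weights.sum()

# Exact expectation of the reduced-system solution X_S^{-1} y_S.
expected_w = sum(p * np.linalg.solve(X[list(S)], y[list(S)])
                 for p, S in zip(weights, subsets))

print("w*          :", np.round(w_star, 6))
print("E[X_S^-1 y_S]:", np.round(expected_w, 6))
```

Each individual subset solution can be far from \(\textbf{w}^{*}\); only their DPP-weighted average coincides with it.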

2.2.3 Algorithms for sampling

There are several efficient algorithms for sampling from DPPs and computing their properties. The naive sampling method — calculating all subdeterminants and performing l2 sampling — takes exponential time. The first major leap in making DPP sampling feasible on today’s computers was the “spectral method” (Kulesza and Taskar 2011; Hough et al. 2006). This algorithm performs an eigendecomposition of the kernel matrix before applying a projection-based iterative sampling approach. Thus, the first sample takes \(O(nd^2)\) time, and subsequent samples take \(O(d^3)\).
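To make the projection-based iterative step concrete, here is a minimal sketch for the special case of a data matrix with orthonormal columns, where no eigendecomposition is needed. At each step a row is chosen with probability proportional to its squared norm in the current basis, and the basis is then restricted to the subspace orthogonal to that row. This is a simplified illustration, not the DPPy implementation; the function name `sample_projection_dpp` and the test sizes are assumptions.

```python
import itertools
import numpy as np


def sample_projection_dpp(X, rng):
    """Draw one size-d sample with P{S} = det(X_S)^2.

    X must be an n x d matrix with orthonormal columns.
    """
    V = X.copy()
    n, d = V.shape
    sample = []
    for step in range(d):
        # Selection probabilities: squared row norms of the current basis.
        p = np.sum(V ** 2, axis=1)
        i = rng.choice(n, p=p / p.sum())
        sample.append(int(i))
        if step == d - 1:
            break
        # Column operations that zero out row i, then drop the used column,
        # leaving a basis of the subspace orthogonal to e_i.
        j = np.argmax(np.abs(V[i]))
        V = V - np.outer(V[:, j] / V[i, j], V[i])
        V = np.delete(V, j, axis=1)
        # Re-orthonormalize the remaining basis vectors.
        V, _ = np.linalg.qr(V)
    return sorted(sample)


rng = np.random.default_rng(5)
n, d = 5, 2
X, _ = np.linalg.qr(rng.normal(size=(n, d)))

# Exact probabilities for comparison with empirical frequencies.
exact = {S: np.linalg.det(X[list(S)]) ** 2
         for S in itertools.combinations(range(n), d)}
n_draws = 5000
counts = {S: 0 for S in exact}
for _ in range(n_draws):
    counts[tuple(sample_projection_dpp(X, rng))] += 1
max_dev = max(abs(counts[S] / n_draws - exact[S]) for S in exact)
print(f"largest deviation between empirical and exact probabilities: {max_dev:.3f}")
```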

Monte Carlo methods have been proposed to approximate the DPP distribution (Anari et al. 2016; Li et al. 2016), though they are not exact, and are often still prohibitively slow with a runtime of \(O(n \text {poly}(d))\) per sample.

In a counter-intuitive result, Derezinski et al. (2019) and Calandriello et al. (2020) proposed methods that avoid performing the full DPP sampling procedure on large parts of the basis set. This approach resulted in a significant reduction in runtime, making DPPs more practical for mid-to-large-scale datasets. These techniques allow exact sampling of subsequent d-DPP samples in \(O(\text {poly}(d))\), independent of the size of the full basis set n. Many of these algorithms are implemented in the open-source DPPy library (Gautier et al. 2019), which we used in the experiments in this paper.

Recent work has shown that quantum computers are in principle able to sample from DPPs in even lower complexity in some cases. We describe this quantum algorithm in Section 4. This and several other algorithms which arise from the techniques introduced in Kerenidis and Prakash (2022) are a budding area of research in the quantum computing space, and will hopefully inspire more applications like the one we describe in this paper. For example, in Kazdaghli et al. (2023), DPPs and deterministic DPPs were used to improve the methods for the imputation of clinical data.

Fig. 1 Steps 1 to 7 of the DPP-Random Forest algorithm

2.3 DPP-RF algorithm outline

In principle, the DPP-RF uses DPP sampling on the whole dataset to select diverse subsets of data on which to train decision trees. However, sampling from a DPP on large datasets (like the entire churn dataset of 174,000 points) can take copious time, especially when using current open-source implementations of DPP sampling. To be able to test these techniques quickly, a novel sampling procedure was developed which preserves many of the benefits of DPPs, but does not require sampling from the full dataset. The procedure can be summarized as follows:

  1. Divide the training set uniformly into smaller batches;
  2. Sample \(S_1 \sim d\)-DPP\((\textbf{X}_{batch}\textbf{X}_{batch}^T)\) data points from every batch;
  3. Sample \(S_2 \sim d\)-DPP\((\textbf{X}_{S_1}^T \textbf{X}_{S_1})\) features;
  4. Train a first group \(G_1\) of \(N_1\) decision trees on these small patches of data;
  5. Aggregate the patches of data resulting from step 2 to create a larger dataset \(\textbf{X}_{agg}\);
  6. Repeat for \(N_2\) times: sample \(S_3 \sim d\)-DPP\((\textbf{X}_{agg}^T \textbf{X}_{agg})\) features to create a long matrix;
  7. Train a second group \(G_2\) of \(N_2\) decision trees on these new datasets;
  8. Combine \(G_1\) and \(G_2\) by aggregating them to make predictions (similar to the classical Random Forest algorithm).
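The row-sampling part of this procedure (steps 1, 2 and 5) can be sketched as follows on toy data. Feature sampling and tree training are deliberately left out, since their exact configuration is not specified here. The sampler is a simple sequential projection-DPP routine written for this illustration, not the DPPy code used in the experiments; all names and sizes are assumptions, and each batch is assumed to have more rows than columns and full column rank.

```python
import numpy as np


def sample_d_dpp_rows(X_batch, rng):
    """Indices of d rows, drawn with probability proportional to det(X_S)^2."""
    V, _ = np.linalg.qr(X_batch)  # orthonormal basis of the column space
    rows = []
    while V.shape[1] > 0:
        p = np.sum(V ** 2, axis=1)
        i = rng.choice(len(V), p=p / p.sum())
        rows.append(int(i))
        # Restrict the basis to the subspace orthogonal to the chosen row.
        j = np.argmax(np.abs(V[i]))
        V = np.delete(V - np.outer(V[:, j] / V[i, j], V[i]), j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(rows)


rng = np.random.default_rng(6)
X_train = rng.normal(size=(100, 4))  # toy data: 100 rows, 4 features
batch_size = 20

perm = rng.permutation(len(X_train))  # step 1: uniform split into batches
patches = []
for start in range(0, len(perm), batch_size):
    batch_idx = perm[start:start + batch_size]
    S1 = sample_d_dpp_rows(X_train[batch_idx], rng)  # step 2: DPP row sample
    patches.append(batch_idx[S1])

agg_idx = np.concatenate(patches)  # step 5: aggregate the sampled patches
X_agg = X_train[agg_idx]
print("rows kept per batch:", [len(p) for p in patches])
print("aggregated dataset shape:", X_agg.shape)
```

Because each batch is small, the cost of every DPP draw depends on the batch size rather than on the size of the full training set, which is the motivation given above for batching.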

3 Classical DPP-RF results

The DPP-RF algorithm was designed for the purpose of predicting customer churn in the bank. In this section, we define this use case and present the results. In addition, we benchmark our proposed DPP-RF method by constructing models on public tabular classification tasks (Fig. 1).

3.1 Use case introduction: churn prediction

Churn, defined as a customer withdrawing more than a certain amount of money in a single month, is a significant concern for retail banks. Our objective is to predict which customers are most likely to churn in the next three months using customer data from the previous six months.

The primary dataset used in this study consists of 304,000 datapoints, with 153 features for each datapoint. Each datapoint represents a banking customer at a particular month in time, with the features representing various aspects of their activity over the previous six months. The target variable is a binary flag indicating whether or not the customer churned in at least one of the following three months. The data was anonymized and standardized before being split into training and test sets based on time period, with 130,000 datapoints being set aside as the test set and 174,000 datapoints used for training. The data was split in a way that did not produce any significant covariate shift between the train and test sets.

With the end goal of preventing churn, the model works by flagging customers with the highest risk of potential churn. For these flagged customers, the bank can deploy a representative to intervene and better understand their needs. However, resource limitations make it necessary to flag a relatively small number of customers with high confidence. The focus of this exploration was to reduce false positives in the flagged customers to increase the efficiency of bank interventions. In terms of the precision-recall trade-off, our model should be tuned to provide the highest possible precision for low recall values. Despite this simplification to a classification problem, the primary business KPI is the amount of withdrawal money correctly captured by the model, as discussed more in Section 3.4.

This use case already had a solution in production: a Random Forest classifier (Breiman 2001), whose performance was used as a benchmark. The model in production already captured a significant amount of churn, but there was clear room for improvement in the amount of withdrawals captured (see Fig. 3). Moreover, given the large number of customers in the dataset and the relative homogeneity of the population of interest, there existed an opportunity to employ techniques that explicitly try to explore diversity in the data.

We focused on three key performance indicators (KPIs): the precision-recall curve, the training time and the bottom line.

3.2 Precision-recall

To evaluate the performance of our proposed method, we optimized hyperparameters and measured the precision for a low fixed recall (6% in this case). As seen in Fig. 2, our method showed an improvement in precision from \(71.6\%\) for the benchmark model to \(77.5\%\) with the new model. Our method also provided similar improvements in precision for the relevant range of small recall.
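For reference, precision at a fixed recall can be computed from model scores as in the sketch below. It ranks observations by score and reports the precision of the smallest top-ranked set whose recall reaches the target. The helper name `precision_at_recall`, the tie handling, and the toy labels and scores are assumptions for illustration and do not come from the churn dataset.

```python
import numpy as np


def precision_at_recall(y_true, scores, target_recall):
    """Precision of the smallest top-scored set whose recall reaches the target."""
    order = np.argsort(-scores, kind="stable")   # highest scores first
    hits = np.cumsum(y_true[order])              # true positives among the top k
    k = np.arange(1, len(scores) + 1)
    recall = hits / y_true.sum()
    precision = hits / k
    first = np.argmax(recall >= target_recall)   # first k reaching the target
    return precision[first]


# Toy example: 4 positives among 10 observations.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
p_at_half = precision_at_recall(y_true, scores, target_recall=0.5)
print(f"precision at 50% recall: {p_at_half:.4f}")
```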

Fig. 2 Precision-recall curve for the test set. Using DPP with the Random Forest algorithm shows an improvement of 5.9 percentage points in precision

3.3 Training time

The DPP-RF model has a longer training time than the traditional Random Forest on a classical computer: training the model with the best hyperparameters took 54 min, compared to 311 s for the benchmark model. The models were trained on a computer with an Intel© Core™ i5-8350U CPU running at 1.70 GHz, 24 GB of RAM and Windows 10 version 21H2, build 19044.2604.

The computational bottleneck in this algorithm is the DPP sampling. Instead of simulating quantum DPP circuits (which is infeasible for large datasets), we used a classical SVD-based sampling algorithm (Hough et al. 2006) implemented in the dppy library (Gautier et al. 2019). We believe that improved classical sampling techniques (Calandriello et al. 2020) and future quantum techniques (Section 4) can reduce the runtime dramatically.

Hyperparameters were selected using a grid search with 5-fold cross-validation over the typical RF parameters n_estimators, max_depth, min_samples_leaf, min_samples_split, max_features, max_samples, and the DPP-RF-specific parameter batch_size. The training time of a DPP-RF model depends heavily on this batch_size parameter, which is the size of the batches from which we take DPP samples. Choosing a batch size higher than 1000 can increase runtime dramatically. Thus, in our hyperparameter search, we limited the batch size to less than 1000.

Within the bank, the churn model is retrained just once every few months, so the training time was not prohibitive. However, faster sampling algorithms still serve to increase the range of feasible hyperparameters (especially the batch size).

3.4 Bottom line — withdrawals captured

From a business perspective, the most direct indicator of the success of the model is the amount of assets under management (AUM) that can be salvaged via interventions. Thus, we evaluated the amount of money withdrawn every month by the 500 customers flagged by the model, i.e., the 500 customers which had the highest predicted probability of churning in one of the following 3 months. As seen in Figs. 3, 4 and 5, our model showed substantial overall improvements. The true financial impact of these predictions is dependent on the success of the interventions as well as the bank’s profit-per-dollar-AUM.

Fig. 3 Classical benchmark (BM) vs DPP-RF solution: money withdrawn per month by the 500 flagged customers, comparing the benchmark model (blue line) to the DPP-RF one (orange line). The green line represents the total amount of money withdrawn by all customers in each month. The purple line is the sum of the 500 largest withdrawals, which is the maximum value that the model could capture. The red line represents the withdrawals captured by randomly flagging 500 observations. The y-axis shows monetary values; its units are omitted for confidentiality

Fig. 4 Classical benchmark vs DPP-RF solution: percentage of total withdrawals captured per month, that is, relative to the green line in Fig. 3. On average over the 11 test months, the BM model captures 61.42% of the total, while the DPP-RF model captures 62.77%, an improvement of 1.35 percentage points

Fig. 5 Classical benchmark vs DPP-RF solution: percentage of the maximum amount that could be captured (given \(\text {n\_flags} = 500\) customers flagged every month), that is, relative to the purple line in Fig. 3. On average over the 11 test months, the BM model captures 69.18% of this maximum, while the DPP-RF model captures 70.72%, an improvement of 1.54 percentage points
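The two ratios reported in Figs. 4 and 5 can be expressed compactly. The sketch below flags the highest-scored customers in a given month and reports the withdrawals captured relative to (i) all withdrawals and (ii) the best achievable capture with the same number of flags. The function name `capture_rates` and the five-customer toy data are hypothetical; no bank data is reproduced.

```python
import numpy as np


def capture_rates(scores, withdrawals, n_flags):
    """Return (share of total withdrawals, share of the maximum capturable)."""
    flagged = np.argsort(-scores)[:n_flags]            # top-scored customers
    captured = withdrawals[flagged].sum()
    best = np.sort(withdrawals)[::-1][:n_flags].sum()  # n_flags largest withdrawals
    return captured / withdrawals.sum(), captured / best


# Toy month with five customers and two flags (illustrative values only).
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
withdrawals = np.array([100.0, 0.0, 50.0, 400.0, 0.0])
share_of_total, share_of_max = capture_rates(scores, withdrawals, n_flags=2)
print(f"share of total: {share_of_total:.4f}, share of maximum: {share_of_max:.4f}")
```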

3.5 Summary of results

The proposed DPP-Random Forest model provides significant improvements in precision and bottom line, while taking significantly longer to train. The results are summarized in Table 1.

3.6 Further benchmarks

We further benchmarked our model on various classification datasets from OpenML. All except one (madelon) of these datasets were used in Grinsztajn et al. (2022) and preprocessed accordingly. They were chosen to be representative of a wide variety of classification tasks. Each dataset was split into train, validation, and test sets. For each model, 400 sets of hyperparameters were randomly chosen and evaluated on the validation set. Both models used the same hyperparameter space, except for the addition of the batch_size parameter for the DPP-RF. The hyperparameters which gave the best results on the validation set were evaluated on the test set, and the results are reported in Table 2. Models were evaluated with the ROC-AUC metric (see Footnote 1).

4 Quantum DPP-RF

4.1 Quantum circuits for determinantal point processes

Classical DPP sampling algorithms have improved significantly since their inception, but there may still be room for improvement. Recent work by Kerenidis and Prakash (2022) has shown that a quantum computer can more natively perform DPP sampling, achieving a gate complexity of O(nd) and a circuit depth of \(O(d\log (n))\) for an orthogonal matrix of size \(n \times d\). The classical time complexity for sampling is \(O(d^3)\) (Hough et al. 2006). Note that when n is very large, then one can reduce the number of rows to \(O(d^2)\) before performing the sampling (Calandriello et al. 2020).

Table 1 Summarized comparison between models
Table 2 Comparison of DPP-Random Forest and Random Forest models for different datasets. Superior results in bold

For a thorough review of the quantum methods and circuits, we refer the reader to Kerenidis and Prakash (2022). The circuit is described in brief below.

Given an orthogonal matrix \(\textbf{X} = (\textbf{x}^1, \textbf{x}^2, \dots , \textbf{x}^d) \in \mathbb {R}^{n\times d}\), the quantum DPP circuit applied on \(\textbf{X}\) performs the following operation:

$$\begin{aligned} \mathcal {D}(\textbf{X})|0^n \rangle = \sum _{\begin{array}{c} |S|=d \\ S \in \{0,1\}^n \end{array}} \det (\textbf{X}_S)| 1_S \rangle \ , \end{aligned}$$

where \(\textbf{X}_S\) is the \(d\times d\) submatrix obtained by selecting the rows of \(\textbf{X}\) indexed by S; \(1_S\) is the characteristic vector of S (with 1’s in the positions indexed by the elements of S) and \(\mathcal {D}(\textbf{X})\) represents the quantum d-DPP circuit, as detailed below.

Fig. 6 Clifford loader circuit \(\mathcal {C}(x)\) for \(x \in \mathbb {R}^8\)

Thus, the probability of sampling S, i.e., of measuring \(|1_S \rangle \), is: \(Pr(S) = \det (\textbf{X}_S)^2 = \det (L_{S,S})\), where \(L=\textbf{X}\textbf{X}^T\). This draws the link between the quantum determinantal sampling circuit and the classical d-DPP model as seen in Eq. 1.

To construct the quantum d-DPP circuit, we need to first introduce a circuit known as a Clifford loader, which performs the following operation:

$$\begin{aligned} \mathcal {C}(\textbf{x}) = \sum _{i=1}^n x_i Z^{i-1} X I^{n-i}, \quad \text {for} \quad \textbf{x} \in \mathbb {R}^n \ . \end{aligned}$$

The Clifford loader was shown to have a log-depth circuit in Kerenidis and Prakash (2022), and is shown for \(n=8\) in Fig. 6, in which the gates represented by vertical lines are RBS gates — parameterized, hamming weight-preserving two-qubit gates.

The full quantum d-DPP circuit is a series of d Clifford loaders, one for each orthogonal column of \(\textbf{X}\):

$$\begin{aligned} \mathcal {D}(\textbf{X}) = \mathcal {C}(\textbf{x}^1) \mathcal {C}(\textbf{x}^2) \dots \mathcal {C}(\textbf{x}^d) \ . \end{aligned}$$

An example of a d-DPP circuit as a series of Cliffords for \(n=4\) is shown in Fig. 7.
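The action of the d-DPP circuit can be checked with a small dense-matrix simulation. The sketch below builds each Clifford loader directly from its operator definition (not from its RBS gate decomposition), applies the loaders for the columns of a random \(4 \times 2\) matrix with orthonormal columns to \(|0^n\rangle\), and compares the resulting measurement probabilities with \(\det(\textbf{X}_S)^2\). The qubit ordering (first qubit as most significant bit) and all names are assumptions of this illustration.

```python
import itertools
from functools import reduce
import numpy as np

I2 = np.eye(2)
Xg = np.array([[0.0, 1.0], [1.0, 0.0]])  # Pauli X
Zg = np.diag([1.0, -1.0])                # Pauli Z


def clifford_loader(x):
    """Dense matrix of C(x) = sum_i x_i Z^{i-1} X I^{n-i} (tensor products)."""
    n = len(x)
    return sum(
        x[i] * reduce(np.kron, [Zg] * i + [Xg] + [I2] * (n - i - 1))
        for i in range(n)
    )


rng = np.random.default_rng(3)
n, d = 4, 2
X, _ = np.linalg.qr(rng.normal(size=(n, d)))  # orthonormal columns

state = np.zeros(2 ** n)
state[0] = 1.0                                 # |0^n>
for col in reversed(range(d)):                 # the rightmost loader acts first
    state = clifford_loader(X[:, col]) @ state

probs = state ** 2
errors, weight_d_mass = [], 0.0
for S in itertools.combinations(range(n), d):
    index = sum(1 << (n - 1 - q) for q in S)   # basis index of |1_S>
    errors.append(abs(probs[index] - np.linalg.det(X[list(S)]) ** 2))
    weight_d_mass += probs[index]
max_error = max(errors)
print(f"largest |Pr(S) - det(X_S)^2|: {max_error:.2e}")
print(f"probability mass on Hamming weight {d}: {weight_d_mass:.6f}")
```

The dense simulation scales as \(2^n\) and is only meant for checking small instances; it says nothing about the gate count or depth of the actual circuit.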

4.2 Hardware experiment results

As a hardware experiment, we aimed to implement a simplified version of our algorithm on a quantum processor. We chose to use the “ibmq_guadalupe” 16-qubit chip, which is only capable of running small quantum DPP circuits for matrices of certain dimensions, such as (4, 2), (5, 2), (5, 3), (6, 2), (8, 2). As a result, we had to reduce the size of our problem.

To accomplish this, we defined reduced train/test sets: a train set of \(\sim \)1000 points from 03/2019 and a test set of \(\sim \)10,000 points from 04/2019. The quantum hardware-ready simplified algorithm is outlined in Fig. 8. It includes the following steps:

  1. Applying PCA to reduce the number of columns from 153 to \(d=2,3\);
  2. Dividing the dataset into batches of \(n=4,5,6,8\) points;
  3. Sampling \(S \sim d\)-DPP\((\textbf{X}_{batch}\textbf{X}_{batch}^T)\) rows from each batch, resulting in small \(d\times d\) patches of data;
  4. Aggregating these patches to form a larger dataset, then training one decision tree on this dataset.

We repeated this process for a number of trees and estimated the F1 score (see Footnote 2) for every tree. We then compared the results for different sampling methods: uniform sampling, quantum DPP sampling using a simulator, and quantum DPP sampling using a quantum processor.

The IBM quantum computer only allows using RBS gates on adjacent qubits, so we cannot use the circuit described in Section 4.1. Instead, we use two different Clifford loader architectures which only use adjacent-qubit connectivity. The diagonal Clifford loader (Fig. 9) is explained in Kerenidis and Prakash (2022), and the semi-diagonal loader (Fig. 10) is a modification that halves the circuit depth. As an error mitigation measure, we disregarded all results that did not have the expected Hamming weight (d). The results are shown in the violin plots in Figs. 11 and 12.
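The Hamming weight post-selection described above amounts to a simple filter on the measured bitstrings. A minimal sketch, assuming the measurement results are available as a dictionary from bitstrings to counts (the counts below are invented for illustration and are not measured data):

```python
def postselect_hamming_weight(counts, d):
    """Keep only bitstrings with exactly d ones and renormalize to probabilities."""
    kept = {bits: c for bits, c in counts.items() if bits.count("1") == d}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no measurement outcome has the expected Hamming weight")
    return {bits: c / total for bits, c in kept.items()}


# Illustrative counts for a (4, 2) circuit; "1000" and "1110" are noise outcomes.
counts = {"1100": 480, "1010": 300, "0110": 120, "1000": 60, "1110": 40}
probs = postselect_hamming_weight(counts, d=2)
print(probs)
```

This discards shots rather than correcting them, so the fraction of retained shots shrinks as hardware noise grows.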

The results indicate that for small matrix dimensions — up to (6,2) — the IBM quantum processor gives results similar to the ones achieved with the simulator. However, as the dimensions grow bigger, the samples from the quantum DPP circuits lead to worse classifier performance. This highlights the limitations of the available quantum hardware, which is prone to errors.

5 Quantum neural networks for credit risk assessment

5.1 Quantum neural networks with orthogonal and compound layers

In recent years, variational/parameterized quantum circuits (Benedetti et al. 2019) have become very prominent as NISQ-friendly QML techniques. When applied to classification problems, they are commonly known as Variational Quantum Classifiers (VQC) (Havlíček et al. 2018). The quantum circuits associated with VQCs may be schematically thought of as composed of three layers: the feature map \(\mathcal {U}_{\Phi (\textbf{x})}\), which encodes classical data \(\textbf{x}\) into quantum states; the variational layer \(W(\varvec{\theta })\), which is the part of the circuit parameterized by a set of parameters \(\varvec{\theta }\) which are learned in the training process; and finally, the measurement layer, which measures the quantum registers and produces classical information used in training and inference.

Fig. 7 DPP circuit as a series of Clifford loaders

Fig. 8 Quantum hardware-ready procedure for DPP sampling

Fig. 9 Diagonal Clifford loader

The feature map and variational layers can take different forms, called ansätze, consisting of many possible different quantum gates in different configurations. Such immense freedom raises an important question: how should one choose an architecture for a given problem, and can it be expected to yield a quantum advantage? This question is of major practical importance, and although benchmark results have been shown for very particular datasets (Havlíček et al. 2018; Liu et al. 2021), there is little consensus on which ansätze are good choices for machine learning.

In our work, we use quantum neural networks with orthogonal and compound layers. Although these neural networks roughly match the general VQC construction, they produce well-defined linear algebraic operations, which not only makes them much more interpretable but gives us the ability to analyze their complexity and scalability. Because we understand the actions of these layers precisely, we are able to identify instances for which we can design efficient classical simulators, allowing us to classically train and test the models on real-scale datasets.

A standard feed-forward neural network layer modifies an input vector by first multiplying it by a weight matrix and then applying a non-linearity to the result. Feed-forward neural networks usually use many such layers and learn to predict a target variable by optimizing the weight matrices to minimize a loss function. Enforcing the orthogonality of these weight matrices, as proposed in Jia et al. (2019), brings theoretical and practical benefits: it reduces the redundancy in the trained weights and can avoid the age-old problem of vanishing gradients. However, the overhead of typical projection-based methods to enforce orthogonality prevents mainstream adoption.

In Landman et al. (2022), an improved method of constructing orthogonal neural networks using quantum ideas was developed. We describe it below in brief.

5.2 Data loaders

In order to perform a machine learning task with a quantum computer, we need to first load classical data into the quantum circuit.

5.2.1 Unary data loading circuits

Fig. 10 Semi-diagonal Clifford loader

Fig. 11 Decision trees performance using quantum DPP sampling with diagonal Clifford loaders

The first way we will load classical data is an example of amplitude encoding, which means that we load the (normalized) vector elements as the amplitudes of a quantum state. In Johri et al. (2021), three different circuits to load a vector \(\varvec{x} \in \mathbb {R}^d\) using \(d-1\) gates are proposed. The circuits range in depth from O(log(d)) to O(d), with varying qubit connectivity (see Fig. 13). They use the unary amplitude encoding, where a vector \(\varvec{x} = (x_1,\cdots ,x_d)\) is loaded in the quantum state \(|{\varvec{x}}\rangle = \frac{1}{\Vert x\Vert }\sum _{i=1}^d x_i|{e_i}\rangle \), where \(|{e_i}\rangle \) is the quantum state with all qubits in \(|{0}\rangle \) except the \(i^{th}\) qubit in state \(|{1}\rangle \) (e.g., \(|{e_3}\rangle = |{00100000}\rangle \)). The circuit uses RBS gates: a parameterized two-qubit hamming weight-preserving gate implementing the unitary given by Eq. 6:

$$\begin{aligned} RBS(\theta ) = \left( \begin{array}{cccc} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} \cos \theta &{} \sin \theta &{} 0 \\ 0 &{} -\sin \theta &{} \cos \theta &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \end{array} \right) \ . \end{aligned}$$

The parameters \(\theta _i: i \in \{1,...,d-1\}\) of the \(d-1\) RBS gates are classically pre-computed to ensure they encode the correct vector \(|{\textbf{x}}\rangle \).
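As an illustration of this classical pre-computation, the sketch below derives one consistent set of angles for the diagonal loader (a chain of RBS gates on neighbouring qubits, starting from \(|e_1\rangle\)) and checks them by applying the RBS rotations within the unary subspace. The angle convention follows from the RBS matrix given above, but it is an assumption of this sketch and may differ from the convention in Johri et al. (2021); the function names are hypothetical.

```python
import numpy as np


def diagonal_loader_angles(x):
    """Angles theta_1..theta_{d-1} such that the RBS chain loads x / ||x||."""
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)
    d = len(x)
    angles = np.empty(d - 1)
    for i in range(d - 2):
        # Keep amplitude x_i on qubit i, pass the remaining norm to qubit i+1.
        angles[i] = np.arctan2(np.linalg.norm(x[i + 1:]), x[i])
    # The last gate fixes both remaining amplitudes, including their signs.
    angles[d - 2] = np.arctan2(x[d - 1], x[d - 2])
    return angles


def apply_diagonal_loader(angles):
    """Simulate the RBS chain on the unary subspace, starting from |e_1>."""
    d = len(angles) + 1
    v = np.zeros(d)
    v[0] = 1.0  # X gate on the first qubit
    for i, theta in enumerate(angles):
        a, b = v[i], v[i + 1]
        # RBS(theta) on qubits (i, i+1), restricted to span{|e_i>, |e_{i+1}>}.
        v[i] = np.cos(theta) * a - np.sin(theta) * b
        v[i + 1] = np.sin(theta) * a + np.cos(theta) * b
    return v


x = np.array([0.5, -1.0, 2.0, 0.0, -0.3])
angles = diagonal_loader_angles(x)
loaded = apply_diagonal_loader(angles)
print("target :", np.round(x / np.linalg.norm(x), 6))
print("loaded :", np.round(loaded, 6))
```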

Fig. 12 Decision trees performance using quantum DPP sampling with semi-diagonal Clifford loaders

Fig. 13 Three possible unary data loaders for d-dimensional vectors (\(d=8\)). From left to right: the parallel, diagonal, and semi-diagonal circuits have respectively a circuit depth of log(d), d, and d/2. The X gate represents the Pauli X gate, and the vertical lines represent RBS gates with tunable parameters

Fig. 14 Non-unary loaders

5.2.2 RY-loading circuits

We will also use data loading procedures beyond the unary basis. In particular, for a normalized input vector \(\varvec{x} \in \mathbb {R}^d\), we use d qubits, where on each of the qubits, we apply an \(RY(\theta )\) rotation gate where the angle parameter on the \(i^{th}\) qubit is \(\theta _i = 2\pi x_i\), according to Eq. 7. Multiplication with \(2\pi \) allows us to cover the entire range of the \(\sin \) and \(\cos \) functions. This technique loads the data in the entire \(2^d\)-dimensional Hilbert space encompassing all the hamming weights from 0 to d. This loading technique has constant depth independent of d, and we refer to it as the RY loading, whose circuit for \(d=8\) is illustrated in Fig. 14.

$$\begin{aligned} RY(\theta )|{0}\rangle = \cos {\frac{\theta }{2}}|{0}\rangle + \sin {\frac{\theta }{2}}|{1}\rangle \end{aligned}$$
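A short simulation makes the spread over Hamming weights visible. The sketch below constructs the product state produced by the RY loading for a normalized four-dimensional vector and sums the probability mass in each Hamming weight sector. The input vector and the helper name `ry_loaded_state` are illustrative assumptions.

```python
from functools import reduce
import numpy as np


def ry_loaded_state(x):
    """Product state from RY(2*pi*x_i) applied to qubit i of |0...0>."""
    thetas = 2 * np.pi * np.asarray(x, dtype=float)
    qubits = [np.array([np.cos(t / 2), np.sin(t / 2)]) for t in thetas]
    return reduce(np.kron, qubits)


x = np.array([1.0, 2.0, 3.0, 4.0]) / np.sqrt(30.0)  # normalized input, d = 4
state = ry_loaded_state(x)

# Probability mass in each Hamming weight sector k = 0, ..., d.
hamming = np.array([bin(i).count("1") for i in range(len(state))])
mass_per_weight = np.array([np.sum(state[hamming == k] ** 2)
                            for k in range(len(x) + 1)])
print("state dimension:", len(state))
print("mass per Hamming weight:", np.round(mass_per_weight, 4))
```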

5.2.3 H-loading circuits

Lastly, we define a different technique for loading the data in the entire \(2^d\)-dimensional Hilbert space, which loads the vector in the unary basis and then applies a Hadamard gate on each qubit. This operation applies a Fourier transform on \(\mathbb {Z}_2\) and gives us a state encompassing all the hamming weights from 0 to d at no additional cost to the circuit depth. We call this the H-loading, whose circuit for \(d=8\) is illustrated in Fig. 14.

$$\begin{aligned} H|{0}\rangle = \frac{|{0}\rangle +|{1}\rangle }{\sqrt{2}} \hspace{1cm} H|{1}\rangle = \frac{|{0}\rangle -|{1}\rangle }{\sqrt{2}} \end{aligned}$$
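As an illustration of the H-loading, the sketch below computes the state obtained by applying a Hadamard gate to every qubit of a unary-encoded vector. It assumes the unary state \(\sum _i x_i |{e_i}\rangle \), where \(|{e_i}\rangle \) has only qubit i set to 1; the function name and example vector are hypothetical.

```python
import math
from itertools import product

def h_loading_state(x):
    """Amplitudes after Hadamards on every qubit of the unary state sum_i x_i |e_i>."""
    d = len(x)
    scale = 1.0 / math.sqrt(2 ** d)
    state = {}
    for bits in product((0, 1), repeat=d):
        # H maps |1> to (|0> - |1>)/sqrt(2) and |0> to (|0> + |1>)/sqrt(2),
        # so the term from |e_i> picks up a sign -1 exactly when output bit i is 1.
        state[bits] = scale * sum(-xi if bit else xi for xi, bit in zip(x, bits))
    return state

x = [2 / 3, 1 / 3, 2 / 3]  # illustrative normalized vector, d = 3
state = h_loading_state(x)
norm = sum(amp * amp for amp in state.values())
print(len(state), norm)
```

The resulting state has support on all \(2^d\) bitstrings while remaining normalized, since the cross terms between different \(x_i\) cancel when summed over all bitstrings.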

The RY and H loading circuits spread the data over all \(2^n\) computational basis states to allow future RBS-based neural network layers to utilize an exponential space. In contrast, unary loaders spread the data across the n basis states with hamming weight 1, and since RBS gates are hamming weight-preserving (known as match gates (Jozsa and Miyake 2008)), they cannot change this. Their action when using unary loaders is thus restricted to a much smaller space.

Fig. 15 Parameterized quantum circuits for orthogonal and compound layers. Vertical lines represent two-qubit RBS gates, parameterized with independent angles \(\theta \), which are shown as the same in the figures for simplicity

5.3 Quantum orthogonal and compound layers

Quantum orthogonal layers consist of a unary data loader plus a parametrized quantum circuit made of RBS gates, while quantum compound layers consist of a general data loader plus a parametrized quantum circuit made of RBS gates.

RBS gates and circuits preserve the hamming weight of the input state, and thus if we use a unary data loader, then the output of the layer will be another vector in unary amplitude encoding. Similarly, if the loaded quantum state is a superposition of only basis states of hamming weight k, so is the output state. More generally, we can think of such hamming weight-preserving circuits with n qubits as block-diagonal unitaries that act separately on \(n+1\) subspaces, where the \(k^{th}\) subspace is defined by all computational basis states with hamming weight equal to k. The dimension of these subspaces is equal to \(\binom{n}{k}\). The block of this unitary for \(k=1\) is an \({n \times n}\) orthogonal matrix, such that when a vector is loaded in the unary basis, this circuit simply performs orthogonal matrix multiplication. In general, the k-th block of this unitary applies a compound matrix of order k of the \(n \times n\) unary matrix. The dimension of this k-th order compound matrix is \(\binom{n}{k} \times \binom{n}{k}\). We refer to the layers that use bases beyond the unary as compound layers.
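To make the notion of a compound matrix concrete, the sketch below builds the order-k compound of a small orthogonal matrix as the matrix of all its \(k \times k\) minors, and checks that it has dimension \(\binom{n}{k}\) and is itself orthogonal. This uses the standard textbook definition; the helper names and the example matrix are illustrative, and the subset ordering (lexicographic) is an assumption.

```python
import math
from itertools import combinations

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if not m:
        return 1.0  # determinant of the empty 0 x 0 matrix
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def compound(a, k):
    """k-th compound matrix: all k x k minors of a, indexed by sorted k-subsets."""
    subsets = list(combinations(range(len(a)), k))
    return [[det([[a[i][j] for j in cols] for i in rows]) for cols in subsets]
            for rows in subsets]

def rotation(n, i, j, theta):
    """n x n plane rotation acting on coordinates i and j."""
    g = [[float(r == c) for c in range(n)] for r in range(n)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j], g[j][i] = math.sin(theta), -math.sin(theta)
    return g

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def max_orthogonality_error(m):
    """Largest absolute entry of m m^T - I."""
    mt = [list(col) for col in zip(*m)]
    prod = matmul(m, mt)
    return max(abs(prod[r][c] - (r == c)) for r in range(len(m)) for c in range(len(m)))

n = 4
a = matmul(rotation(n, 0, 1, 0.3), matmul(rotation(n, 1, 2, 1.1), rotation(n, 2, 3, -0.7)))
c2 = compound(a, 2)
print(len(c2), math.comb(n, 2), max_orthogonality_error(c2))
```

The order-1 compound is the matrix itself, and the order-n compound is the \(1 \times 1\) matrix holding its determinant, which matches the block structure described above.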

There exist many possibilities for building a parametrized quantum circuit made of RBS gates which can be used in a quantum orthogonal or compound layer, each with different properties.

The Pyramid circuit (Fig. 15), proposed in Landman et al. (2022), is a parameterized quantum circuit composed of exactly \(n(n-1)/2\) RBS gates. This circuit requires only adjacent-qubit connectivity, which makes it suitable for most superconducting qubit hardware. In addition, when restricted to the unary basis, the pyramid circuit expresses exactly the Special Orthogonal Group, i.e., orthogonal matrices with the determinant equal to \(+1\). To allow this circuit to express the entire orthogonal group, we can add a final Z gate on the last qubit. This allows us to express orthogonal matrices with a \(-1\) determinant as well. The pyramid circuit is, therefore, very general and covers all the possible orthogonal matrices of size \(n \times n\).

The \(\textbf{X}\) circuit (Fig. 15), introduced in Cherrat et al. (2022), uses just O(n) gates and has nearest-neighbor connectivity. Due to reduced depth and gate complexity, it accumulates less hardware noise.

The Butterfly circuit (Fig. 15) is inspired by the classical fast Fourier transform algorithm, and uses O(n log(n)) gates. It was also introduced in Cherrat et al. (2022), and despite having reduced expressivity compared to the Pyramid circuit, it often performs just as well.
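The determinant property of the Pyramid circuit in the unary basis can be illustrated with a short sketch. It assumes that an RBS gate on adjacent qubits acts on the unary basis as a plane rotation (the sign convention is an assumption) and uses one possible triangular ordering of the \(n(n-1)/2\) gates, which may differ from the layout in Fig. 15. All function names are hypothetical.

```python
import math

def identity(n):
    return [[float(r == c) for c in range(n)] for r in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def rbs_unary(n, i, theta):
    """RBS gate on qubits (i, i+1) restricted to the n unary basis states.
    Assumed convention: the plane rotation [[cos, sin], [-sin, cos]]."""
    g = identity(n)
    g[i][i] = g[i + 1][i + 1] = math.cos(theta)
    g[i][i + 1], g[i + 1][i] = math.sin(theta), -math.sin(theta)
    return g

def pyramid_unary(n, thetas):
    """Unary-basis matrix of a triangular arrangement of nearest-neighbour RBS gates."""
    positions = [i for layer in range(n - 1) for i in range(n - 1 - layer)]
    if len(thetas) != len(positions):
        raise ValueError("expected n(n-1)/2 angles")
    w = identity(n)
    for i, theta in zip(positions, thetas):
        w = matmul(rbs_unary(n, i, theta), w)
    return w

n = 4
thetas = [0.1 * (k + 1) for k in range(n * (n - 1) // 2)]  # arbitrary example angles
w = pyramid_unary(n, thetas)
# A final Z gate on the last qubit negates the last unary basis state, i.e. the last row.
w_with_z = w[:-1] + [[-v for v in w[-1]]]
print(len(thetas), det(w), det(w_with_z))
```

The product of plane rotations has determinant \(+1\), and the final sign flip moves it to \(-1\), consistent with the statement that the extra Z gate extends the circuit from the Special Orthogonal Group to the full orthogonal group.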

In Landman et al. (2022), a method is proposed to train orthogonal layers for the unary basis by computing the gradient of each parameter \(\theta _i\) using backpropagation. This backpropagation method for the pyramid circuit (which is the same for any circuit with RBS gates) takes time \(O(n^2)\), corresponding to the number of gates, and provides a polynomial improvement in runtime compared to the previously known orthogonal neural network training algorithms which relied on an \(O(n^3)\) SVD operation (Jia et al. 2019). Since the runtime corresponds to the number of gates, it is lower for the butterfly and X circuits. See Table 3 for full details on the comparison between the three types of circuits. For the compound layers, we need to consider the entire \(2^n \times 2^n\) space and thus train an exponential size weight matrix, which takes exponential time on a classical computer. In principle, a compound layer can also be trained using the parameter shift rule for quantum circuits, which can be more efficient since the number of parameters is polynomial in the input size, though noise in current quantum hardware makes this impractical for the time being.

5.4 Expectation-per-subspace compound layer

We describe here a compound layer that we call the Expectation-per-subspace compound layer. This layer involves loading the input vector using a non-unary basis, which could be done either via the RY-loading or the H-loading circuit as previously defined. Then, we apply a parameterized quantum circuit with RBS gates, e.g., a pyramid circuit, which performs the compound matrix operation on all the fixed hamming weight subspaces. More precisely, we can think of the operation as performing the matrix-vector multiplication of an \(\binom{n}{k} \times \binom{n}{k}\) matrix with an \(\binom{n}{k}\)-dimensional vector for each hamming weight k from 0 to n. Note that for 0 and n, the dimension is 1, and hence the unitary acts as identity.

Table 3 Comparison of different parameterized quantum circuits for orthogonal and compound layers with n qubits

Fig. 16 Expectation-per-subspace compound layer. In the final step, we combine the \(\binom{n}{0}\) and the \(\binom{n}{n}\) subspaces and calculate their overall expectation

If we look at the output quantum state, it defines a distribution over a domain of size \(2^n\). Given the exponential size of the distribution, it is not advisable to try and train the entire distribution, since that would take exponential time. However, one can still try to use a loss function that contains some information about the distribution. For example, one can use the expectation of the distribution, which is what normally happens in variational quantum algorithms where one approximates this expectation by using a number of measurement outcomes. Given the fact that our unitary is block-diagonal, one can try to define a more complex loss function that contains more information about the distribution. In particular, one can split the domain of the distribution into \(n+1\) subdomains, one for each subspace, and then train on all these expectations.

This is what we do in the Expectation-per-subspace compound layer, where for each k from 0 to n, we take the outputs corresponding to the hamming weight k strings and sort them. Now, for each k, we assign values which are equally spaced between two bounds a and b (which are 0 and 10, in our models) to the \(\binom{n}{k}\) strings. We normalize the outputs using the L1-norm to correspond to a probability distribution over the \(\binom{n}{k}\) values between a and b, and then we calculate the expectation value for that hamming weight. This gives us a set of \(n+1\) values corresponding to each hamming weight. Since for hamming weight 0 and n the dimension of the subspace is 1 (the all-zero and all-one strings), we combine them and calculate the expectation for these two together and make the layer have n outputs. The entire operation is illustrated in Fig. 16.
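The post-processing described above can be sketched in a few lines of plain Python. This is one possible reading of the procedure, under two assumptions that are not fixed by the text: the strings within each subspace are ordered lexicographically, and a subspace that receives no probability mass falls back to the midpoint of \([a, b]\). The function name is hypothetical.

```python
from itertools import product

def expectation_per_subspace(probs, n, a=0.0, b=10.0):
    """Return n expectations, one per hamming-weight subspace (weights 0 and n merged).

    probs maps n-bit tuples to measurement probabilities; missing strings count as 0.
    """
    groups = [[] for _ in range(n)]
    for bits in product((0, 1), repeat=n):   # lexicographic order (an assumption)
        k = sum(bits) % n                    # weight n is folded onto weight 0
        groups[k].append(probs.get(bits, 0.0))
    outputs = []
    for weights in groups:
        m = len(weights)
        values = [a + (b - a) * i / (m - 1) for i in range(m)]  # equally spaced in [a, b]
        total = sum(weights)                 # L1 norm of the subspace outputs
        if total == 0.0:
            outputs.append((a + b) / 2)      # arbitrary fallback for an empty subspace
        else:
            outputs.append(sum(v * p for v, p in zip(values, weights)) / total)
    return outputs

uniform = {bits: 0.25 for bits in product((0, 1), repeat=2)}
print(expectation_per_subspace(uniform, 2))
```

Every group contains at least two strings (the merged weight-0 and weight-n group has exactly two), so the equally spaced grid is always well defined.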

While these compound layers do increase classical simulation complexity, they do not increase quantum complexity. The advent of better quantum hardware will allow us to test larger compound layers that explore much larger portions of the Hilbert space.

6 QNN results with classical simulation

In this second study, we focus on the problem of credit-default prediction associated with credit applications from Small and Medium Enterprises (SMEs). We report the results of our neural network architectures on this use case and on public datasets.

6.1 Use case introduction: credit risk prediction

The credit operation is one of the largest and most important operations in a retail banking institution. Throughout the credit journey (life-cycle) of a customer within the bank, several different models are used at different points of the journey and for different purposes, such as the determination of interest rates, offering of different products, etc.

The credit granting model is a particularly important one since it determines whether or not a credit relationship will be established. It is also particularly challenging in the case of SMEs, where the relationship with the bank often starts only when the SME submits an application for credit, so very little data is available.

Given these challenges, we propose the use of quantum techniques aiming at improving the predictive performance of the credit granting model.

The credit granting decision may be seen as a binary classification problem, in which the objective is to predict if the SME will default on credit. More specifically, we are interested in calculating the so-called probability of default (PD), which is given by \(P(\hat{y}=1|\varvec{x})\). The PD information is used internally for other pipelines primarily concerned with the determination of credit ratings for the SMEs (though in this study, we focus solely on the PD model), so the PD distribution is the main output of interest from the model. For this reason, we do not threshold the probability outputs from the model and instead use threshold-independent classification metrics to evaluate its predictive performance. Namely, the main Key Performance Indicator (KPI) that we use is the Gini score, constructed from the Area Under the Curve (AUC) of the ROC (Receiver Operating Characteristic) curve as \(\text {Gini} = 2 \times \text {AUC} - 1\). The Gini score is easily interpreted by the business team and allows for a holistic estimation of the model’s impact.
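For concreteness, the Gini score can be computed from predicted probabilities as in the sketch below. The AUC is evaluated through its pairwise-ranking interpretation (the fraction of defaulter/non-defaulter pairs that are ranked correctly, with ties counted as one half), which is adequate for small examples; the function names and the toy data are illustrative only.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive is scored above a negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini = 2 * AUC - 1."""
    return 2.0 * roc_auc(y_true, scores) - 1.0

y = [0, 0, 1, 1]                # toy labels: 1 = default
pd_hat = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities of default
print(roc_auc(y, pd_hat), gini(y, pd_hat))
```

A model that ranks at random has an AUC of 0.5 and therefore a Gini of 0, while a perfect ranking gives a Gini of 1.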

In this study, we chose to focus on the development of an “internal model” of credit default, which only uses features collected by Itaú, without considering any external information (from credit bureaus, for instance). The dataset used consists of \(\approx 141,500\) observations, each one represented by 32 features: 31 numerical and 1 categorical. Each observation represents a given SME customer in a specified reference month, whose observed target indicates its default behavior, and whose features consist of internal information about the company. The data was anonymized, standardized, and split into training and test sets based on the time period: the training set consists of \(\approx 74,700\) observations covering 12 months of data, while in the test set, we have \(\approx 66,800\) observations covering the subsequent 8 months.

6.2 Neural network architectures for credit risk

To compare the performance of the orthogonal and compound layers to the classical baseline, we designed three neural network architectures. Each architecture had three layers: an encoding layer, an experimental layer, and a classification head. The encoding layer was a standard linear layer of size \(32 \times 8\) followed by a \(\tanh \) activation. Its purpose is to bring the dimension of the features down to 8, which is a reasonable simulation size for both proposed quantum layers. The second layer was the experimental layer of size \(8 \times 8\) (described below). Finally, the third layer, the classification head, was a linear layer of size \(8 \times 2\) followed by softmax to predict the probabilities.

The first quantum neural network architecture, named OrthoResNN, uses an \(8 \times 8\) orthogonal experimental layer implemented with a semi-diagonal loader and X circuit. Note that the final output of the layer is provided by measurements. We add a skip connection by adding the input of the orthogonal layer to the output (see Footnote 3). The layer is followed by a \(\tanh \) activation function. This architecture is illustrated in Fig. 17.

Fig. 17 Architecture of the OrthoResNN model

Our second architecture, ExpResNN, replaces the experimental layer with an \(8 \times 8\) Expectation-per-subspace compound layer. We use the H-loader to encode our data. The layer is again followed by a \(\tanh \) activation function. Figure 18 illustrates the ExpResNN architecture.

Fig. 18 Architecture of the ExpResNN model using the H-loader

Finally, the classical architecture, ResNN, uses an \(8 \times 8\) linear residual layer followed by \(\tanh \).

6.3 Methods and training

The training of the networks was performed using the JAX package by Google. We train our models for 500 epochs. To identify suitable hyperparameters, we performed a search over learning rate, learning-rate-halving points, and batch size. The hyperparameter search was performed with the ray-tune library.

The dataset contains a large number of missing values, which motivated experimenting with different imputation techniques such as zero-filling, round-robin imputation (implemented in the Python package scikit-learn), and MICE (van Buuren and Groothuis-Oudshoorn 2011). The best results were achieved with round-robin imputation using scikit-learn’s IterativeImputer with Bayesian ridge regression. This was the pre-processing employed in all the results on the SME dataset.

6.4 Results

In our experimental setup, we consider the fully connected residual layer (ResNN) as the classical benchmark. We performed the same experiment with an orthogonal layer using the semi-diagonal loader and the X circuit (OrthoResNN). Finally, we tried the expectation-per-subspace compound layer with the Hadamard loader and X circuit (ExpResNN). While the performance of the OrthoResNN and ExpResNN remained nearly the same as that of the fully connected ResNN, these new layers learn the angles of 2n RBS gates instead of \(n^2\) elements of the weight matrix, dramatically reducing the number of parameters needed. The results are shown in Table 4.

The results show that quantum orthogonal and compound layers can preserve the performance of fully connected layers on this dataset while using a fraction of the trainable parameters. We note that the ExpResNN did not show advantages over the OrthoResNN.

6.5 Further benchmarks

We compared the orthogonal and linear layers on public classification datasets from OpenML as we did for the Random Forests (Section 3.6). Again, we used a three-layer architecture: a linear encoding layer to 16 dimensions with GeLU activation, a \(16 \times 16\) experiment layer with tanh activation, and a binary classification head. These architectures did not have residual connections.

We compared two models: OrthoFNN and FNN. OrthoFNN used a \(16 \times 16\) pyramid circuit as the experiment layer, for a total of 120 trainable parameters. FNN used a feed-forward linear layer with 256 trainable parameters.
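As a quick check of these parameter counts, assuming the pyramid layout with \(n(n-1)/2\) RBS gates described in Section 5.3 and a bias-free \(n \times n\) linear layer:

$$\begin{aligned} \frac{n(n-1)}{2}\bigg |_{n=16} = \frac{16 \cdot 15}{2} = 120, \qquad n^2\big |_{n=16} = 16^2 = 256 \end{aligned}$$

so the orthogonal experiment layer trains a little under half as many parameters as the linear one (\(120/256 \approx 0.47\)).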

For all the datasets and models, we use a batch size of 128 and a learning rate of \(10^{-4}\). Each model is trained for 500 epochs and evaluated with the ROC-AUC metric. The results are summarized in Table 5.

Table 4 Comparison between different architectures
Table 5 Comparison of OrthoFNN and FNN for different datasets. Superior results in bold

7 QNN results on quantum hardware

7.1 Implementation of quantum circuits

Using a classical computer, inference using an orthogonal layer takes time \(O(n^2)\), while for a general compound layer, this time is exponential in n. Using a quantum computer, inference with an orthogonal or compound layer uses a quantum circuit that has depth O(n) (Pyramid or X) or \(O(\log (n))\) (Butterfly), and at most \(O(n^2)\) gates. Therefore, one may find a further advantage if the inference is performed on a quantum computer. This motivated us to test the inference step for classically trained OrthoResNN and ExpFNN models (ExpResNN from the classical experiments without a residual connection) on currently available quantum hardware.

The data loader and orthogonal/compound layer circuits employed in our model architectures are NISQ-friendly and particularly suitable for superconducting qubits, with low depth and nearest-neighbors qubit connectivity. Thus, we chose to use IBM’s 27-qubit machine ibm_hanoi (see Fig. 19).

Fig. 19 Topology graph of the 27-qubit ibm_hanoi machine used to perform our hardware experiments. Qubit colors indicate the readout assignment error and connection colors indicate the CNOT error (dark blue is low, purple is high)

To perform inference on ibm_hanoi, we used the semi-diagonal data loader and X circuit to implement the OrthoResNN model, and the Hadamard loader and X circuit for the ExpFNN model, following the architectures described in Section 6. Both neural networks were trained classically, and the trained parameters were used to construct our quantum circuits for inference.

Given the large size of the test dataset (66,750 data points), we decided to perform inference using the trained models on a small test subsample of 300 test points, corresponding to the maximum number of circuits we could send in one job to the IBM machine. After testing different subsamples with the OrthoResNN model (see Footnote 4), we selected one for which we achieved a subsample test Gini score of 54.19% using a noiseless simulator (blue ROC curve in Fig. 20). The same was done for the ExpFNN experiment, yielding a subsample test Gini of 53.90% with the noiseless simulator. These values were taken as the best possible Gini scores if the inference was performed on noiseless quantum hardware, which could then be compared with the values actually achieved with ibm_hanoi.

Fig. 20 ROC curves with Gini score for the ideal simulation, hardware execution, and the error-mitigated hardware execution

The circuits were then run on the quantum processor. Due to its limited Hilbert space of size n, the OrthoResNN has a natural error-rejection procedure: any measurements outside of the unary basis can be disregarded as errors. With this procedure, the inference yielded a Gini score of \(50.19\%\), as shown in the orange ROC curve in Fig. 20. The achieved Gini was not too far from the noise-free simulation result (54.19%), but there was clearly room for improvement to close the 4 pp difference. We also attempted inference with the more complex ExpFNN, which yielded a Gini score of 40.20%, much farther from the noiseless simulation Gini of 53.90%. Since the ExpFNN uses the entire \(2^n\)-dimensional Hilbert space, it is more prone to errors due to noise, as the error-rejection procedure used for the OrthoResNN cannot be employed.
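The error-rejection step for unary circuits amounts to post-selecting on hamming weight 1. A minimal sketch is given below, assuming that measurement results arrive as a dictionary from bitstrings to shot counts; the function name and the example counts are hypothetical and do not reproduce the actual hardware data.

```python
def postselect_unary(counts):
    """Keep only hamming-weight-1 outcomes and renormalize.

    Returns the post-selected distribution and the fraction of shots that were kept.
    """
    kept = {bits: c for bits, c in counts.items() if bits.count("1") == 1}
    kept_shots = sum(kept.values())
    if kept_shots == 0:
        raise ValueError("no unary outcomes were measured")
    probs = {bits: c / kept_shots for bits, c in kept.items()}
    return probs, kept_shots / sum(counts.values())

# Made-up counts for a 4-qubit unary circuit; "1100" and "0000" can only arise from errors.
counts = {"1000": 40, "0100": 30, "0010": 20, "0001": 10, "1100": 15, "0000": 10}
probs, acceptance = postselect_unary(counts)
print(probs, acceptance)
```

The acceptance fraction gives a rough indication of how much of the sampling budget is lost to detectable errors; errors that map one unary string to another are not caught by this check.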

7.2 Improving the hardware results with error mitigation techniques

Error mitigation and error suppression techniques undoubtedly play a very important role in NISQ-era quantum computing. While these techniques alone may not be sufficient to fully overcome the imperfections of current quantum systems, they can push the practical limits of what can be achieved. As a next step, for the OrthoResNN model, we experimented with various error mitigation and suppression approaches, going beyond the simple hamming weight postselection procedure, in an attempt to close the gap of 4 pp between the Gini score from the noiseless simulation and the one from hardware execution.

The first approach that we tried was a correlated readout mitigator. This is a purely classical post-processing technique which demands the construction of a calibration circuit for each one of the possible \(2^N\) states of the full N-qubit Hilbert space. The calibration circuits’ execution (simulated using ibm_hanoi’s backend information, in our case) yields a \(2^N \times 2^N\) assignment matrix, which is used to understand how errors might occur during readout. One can see that this method rapidly becomes intractable as the number of qubits N increases. In our case, for \(N=8\), the Gini score improved to 50.24%, a small improvement of only 0.05 pp.

Thus, in order to investigate the effect of more robust error suppression and mitigation techniques in our results, we moved on to a new round of hardware experiments, performing the inference by executing the exact same OrthoResNN circuits via the Qiskit Runtime (contributors 2021) service using the Sampler primitive, which allows one to use circuit optimization as well as error suppression and mitigation techniques, as detailed below.

Firstly, we used circuit optimization at the point of circuit transpilation and compilation by setting the optimization_level parameter to the highest possible value, 3. This performs the following circuit optimization routines: layout selection and routing (VF2 layout pass and SABRE layout search heuristics (Li et al. 2018)); 1-qubit gate optimization (chains of single-qubit u1, u2, u3 gates are combined into a single gate); commutative cancellation (cancelling of redundant self-adjoint gates); and 2-qubit KAK optimization (decomposition of 2-qubit unitaries into the minimal number of uses of 2-qubit basis gates).

Secondly, we used the Dynamical Decoupling error suppression technique (Viola and Lloyd 1998; Ezzell et al. 2022). This technique works as a pulse schedule by inserting a DD pulse sequence into periods of time in which qubits are idle. The DD pulses effectively behave as an identity gate, thus not altering the logical action of the circuit, but having the effect of mitigating decoherence in the idle periods, reducing the impact of errors.

Thirdly, we used the M3 (Matrix-free Measurement Mitigation) error mitigation technique (Nation et al. 2021) by setting the Sampler resilience_level parameter to 1 (the only option available for the Sampler primitive). This provides mitigated quasi-probability distributions after the measurement. M3 works in a reduced subspace defined by the noisy input bitstrings that are to be corrected, which is often much smaller than the full N-qubit Hilbert space. For this reason, this method is much more efficient than the matrix-based readout mitigator technique mentioned above. M3 provides a matrix-free preconditioned iterative solution method, which removes the need to construct the full reduced assignment matrix and instead computes individual matrix elements, using orders of magnitude less memory than direct factorization.

By employing these three techniques, we were able to achieve a Gini score of 53.68% for the OrthoResNN (Fig. 20). This is a 3.49 pp improvement from the initial 50.19% Gini of the unmitigated run, falling only 0.53 pp behind the ideal noiseless execution (54.21% Gini). This remarkable result underscores the NISQ-friendliness of the orthogonal layer and highlights the importance of error suppression and mitigation techniques in the NISQ era.

It is important to note that circuit optimization, error suppression, and mitigation techniques typically result in some classical/quantum pre/post-processing overhead to the overall circuit runtime. Some of these techniques are based on heuristics and/or do not have efficient scaling at larger circuit sizes. It is important to balance the desired levels of optimization and resilience with the required time for the full execution, especially as the circuit sizes increase.

8 Conclusion

In this work, we have explored the potential of quantum machine learning methods in improving forecasting in finance, with a focus on two specific use cases within the Itaú business: churn prediction and credit risk assessment. Our results demonstrate that the proposed algorithms, which leverage quantum ideas, can effectively enhance the performance of Random Forest and neural network models, achieving better accuracy and training with fewer parameters.

In the present day, quantum hardware is not powerful enough to provide real improvements or conclusive large-scale benchmarks. Performance enhancements can be achieved today by turning these quantum ideas into classical ML solutions run on GPUs. However, with the advent of better quantum hardware, we expect these methods to run faster and produce even better results when run on quantum computers.

The general nature of the proposed methods makes them applicable to other use cases in finance and beyond, although they must be tuned to specific datasets and tasks. We hope this work inspires confidence that QML research holds promise both for today as well as for the coming era of scaled, fault-tolerant quantum hardware.