1 Introduction

In the context of quantum technologies, no quantum device can be considered an isolated (ideal) quantum system. For this reason, the acronym Noisy Intermediate-Scale Quantum (NISQ) technology has been recently introduced (Preskill 2018) to identify the class of early devices in which noise in quantum gates dramatically limits the size of circuits and algorithms that can be reliably performed (Deutsch 2020; Bharti et al. 2022). As early quantum devices become more widespread, a question naturally arises: does the signature left by inner noise processes in a generic quantum device exhibit, at the experimental level, universal features, or is it characteristic of the specific quantum platform? Moreover, one may wonder whether such a noise signature has a time-dependent profile or can be effectively considered stable, in the sense of constant over time, while the device is operating.

The answers to these questions are expected to be crucial in defining a proper strategy to mitigate the influence of noise and systematic errors (Degen et al. 2017; Szańkowski et al. 2017; Do et al. 2019; Müller et al. 2020; Wise et al. 2021), possibly going beyond standard quantum sensing techniques (Cole and Hollenberg 2009; Bylander et al. 2011; Álvarez and Suter 2011; Yuge et al. 2011; Paz-Silva and Viola 2014; Norris et al. 2016) and overcoming current limitations on probe dimension and resolution (Cole and Hollenberg 2009; Bylander et al. 2011; Frey et al. 2017; Müller et al. 2018; Hernández-Gómez et al. 2018; Hernández-Gómez and Fabbri 2021). This becomes even more important if one proves that noise signatures are peculiar to the single device, with the consequence that attenuating noise effects may be harder than expected. Indeed, each quantum technology platform, ranging from superconducting circuits (Devoret et al. 2004; Clarke and Wilhelm 2008) to trapped-ion quantum computers (Wineland et al. 2003), photonic chips (Spring et al. 2013; Metcalf et al. 2014) and topological qubits (Freedman et al. 2003), could need ad hoc solutions that are usually expensive and incompatible from one device to another. In addition, if the noise properties of a quantum device happen to be time-dependent, the system necessarily requires continuous calibration, thus hindering not only the available runtimes, but also the accessibility to external users and the replicability of the experiments performed on it. Furthermore, if the noise fingerprint of the considered device can be easily discerned and remains unchanged over time, one could identify from which specific quantum device certain data were generated just by looking at the noise fingerprint. However, this aspect might create problems, in principle, for possible future uses of the device in privacy-sensitive applications.

In this paper, we aim to shed light on the previously discussed aspects by providing a powerful tool, based on Machine Learning (ML) techniques, for the classification of noise fingerprints in quantum devices with the same technical specifications but physically placed in different environmental conditions. ML (Bishop 2006; Hastie et al. 2009) — originally introduced in the classical domain to learn from data, identify distinctive patterns, and then make decisions with minimal human intervention — has already proven useful to characterize open quantum dynamics (Youssry et al. 2020; Luchnikov et al. 2020; Fanchini et al. 2021) and to carry out quantum sensing tasks (Niu et al. 2019; Harper et al. 2020; Martina et al. 2021; Wise et al. 2021), as for example the learning and classification of non-Markovian noise (Niu et al. 2019; Martina et al. 2021) or the detection of qubit correlations (Harper et al. 2020).

Here, we first design a quantum circuit that operates on 4 qubits, whose state evolves in the space spanned by the basis \(\{\lvert 0000\rangle ,\lvert 0001\rangle ,\dots ,\) \(\lvert 1110\rangle ,\lvert 1111\rangle \}\) composed of 16 states. The designed quantum circuit is measured (by locally applying the Z Pauli operators on some qubits of the circuit) at 9 distinct stages that, from now on, we denote as measurement steps. The routine that records the outcomes at each measurement step is denoted as an execution. Moreover, the repetition of a given number of executions is called a run. Employing the open-access quantum computers offered by the IBM Quantum Experience, we experimentally classify a set of quantum devices by executing the same testbed circuit on all of them. The classification is enabled by the presence of a peculiar noise fingerprint associated with each quantum machine. In more detail, the ML models are trained by taking as input the distributions of the outcomes recorded at the 9 measurement steps of the testbed circuit. As shown in the next sections, the classification is successfully achieved with a test accuracy greater than 99%, both on diverse IBM machines and on single devices at different times from one execution to another. Indeed, from our experiments we can observe that the noise fingerprint of each tested quantum device also has a clear time dependence, meaning that executions of a quantum circuit, implemented at different times, can be associated with distinctive main traits.

This experimental evidence leads us to the conclusion that different IBM quantum devices exhibit distinctive, and thus distinguishable, noise fingerprints that can be characterized and predicted by ML methods. Therefore, the proposed solution might be pivotal to certify the time scheduling and the specific machine on which a given quantum computation is executed. Moreover, learning the noise fingerprint of the quantum device under analysis could play a key role both for diagnostic purposes — especially in all those contexts where logic quantum operations cannot be error-corrected (Deutsch 2020) — and to accomplish benchmarking and certification (Eisert et al. 2020) of quantum noise sources within a pre-established error threshold.

2 Experimental platform

For our experiments we employ the IBM Quantum cloud services to remotely run quantum circuits on several machines. To interact with the remote services, we use the Qiskit Software Development Kit (SDK) (Aleksandrowicz et al. 2019), an open-source Python SDK useful both to simulate quantum dynamics (with or without noise) and to program a given set of operations on a real quantum device. Overall, we have at our disposal up to 11 superconducting quantum computers, ranging from a single qubit up to 15 qubits, with different topologies and calibration routines. For all the available devices and their characteristics, we direct the reader to the IBM documentation.

The accessibility and availability of the IBM devices allow us to carry out real experiments with the flexibility of either taking many samples in a short amount of time, or collecting samples from the same circuit at longer time intervals. As shown below, both these aspects are properly exploited in carrying out our experiments. Moreover, one can also run the exact same circuit not only on a single device but on multiple machines, thus enabling the creation of complete datasets of quantum experiments to be fed into ML algorithms. Regarding the generation of our datasets, we refer the reader to the source code at the address provided at the end of the manuscript.

Overall, several experiments (explained in detail later) have been conducted on different IBM chips (specifically, ‘Yorktown’, ‘Athens’, ‘Bogota’, ‘Casablanca’, ‘Lima’, ‘Quito’, ‘Santiago’, ‘Belem’, and ‘Rome’). The chips differ in two main aspects. The first is the architecture (or connectivity) of the qubits, which ranges from a simple line topology to a ladder or a star topology. The second important difference is the so-called quantum volume (Cross et al. 2019) (8, 16, 32 for the machines used in our experiments), which quantifies the maximum dimension of a circuit that can be effectively executed, and is also correlated with the noise affecting each device. Indeed, some quantum machines are inherently noisier than others, and even single qubits inside a machine can have a distinctive noise profile. All these peculiar differences in noise and topology constitute the fingerprint that we aim to exploit with our method.

Before proceeding, it is worth stressing that, although the proposed experiments are carried out on gate-based superconducting devices, the approach adopted here is in principle valid for a larger class of NISQ devices, even non-circuit-based ones.

3 Testbed quantum circuit

To learn the noise fingerprint of IBM quantum devices, we design a quantum system whose evolution can be decomposed over the 16 states \(\lvert 0000\rangle ,\lvert 0001\rangle ,\dots ,\lvert 1110\rangle ,\lvert 1111\rangle \) according to the quantum circuit in Fig. 1. Notice that, for our purposes, a few qubits suffice for the testbed circuit; however, this does not imply that the proposed solution cannot be applied to circuits of generic dimension.

Fig. 1

On the left, circuit implementation of the quantum dynamics employed as a testbed. The quantum circuit, which involves 4 qubits, is repeated more than once, and 2 of the 4 qubits are measured at regular steps. In our experiments the circuit is repeated 3 times and the measurement steps are placed after each CNOT and Toffoli gate. The outcome probabilities obtained by our measurements, which together form the datasets used to train, validate and test the ML models, are fed into a Support Vector Machine (SVM) — schematically represented on the right — in order to be classified

After repeating the quantum circuit a certain number of times (3 times in our experiments), the outcomes of the measurements are recorded and then used to create the dataset for the training of ML classifiers. The aim of the latter is to discriminate the noise fingerprint of different quantum machines, as pictorially shown in Fig. 1. The details about the classifiers will be given in the next section, while here we focus on the implementation of the quantum circuit.

The quantum circuit is initialized in the state \(\lvert 0000\rangle \) and the resulting computations are performed by the action of local operations and of controlled-NOT (CNOT) and Toffoli gates (denoted in Fig. 1 by light blue and light purple circles, respectively, with the symbol ‘plus’ inside). We recall that the CNOT is a two-qubit quantum operation, commonly used to entangle/disentangle Bell states, that flips the second qubit when the first qubit is in \(\lvert 1\rangle \). Instead, the Toffoli gate is a universal ‘controlled-controlled-not’ (3-qubit) operation where a third qubit is flipped when two control qubits are both in \(\lvert 1\rangle \). In our circuit in Fig. 1, two qubits (i.e. q3 and q2 in the figure) are used to get information on the quantum system, providing at each measurement the pair of bits (0,0), (0,1), (1,0), (1,1), where the first and second bits correspond, respectively, to the outcomes measured on q3 and q2. Conversely, qubits q0 and q1 are employed as ancilla qubits. Then, this quantum circuit is repeated 3 times, with the aim of collecting data on the quantum dynamics of each IBM device. As already mentioned in the Introduction, the resulting quantum circuit (given by repeating 3 times the circuit in Fig. 1) is locally measured at 9 distinct stages (corresponding to the measurement steps) thanks to the simultaneous application of Z Pauli operators σz on the qubits q3 and q2, from which the measurement outcomes are collected. Specifically, the 9 measurements are performed right after the implementation of each CNOT and Toffoli gate in the full quantum circuit. It is worth noting that the procedure we are proposing is not based on repeated measurements as in a quantum monitoring protocol or in quantum Zeno dynamics (Fischer et al. 2001; Schäfer et al. 2014; Gherardini et al. 2017; Virzí et al. 2021), since, each time a measurement is performed at a given measurement step (say the k th, with k = 1,…,9), the whole testbed quantum circuit is regenerated and then (locally) measured at the subsequent step, i.e. the (k + 1)-th.
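To make the construction concrete, the following minimal Qiskit sketch builds the testbed circuit truncated at a chosen measurement step. The gate placements are inferred from Fig. 1 and from the step-by-step description given in Section 5, so the snippet should be read as an illustration rather than the exact circuit submitted to the hardware (where transpilation may further modify the layout).

```python
from qiskit import QuantumCircuit

def testbed_circuit(n_step: int) -> QuantumCircuit:
    """Sketch of the testbed circuit truncated at the n_step-th measurement
    step (1 <= n_step <= 9). Each of the 3 repetitions of the block in
    Fig. 1 contributes 3 steps: after the CNOT linking q0 and q2, after
    the CNOT from q1 to q3, and after the Toffoli gate acting on q2."""
    qc = QuantumCircuit(4, 2)
    step = 0
    for _ in range(3):          # the block of Fig. 1 is repeated 3 times
        qc.h(0)                 # Hadamard gates on the ancillas q0, q1
        qc.h(1)
        qc.cx(0, 2)             # CNOT linking q0 and q2
        step += 1
        if step == n_step:
            break
        qc.cx(1, 3)             # CNOT from q1 to q3
        step += 1
        if step == n_step:
            break
        qc.x(0)                 # NOT gates on q0 and q1
        qc.x(1)
        qc.ccx(0, 1, 2)         # Toffoli gate controlled by q0 and q1
        step += 1
        if step == n_step:
            break
    qc.measure(2, 0)            # local Z measurement of q2
    qc.measure(3, 1)            # local Z measurement of q3
    return qc
```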

In a single repetition, the quantum circuit is initialized in \(\lvert 0000\rangle \), which corresponds to the measurement outcome (0,0), and then two Hadamard gates (blue squares ‘H’ in Fig. 1) are applied to both q0 and q1. Thus, since the two CNOT gates are conditioned on q0 and q1 respectively, the probability to get 1 or 0 in q2 and q3 after the CNOTs is 0.5. In this way, after the Pauli-X rotation (green squares ‘X’ in Fig. 1) and the Toffoli gate, the system is in the state \(\frac {1}{2}(\lvert 0110\rangle +\lvert 0111\rangle +\lvert 1001\rangle +\lvert 1100\rangle )\) before the qubits q2, q3 are measured along the z-axis (black squares in Fig. 1). This entails that, at the end of the circuit, measuring q3 and q2 provides the results {(0,0),(0,1),(1,0),(1,1)} with probabilities {0,0.5,0.25,0.25}, respectively. Of course, such dynamics only occur under ideal unitary evolution, which is not the case for implementations on real experimental devices. In our case, the noisy environment in which the machines are immersed alters each realization of the simulated quantum dynamics, thus making the latter stochastic. As we will prove below, this randomness is a specific feature of each machine and changes from one device to another, thus allowing us to perform classification tasks. Specifically, it is the discrepancies between the outcome probabilities measured (from qubits q2 and q3) on one or more IBM machines that enable us to learn the corresponding noise fingerprint, and then classify from which device the input data have been generated. Here, it is worth noting that, although a slightly different physical Hamiltonian may be implemented in the chips of each quantum device from one implementation to another, the variations observed in the measurement outcome distributions — having a prominent random nature — are not ascribable to such a deterministic aspect, but to a stochastic cause, thus pertaining to an external noise source. However, the same stochastic process can affect differently two equivalent quantum dynamics originated by two distinct physical Hamiltonian operators. Therefore, the fingerprint that we leverage for the classification may be due not only to differences in the noise profiles affecting the quantum devices, but also to their dependence on the way the testbed circuit is physically implemented.
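As a sanity check, the ideal probabilities quoted above can be reproduced with a statevector simulation of a single repetition. The sketch below uses Qiskit's Statevector class on the noiseless circuit and is of course only an idealization of the noisy hardware runs:

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

# One repetition of the testbed block (no measurements, ideal evolution).
qc = QuantumCircuit(4)
qc.h(0)
qc.h(1)
qc.cx(0, 2)
qc.cx(1, 3)
qc.x(0)
qc.x(1)
qc.ccx(0, 1, 2)

# Marginal probabilities of the pair (q3, q2); ideally the outcomes
# (0,1), (1,0) and (1,1) occur with probabilities 0.5, 0.25 and 0.25,
# while (0,0) never occurs.
print(Statevector(qc).probabilities_dict(qargs=[2, 3]))
```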

Our results (shown in the following) are quite general, since they do not depend on specific dynamics and do not require initial assumptions. Accordingly, we expect that such results may be reproduced on other quantum devices, even ones not necessarily designed to carry out computing tasks. However, it is worth observing that this generality is gained by using a black-box ML model that provides us little to no information about the specific sources of error that are present in the noise fingerprint. This is in stark contrast with conventional techniques for device benchmarking, which employ many resources and very specific protocols (thus lacking generality), but are able to extract information about the microscopic structure of the noise sources affecting the device.

4 Machine learning model

Let us provide some details on the adopted ML model, i.e. the popular Support Vector Machine (SVM) (Hastie et al. 2009).

The dataset given as input to the SVM is a set of n points \({\mathbf {x}}_{q}\in {\mathbb {R}}^{p}\), with q = 1,…,n, each of them living in the p-dimensional space of the data features, where a feature is a distinctive attribute of the elements of the dataset.

In binary classification problems, each xq with \(q=1,\dots ,n\) is associated with a class yq ∈{− 1,1} that represents the desired output of the SVM. Contextualizing this to our problem, the binary classes yq denote whether a given set of points xq has been generated (+ 1) or not (− 1) by a specific machine or in a given time window/interval. An SVM for binary classification is trained such that the two classes of points (provided as input to the ML model) are separated by the hyperplane that maximizes the distance between the hyperplane itself and the nearest points of each class (commonly denoted as the margin). If the points xq of the dataset are not linearly separable (which is most often the case), no such separating hyperplane exists and the points cannot be classified in this way. To circumvent this problem, SVMs employ a clever mapping into a higher-dimensional space (called feature space) by means of polynomial or Radial Basis Function (RBF) kernels, which allows for an easy classification as in Fig. 1. The extension to multiclass classification problems is then obtained by associating a class with multiple values to each xq. In our experiments, part of the generated dataset is used as a validation set to choose the best mapping among the following kernels: linear (meaning that the data is already linearly separable), polynomial with degree 2, 3 and 4, and RBF. In many cases, the simple linear kernel is enough to successfully perform the classification, but in other cases (e.g. in multiclass classification) the more complex kernels may be beneficial.
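To illustrate the role of the kernel, the toy example below (synthetic XOR-like data, unrelated to our quantum datasets) shows a case where a linear SVM fails while an RBF kernel separates the classes in feature space; the data and parameters are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: the two classes are not linearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X, y)
    # The linear kernel stays near chance level (~0.5), while the RBF
    # kernel separates the classes almost perfectly.
    print(kernel, clf.score(X, y))
```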

Finally, in our experiments, the classification accuracy is computed by comparing the predictions \(\hat {{\mathbf {y}}}\) returned by the ML models with the desired classes y of the test set:

$$ \text{accuracy} = \frac{1}{n_{\text{test}}}\sum\limits_{q=1}^{n_{\text{test}}} \mathbb{1}\left[\hat{y}_{q}=y_{q}\right], $$
(1)

where \(n_{\text{test}}\) is the number of elements of the test set and \(\mathbb {1}[c]\) is the indicator function such that \(\mathbb {1}[c]=1\) if c is true, and \(\mathbb {1}[c]=0\) otherwise. In this regard, to clarify the naming convention for the reader, we refer to ‘training’, ‘validation’ and ‘test’ sets to identify three non-overlapping partitions of the data. These partitions are used, respectively, to train the model, to select the best parameters, and to test the performance on unseen data. In the experiments we randomly select 60% of the data to train the SVM model, 20% to validate different configurations (i.e. the SVM kernel type), and 20% to report the results on unseen data.
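A minimal sketch of this bookkeeping with scikit-learn is given below; the feature matrix X (one row of outcome distributions per run) and the labels y are placeholders for our datasets, and all names are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholders: in our experiments X collects the measured outcome
# distributions (one row per run) and y the corresponding class labels.
X, y = np.random.rand(1000, 36), np.random.randint(0, 2, 1000)

# 60% / 20% / 20% split into training, validation and test partitions.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

# Choose the kernel on the validation set, among those listed above.
kernels = [SVC(kernel='linear'), SVC(kernel='poly', degree=2),
           SVC(kernel='poly', degree=3), SVC(kernel='poly', degree=4),
           SVC(kernel='rbf')]
best = max(kernels, key=lambda clf: clf.fit(X_tr, y_tr).score(X_val, y_val))

# Test accuracy as in Eq. (1): fraction of correct predictions on unseen data.
accuracy = np.mean(best.predict(X_te) == y_te)
```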

5 Experiments description

The results we are going to show concern three series of ML experiments that use two different datasets, obtained from the IBM quantum chips mentioned previously.

In the first two experiment series, the ML models are trained both to discriminate the noise fingerprint of different quantum devices and to identify a time dependence in each of them. The training of some of the models is performed on the dataset denoted here as FAST, which collects the outcome distributions measured in temporally-close executions of the testbed quantum circuit on 7 different IBM quantum machines (i.e. ‘Athens’, ‘Bogota’, ‘Casablanca’, ‘Lima’, ‘Quito’, ‘Santiago’, ‘Yorktown’). In these experiments, 20 parallel tasks (corresponding to the maximum allowed number) are appended to the IBM fair-share queue, and, once a task is concluded, another task is immediately added. For each task the testbed circuit is run 8000 times for each of the 9 different steps, and the probabilities of the measurement outcomes are computed over groups of 1000 shots out of the total 8000, thus obtaining 8 outcome distributions times 9 steps per task. Here, we recall that the outcomes recorded in the 9 consecutive measurement steps \(k=1,\dots ,9\) are obtained by locally measuring, after each CNOT and Toffoli gate, the quantum circuit composed by repeating 3 times the one depicted in Fig. 1. Moreover, the outcome probabilities of each step are obtained by sampling the outcomes from 1000 shots of the full quantum circuit, stopped each time at the considered k th measurement step. For the sake of clarity, at the first measurement step the quantum circuit is composed only of the Hadamard gates on q0 and q1 and the CNOT linking q0 and q2; at the second step, the quantum circuit contains all the previous gates followed by the CNOT from q1 to q3; at the third step, the circuit is composed of the circuit at the second step plus the additional NOT gates on q0, q1 and the Toffoli gate, and so on for the subsequent measurement steps (see the sketch below).
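The bookkeeping of this acquisition protocol can be sketched as follows; here AerSimulator is only a stand-in for the IBM cloud backends used in the actual experiments, and per-shot readout (memory=True) is assumed to be available:

```python
import numpy as np
from qiskit import transpile
from qiskit_aer import AerSimulator

def outcome_distributions(qc, n_groups=8, shots_per_group=1000):
    """Estimate the outcome distribution of one measurement step n_groups
    times, each over shots_per_group shots, mirroring how the 8000 shots
    per task are split into 8 probability vectors in the FAST dataset."""
    backend = AerSimulator()     # stand-in for the IBM hardware backends
    job = backend.run(transpile(qc, backend),
                      shots=n_groups * shots_per_group, memory=True)
    bitstrings = np.array(job.result().get_memory())  # one outcome per shot
    groups = bitstrings.reshape(n_groups, shots_per_group)
    outcomes = ('00', '01', '10', '11')               # (q3, q2) outcome pairs
    return np.array([[np.mean(g == b) for b in outcomes] for g in groups])
```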

Conversely, in the third ML experiment series, we perform a robustness analysis by making the time constraints on the employed datasets stricter. Specifically, in those experiments, and in part of the previous ones, we employ a second dataset, called SLOW, which is composed of measurement distributions extracted from executions on two different quantum machines (‘Belem’ and ‘Quito’) collected more ‘slowly’ than the data in the first dataset. As represented in Fig. 2, more ‘slowly’ means that only one task at a time is appended to the queue and then run, waiting at least 2 min from the conclusion of the previous task. Moreover, for each task the testbed circuit is executed 1000 times for each of the 9 steps, which corresponds to the number of shots used to compute the outcome distributions.

Fig. 2

Elapsed hours to collect all the measurement outcomes on the IBM machines ‘Belem’ and ‘Quito’ (solid blue and dashed red lines, respectively) for the dataset SLOW. Each point of the curves, obtained over 1000 executions of the testbed quantum circuit for each measurement step k = 1,…,9, is associated with the physical/real time at which the measurement probabilities are computed in a single run. Notice that, if compared with the time scale of the vertical axis (y-axis), which is expressed in hours, the computation of the 9000 executions of each run can be considered practically instantaneous, i.e. on the order of a few seconds. Moreover, the anomalous behaviour of the curves after 1500 runs has to be attributed to the policy of the IBM fair-share queue

Overall, for each machine, we have collected 2000 sequences of 9 probability distributions built from the measurement outcomes of the qubits q3 and q2 of the testbed circuit. This means that a total of 2 000 000 × 9 single executions have been run on each quantum machine employed to generate the FAST dataset, and similarly for the SLOW one.

As a final remark, let us note that the FAST dataset is employed for the experiments illustrated in Section 6 and part of Section 7, while the SLOW dataset is used to complete the experiments in Section 7 and to perform, in Section 8, a robustness analysis at different time scales.

6 Quantum devices classification

First, we present binary classification experiments. For each pair of IBM machines, an SVM model is trained using the dataset FAST (introduced in Section 5) with the aim of identifying on which device the executions of the testbed quantum circuit are run. The inputs of the SVM model are the distributions of the measurement outcomes from qubits q3 and q2 recorded at the discrete measurement steps \(k=1,\dots ,9\). Specifically, two different kinds of inputs are set: in the first we consider only the outcome distributions measured at the single step k with k ∈ [1,9], while in the second we concatenate all the measurement probabilities in ordered sequences \(1,\dots ,k\) (see the sketch below). Then, our ML experiments are performed by alternatively taking the two types of inputs; we will report below the resulting accuracy values for both of them.
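Building the two input types amounts to simple slicing of the collected distributions; in the sketch below, the array P and the file name are hypothetical placeholders for the FAST dataset:

```python
import numpy as np

# P: hypothetical array of shape (n_runs, 9, 4) collecting, for each run,
# the outcome distributions measured at the 9 steps of the testbed circuit.
P = np.load('fast_dataset.npy')   # illustrative file name

k = 3                             # measurement step (1-indexed)
X_single = P[:, k - 1, :]                         # input 1: step k only
X_sequence = P[:, :k, :].reshape(len(P), -1)      # input 2: steps 1, ..., k
```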

From the results of our experiments — reported in Table 1 — we observe that it is sufficient to use only the outcome probabilities corresponding to the first three measurement steps k = 1,2,3 to reach more than 99% accuracy in discriminating all the pairs of tested machines. This implies that, in a realistic deployment scenario, one needs less data than the amount acquired here to reach good classification performances. An additional observation is that the accuracy is not monotonic in k when considering the classifier using single-measurement data. This may be due to the fact that, at the various measurement steps, distinguishing the noise fingerprint from a single measurement probability can be easier or harder. On the other hand, we can also observe that the accuracy is steadily increasing when the input is the sequence of all outcome distributions up to any measurement step k. Hence, we can deduce that, to identify the noise fingerprint of IBM quantum devices, sequences of outcome distributions recorded at more than one measurement step need to be taken into account. This is also the reason why we deem it important to frame the issues addressed in this paper as pertaining to a noise fingerprint in time, rather than to single-run measurements.

Table 1 Classification accuracy, denoted as α(⋅), of all the possible binary SVMs trained with the measurement probabilities collected in the dataset FAST. For each experiment, a large number of executions are run on two different IBM machines (whose names are in the 1st column and in the 1st row of the table) at the measurement steps k (1st column of each sub-table). Then, two different inputs are tested: outcome distributions at single steps (whose accuracy values are in the 2nd column of the sub-tables) and sequences of measurement probabilities obtained up to each k (accuracy values in the 3rd column of each sub-table). A color gradient representing the accuracy is given to facilitate the reading of the table (red the lowest, green the highest)

Let us now extend the binary SVM algorithms to multiclass classification problems, in which more quantum devices are simultaneously discriminated. In our experiments, the so-called one-vs-rest strategy is adopted (Bishop 2006), where for n distinct classes we train n different binary classifiers, each discriminating the elements of one class from the others. In particular, our multiclass SVM is trained with the aim of identifying to which IBM quantum machine, among the 7 that have been used, a given set of measured outcome probabilities (from the testbed quantum circuit) of the FAST dataset belongs. The results in Table 2 report the test accuracy values returned by the models trained with different input data. As in binary classification, for one kind of input data the model is trained with the outcome distributions obtained at the single step k with k ∈ [1,9] (3rd column of Table 2), while another set of input data is provided by the concatenated measurement probabilities \(1,\dots ,k\) (8th column). Moreover, for the purpose of multiclass classification, further inputs are also adopted: at each step k the model is trained not only with the outcome distributions at the k th step, but also with a window of preceding measurement probabilities belonging to [k − s,k], with s an integer. Regarding s, the range from 1 (4th column of Table 2) to 5 (7th column of Table 2) is considered; a sketch of this sliding-window construction is given below. As in the binary case, the SVM is able to successfully discriminate between the tested machines just by using the measurement outcomes taken in a few measurement steps. While the accuracy using the outcomes at single measurement steps oscillates, the accuracy on the time-ordered sequences monotonically increases. This confirms our previous observations about the need for a time sequence to obtain a reliable fingerprint. In addition, the models trained with input data on sliding windows allow us to assess the actual need for outcome distributions taken from more than a single measurement step for the classification of the noise fingerprint. In this case, we observe that the accuracy at each step k steadily increases with the size of the set of considered steps, and this holds also by looking at the average of the accuracy values computed over all the measurement steps. It is worth noting that the last column of Table 2 expresses a similar strategy, where the single accuracy values are provided as output of the models trained on a window (of increasing dimension) that always starts from the 1st and reaches the k th step. In other words, the first accuracy values on top of columns 3 to 7 correspond to the elements of the last column for k from 1 to 5.
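The sliding-window inputs and the one-vs-rest multiclass SVM can be sketched as follows; the array P and the labels are again hypothetical placeholders for the FAST-dataset distributions and the 7 machine identifiers:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def window_input(P, k, s):
    """Concatenate the outcome distributions of the steps in [k - s, k]
    (1-indexed, clipped at step 1) into one feature vector per run."""
    lo = max(0, k - 1 - s)
    return P[:, lo:k, :].reshape(len(P), -1)

# Placeholders for the FAST-dataset distributions and the 7 machine labels.
P = np.random.rand(1400, 9, 4)
labels = np.random.randint(0, 7, 1400)

# One-vs-rest multiclass SVM trained on the window [k - 2, k] with k = 5.
clf = OneVsRestClassifier(SVC(kernel='rbf'))
clf.fit(window_input(P, k=5, s=2), labels)
```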

Table 2 Classification accuracy, denoted as α(⋅), of multiclass SVMs trained with the measurement probabilities collected in the dataset FAST. A large number of executions are run on 7 different IBM machines (1st column of the table) at the measurement steps k ∈ [1,9] (2nd column). Different inputs are tested: outcome distributions at single steps (3rd column), sequences of measurement probabilities computed on windows of width from 2 to 5 steps before each k (from the 4th to the 7th column), and the sequences of all measurement probabilities obtained from the 1st to the k th step (8th column). Finally, the last row of the table reports, for all values of k, the averages of the accuracy values in the rows above; the average of the last column is omitted, since the accuracy values therein are calculated on models with different numbers of input measurement steps. A color gradient representing the accuracy is given to facilitate the reading of the table (red the lowest, green the highest)

The high level of accuracy (even more than 99%) in carrying out binary and multiclass classification of the IBM quantum machines is evidence of the presence of a strong underlying noise fingerprint in the dynamics of NISQ devices. Indeed, this is the key feature that can allow one to identify, basically in a deterministic way, from which quantum machine a given set of measurements has been obtained.

7 Noise fingerprint at different time scales

Since the environment of the IBM quantum devices changes quite often (e.g. the machines are calibrated up to multiple times in an hour), we have slightly modified our experiments to also prove the existence of a noise fingerprint pertaining to the temporal evolution of the chip on which a given quantum circuit is executed.

To confirm this hypothesis, we have designed a temporal classification setting that we employ with data from both the FAST and SLOW datasets.

Regarding the experiments using the FAST dataset, two sets of measurement outcome distributions are collected for the machine ‘Casablanca’, one temporally separated from the other by 24 h. After that, similarly to what was done in the previous experiments, an SVM model is used to discriminate the executions implemented on the IBM device on the first day from those performed on the second day. From these experiments, whose results are shown in Table 3, we observe that the designed ML algorithms are able to detect a characteristic fingerprint, still induced by the presence of noise sources, in a single quantum device but with executions separated by a quite long (24 h) time interval. In such classification tasks, an accuracy of 95% is achieved by the ML models just by taking as input the sequence of outcome distributions at the first measurement steps k = 1,2,3.

Table 3 Classification accuracy, denoted as α(⋅), of SVMs — trained with two sets of outcome distributions from the dataset FAST, temporally separated by 24 h — to predict which executions on the IBM machine ‘Casablanca’ were implemented on the first day and which on the second day. Also in this case, the inputs to the SVMs at the measurement steps k (2nd column of the table) are the outcome distributions at single steps (3rd column) or the sequences of measurement probabilities computed up to each k (4th column). A color gradient representing the accuracy is given to facilitate the reading of the table (red the lowest, green the highest)

Analogously to the previous experiments, single measurement outcomes do not seem to carry enough information on the noise fingerprint, and the classification accuracy depends on the choice of k. Instead, when we consider the sequences of outcomes for all the steps, we can observe that the noise fingerprint in the first window of runs can be distinguished much better from the corresponding fingerprint in all the subsequent windows, except the neighbouring one. The window from run 1401 to run 1600 seems more challenging to classify with respect to the others. One possible reason is that, as one can see from Fig. 2, around run 1500 the policies of the IBM fair-share queue caused a discontinuity in time. This means that the data distribution inside the aforementioned window has more variance with respect to the data in the other windows, and the ML models can find it more difficult to classify such data. However, even in that case the classification accuracy reaches 100% when using the sequence of measurement probabilities for all the steps \(k=1,\dots ,9\).

In order to better quantify the evolution in time of the noise fingerprint, we use data from the runs of ‘Belem’ in the SLOW dataset. With respect to the previous dataset, the data from the runs in SLOW are more evenly distributed in time, so we have decided to split the data into 10 adjacent windows, each of them containing 200 consecutive runs. Subsequently, the SVM models are trained to classify whether a run has been computed in the first window (from run 1 to run 200) or in one of the remaining 9 windows. From the results in Table 4, we can observe that it is difficult to distinguish the runs belonging to the first window from the runs in the adjacent window (i.e. runs from 201 to 400 in the third column), either considering as input the single outcome distributions at the k th measurement step (top part of Table 4) or the sequences of measurement probabilities from step 1 to step k (bottom part). As a matter of fact, we do not reach 90% accuracy in either case. Conversely, when we consider the subsequent windows (runs after 400 in the next columns), thus at a greater distance from the first window, the classification task becomes easier.

Table 4 Binary classification accuracy, denoted as α(⋅), of SVMs trained to classify the outcome distributions belonging to two distinct sets of data. One set is composed of the runs of ‘Belem’ in the SLOW dataset, numbered from 1 to 200 in temporal order. The other set is also composed of runs of ‘Belem’ in the SLOW dataset, but collected within the temporal windows specified in the column titles (from run 201 to run 400, from run 401 to run 600, etc.). In the top sub-table the models are trained with the outcome distributions taken at the k th measurement step, while in the bottom sub-table the inputs are the sequences of measurement probabilities from step 1 to step k. A color gradient representing the accuracy is given to facilitate the reading of the table (red the lowest, green the highest)

In these experiments, the execution time for all the runs in each window is approximately 12 h (except for the previously discussed window from run 1401 to 1600). Thus, we can deduce that 12 h of time distance between the windows are sufficient to distinguish the noise fingerprint at different times with 100% accuracy. To find the minimum necessary time gap, in Fig. 3 we report the accuracy reached by an SVM model trained to distinguish the runs in the first window (from run 1 to 200) of ‘Belem’ within the SLOW dataset from the runs in another window, with an increasing time gap between them. We can observe that, in this case, already after 6 h the noise fingerprint is distinguishable with an accuracy of 100%. In general, we can observe that, even starting from different windows in time and using different window sizes, more than 95% accuracy is reached after a few hours (on the order of one day).

Fig. 3

Maximum accuracy reached by SVM models trained on sequences of measurement outcomes for all the steps \(k=1,\dots ,9\), taken from the ‘Belem’ quantum machine and collected in the dataset SLOW. The model is trained to distinguish the executions in the window of runs from 1 to 200 from those in a subsequent window of 200 runs. Initially, the latter is adjacent to the first window; then it is moved by increasing the gap between the two windows. The plotted curve is obtained by drawing the accuracy values for the corresponding gaps, expressed in hours. Note that a gap of 6 h corresponds to approximately 90 runs

Overall, we can thus conclude that a clear temporal dependence of the noise fingerprint is present in our experiments, even when the same quantum machine is taken into account.

8 Robustness analysis

Finally, we investigate the robustness of the learned fingerprint at different time scales. For this purpose, taking the IBM machines ‘Belem’ and ‘Quito’, we temporally order all the executions of the testbed quantum circuit, dividing them into 10 distinct windows of 400 consecutive runs, i.e. 200 runs per machine. The elapsed time between runs has already been reported in Fig. 2. In this way, after having generated the SLOW dataset (introduced in Section 5) with 2000 runs per machine, the SVMs are trained to classify on which device, between ‘Belem’ and ‘Quito’, the testbed quantum circuit has been executed. Specifically, in each experiment designed for the robustness analysis, the ML model is trained over the data collected in a time window of 200 consecutive runs (overall, we consider 10 distinct time windows), and then tested on all the considered time windows, including the one used for the training (see the sketch below).
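This train-on-one-window, test-on-all-windows protocol can be sketched as follows; the feature matrix X and the labels y are placeholders for the SLOW dataset, with rows assumed to be in temporal order:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholders: X has one row per run (outcome distributions of all 9
# steps) and y is the machine label ('Belem' = 0, 'Quito' = 1); rows are
# temporally ordered, and each window contains 400 consecutive runs.
X, y = np.random.rand(4000, 36), np.tile([0, 1], 2000)
windows = [slice(400 * i, 400 * (i + 1)) for i in range(10)]

# Accuracy matrix as in Table 5: train on window i, test on every window j.
acc = np.zeros((10, 10))
for i, w_tr in enumerate(windows):
    clf = SVC(kernel='rbf').fit(X[w_tr], y[w_tr])
    for j, w_te in enumerate(windows):
        acc[i, j] = clf.score(X[w_te], y[w_te])
```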

All the obtained results — summarized in Table 5 — point out the following peculiar feature. Unsurprisingly, the SVM reaches 100% accuracy in the time window used for the training of the ML model (corresponding to the diagonal of the table), and then, in the proximity of the time windows on the diagonal, the accuracy decreases monotonically. This matches the intuition that the machine-related noise fingerprint ‘fades’ with time, due to the evidence — discussed in the previous section — that the noise fingerprint of the IBM quantum devices exhibits a quite prominent time dependence. However, surprisingly, we observe that the accuracy returns to 100% for time windows of runs far from the training one. We conjecture that this counter-intuitive phenomenon may be due either to the periodic calibration of the machines or to the slowdown induced by the fair-share queue. The latter, indeed, may also be observed in the last part of the SLOW dataset in Fig. 2, and is supported by the evidence that, if we restrict the experiment to the runs from 1 to 1000 (i.e. the range where the execution times of the tested machines are more homogeneous, as shown in Fig. 2), the resulting accuracy values decrease with time.

Table 5 Classification accuracy of SVMs trained to classify on which quantum device, between ‘Belem’ and ‘Quito’, a given set of data has been generated. The training of the models is performed with the outcome distributions collected in the dataset SLOW, divided into 10 distinct time windows of 200 runs (the first window includes the runs from 1 to 200, the second from 201 to 400, etc.). We recall that each run contains the outcomes from all the 9 measurement steps in each execution. The row and column indices denote, respectively, the time windows whose data are used to train and to test the ML model. Finally, the reported accuracy values are calculated by using the outcome distributions computed at all the measurement steps k = 1,…,9. A color gradient representing the accuracy is given to facilitate the reading of the table (red the lowest, green the highest)

The general result that can be deduced from the robustness analysis is that, by training our ML model on just 200 runs (corresponding to the diagonal time windows of the table), we are able to identify the device-related noise fingerprint with high accuracy for all the 1800 remaining ones. In this regard, it is worth noting that, between the training samples and the last test ones, up to a week of real execution time elapses (as one can see in Fig. 2). This means that we can consider our classifier to be fairly robust in time, despite the changes in the environment and in the calibration of the machines that may occur even at time scales of weeks.

9 Conclusions

In this work we prove the existence of a noise fingerprint — also admitting a clear time-dependent profile — in the tested IBM quantum machines, which are just a particular class of NISQ devices. We have also demonstrated that such noise fingerprints can be exploited to reliably distinguish the machines by means of SVM models. As general results, our experiments confirm that (i) all the analysed quantum devices exhibit a clear machine-related noise fingerprint that is robust, in the sense that the fingerprint is highly predictable over time in windows of consecutive runs; (ii) the noise fingerprint also has a time dependence, namely it changes over time and after a few hours becomes different enough to be distinguished from the fingerprint in the past; (iii) in each quantum device, sequences of measurement outcome distributions are required for the accurate learning of the corresponding noise fingerprint. One may conjecture that a possible reason behind the latter aspect is that the noisy dynamics in the IBM machines can be non-Markovian, due to the presence of time correlations among consecutive samples of the noise field. However, it is worth observing that the SVMs we successfully used in this work are memory-less ML models, which thus ignore possible temporal relations across the measurement steps. Therefore, the gathered data and the adopted ML models are not suited to validate any hypothesis on non-Markovianity. These aspects, deserving further investigation, will be addressed in another contribution, in which memory-less ML models will be compared with other ML architectures processing time-series data with variable memory length. In conclusion, although the microscopic reasons for the existence of a machine-related noise fingerprint are still unknown (indeed, the IBM machines are partly inaccessible), we can now affirm that one can reliably leverage such noise profiles to distinguish, and possibly in the future characterize, different NISQ quantum devices.

As an outlook, learning the noise fingerprint of quantum devices from time-ordered measurements of testbed quantum circuits is expected to open the way, in the near future, to many other experiments and ideas. The proposed methodology, indeed, may be applied not only to IBM quantum machines, but also to a larger class of quantum devices, in both commercial and laboratory scenarios. In all of them, classification ML models, exploiting the presence of intrinsic noise sources that give rise to an identifiable noise fingerprint in the devices, may be employed to predict on which machine, and at which time, a given quantum circuit or algorithm was executed. Moreover, our procedures could be adopted to predict if and when the noise fingerprint of a specific quantum device changes over time, e.g. due to calibration actions. Such knowledge will help in mitigating (time-varying) errors occurring in the computation and, possibly, in performing ad hoc error correction.