1 Introduction

The development of practically usable quantum computing technologies is in full swing, involving global players such as Alibaba, Atos, Google, IBM, and Microsoft as well as specialists in this field such as Rigetti Computing and D-Wave. These parties compete for the technological lead and, ultimately, for the raw number of qubits they can provide through their quantum processing units (QPUs), which can be either hardware quantum computers or quantum computer simulators running on classical high-performance computing hardware. This situation resembles the very early days of GPU-accelerated computing, when the first generation of general-purpose programmable graphics cards became available but their productive use in scientific applications was largely hindered by the unavailability of software development kits (SDKs) and easy-to-use domain-specific software libraries and, even more severely, by the lack of standardized, non-proprietary development environments that would lower the dependence on a particular GPU vendor.

Today’s quantum software landscape can be grouped into three main categories: quantum SDKs [1, 6, 15, 19, 22], stand-alone quantum simulators [5, 11, 13], and quantum assembly (QASM) [2, 3, 12] or instruction languages (QUIL) [21]. A recent overview and comparison of gate-based quantum software platforms by LaRosa [14] shows that the field is highly fragmented, which makes a fair quantitative performance comparison impossible. Moreover, these tools target quantum computing experts who are mainly interested in the development of stand-alone quantum algorithms rather than in their use as computational building blocks within a possibly hybrid classical-quantum solution procedure.

In our opinion, practical quantum computing has the best chance of becoming a game-changer for the computational sciences if it is positioned as a special-purpose accelerator technology that will become available in future heterogeneous compute platforms equipped with GPUs, QPUs, and other emerging accelerators such as field-programmable gate arrays (FPGAs). Researchers and scientific application developers will then have the free choice between, say, running the HHL algorithm [9] on a QPU accelerator and adopting one of the many classical numerical methods for solving linear systems of equations on CPUs, GPUs, or FPGAs, depending on problem size and matrix characteristics. In [17] we have outlined a conceptual framework for QPU-accelerated automated design optimization that builds on the HHL solver as its main computational driver.

We believe that end-users from the computational science and engineering community would be interested in giving QPU-accelerated computing a try with the right software tools at hand. With this vision in mind, we created the LibKet project [16] (pronounced Lib-Ket), a cross-platform programming framework that aims at making QPU-accelerated computing as easily accessible to the masses as GPU computing is today through frameworks like CUDA [18].

The remainder of this paper is structured as follows: Sect. 2 discusses the design principles underlying the framework, which is introduced in Sect. 3. Implementation details are discussed in Sect. 4, followed by a brief demonstration of LibKet’s capabilities in Sect. 5. Section 6 completes the paper with a conclusion and an outlook on functionality planned for future releases.

2 Design Principles

To achieve this vision, LibKet is designed based on the following principles:

  • QPU-accelerated computing: Quantum computers are used as special-purpose accelerator devices within a heterogeneous computer system that can host multiple accelerator technologies (GPUs, FPGAs, ...) side by side.

  • Concurrent task offloading: Quantum algorithms are implemented as compute kernels describing concurrent tasks launched on QPU devices.

  • Single-source quantum-classical programming: Classical and quantum code is implemented in a single source file, which is compiled into one hybrid binary executable that runs on the host computer and offloads certain parts of the computation to the accelerator devices.

  • Write once, run anywhere: Quantum algorithms are implemented once and for all as generic expressions, which can be executed on current and future QPU-device types. Support for a particular type is realized by a small set of conversion functions between LibKet’s unified interface layer and the device-specific low-level application programming interface (API).

  • Standing on the shoulders of giants: LibKet is developed on top of existing vendor-specific tools and libraries to exploit their full optimization potential.

  • Seamless integration into the status quo: LibKet does not create new standards that need to be implemented by others but utilizes the available tools.

The first three principles suggest a conceptual design in the spirit of CUDA [18] or OpenCL [23], which are de-facto standards for GPU computing. To underline the postulated similarity between QPU- and GPU-accelerated computing and to make quantum computing more accessible to experts in classical accelerator technologies, we use GPU-inspired terminology throughout this paper, such as host (the CPU and its memory) and device (the QPU and its memory), kernels and streams, as well as asynchronous execution and synchronization.

The write-once-run-anywhere principle has led us to adopt template meta-programming techniques to implement quantum algorithms as generic expressions, whose evaluation for a particular QPU type is delayed until the program flow reaches the point where the actual value is really needed. This approach is known as lazy evaluation or call-by-need in programming language theory and is used successfully in linear algebra libraries [4, 7, 8, 10, 20, 24].
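As a minimal, LibKet-independent illustration of this technique, the following C++14 sketch encodes an expression as a nested type at compile time and evaluates it only when its value is requested:

#include <iostream>

// A terminal node holding a value.
struct Constant {
    double value;
    double eval() const { return value; }
};

// A node representing the (delayed) sum of two sub-expressions.
template<typename Lhs, typename Rhs>
struct Sum {
    Lhs lhs; Rhs rhs;
    double eval() const { return lhs.eval() + rhs.eval(); }  // evaluated on demand
};

// Creator function: building the expression only records its structure in the type.
template<typename Lhs, typename Rhs>
Sum<Lhs, Rhs> operator+(Lhs lhs, Rhs rhs) { return {lhs, rhs}; }

int main() {
    // expr has type Sum<Sum<Constant, Constant>, Constant>; nothing is computed yet.
    auto expr = Constant{1.0} + Constant{2.0} + Constant{3.0};
    // The expression (an AST encoded in the type system) is evaluated only here.
    std::cout << expr.eval() << std::endl;   // prints 6
    return 0;
}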

The last two principles are based mainly on pragmatic considerations. Firstly, introducing yet another approach to quantum programming that is incompatible with the existing ones would escalate the fragmentation of the quantum software landscape instead of improving the situation for potential end-users. Moreover, the chosen approach allows us to exploit the expertise and manpower of scientists worldwide working on different aspects of quantum computing, including their knowledge of non-disclosed technical details of QPU devices, to create an open software ecosystem that immediately benefits from any improvement in one of the underlying core components. Finally, most users are more open to emerging technologies if they come as evolutionary increments of the status quo rather than radical paradigm shifts that require discarding all previous work.

3 The LibKet Programming Framework

The open-source, cross-platform programming framework LibKet is designed as a header-only C++14 library with minimal external dependencies, namely an embedded Python interpreter and, possibly, header and/or library files from the respective quantum backends. It can be downloaded free of charge from the GitLab repository https://gitlab.com/mmoelle1/LibKet, which provides documentation in the form of a wiki and API reference as well as several tutorial examples to get started. In addition to the primary C++ API, C and Python APIs are being implemented; they adopt just-in-time compilation techniques to exploit the full potential of C++ template meta-programming internally and expose LibKet’s functionality in C- and Python-style to the outside.

A comprehensive overview of the LibKet programming framework is given in Fig. 1. It consists of three layers that provide components for application programmers (high-level (HL) API), quantum algorithm developers (mid-level (ML) API), and QPU providers (low-level (LL) API), respectively.

Fig. 1. Overview of the LibKet cross-platform programming framework.

Before we describe the different software layers in more detail, we give a short example of LibKet’s general usage. Consider the C++ code snippet given in Listing 1, which puts the first and third qubits of a quantum register into the maximally entangled first Bell state, where A is qubit 1 and B is qubit 3:

\( |\Phi^+\rangle_{AB} = \frac{1}{\sqrt{2}} \left( |0\rangle_A |0\rangle_B + |1\rangle_A |1\rangle_B \right) \)  (1)

The easiest way to achieve this is to start from the computational basis state \( |0\rangle_A |0\rangle_B \) and apply a Hadamard gate to qubit A followed by a controlled-NOT (CNOT) gate with A as control and B as target.


This is realized by the quantum expression constructed in lines 8–9 of the code snippet, thereby demonstrating two of LibKet’s most essential components, namely Quantum Filters and Quantum Gates, which are implemented in the namespaces LibKet::filters and LibKet::gates, respectively.

As the name suggests, filters select a subset of the quantum register; see Sect. 4.1 for more details. Here, sel\(<1>\)() selects the first qubit for applying the Hadamard gate. This sub-expression serves as the first argument, the control, to the binary CNOT gate, whose action is applied to the third qubit (selected via sel\(<3>\)()). The init() gate puts all qubits of the quantum register into the computational basis state \(|0\rangle\). More information on gates is given in Sect. 4.2. It should be noted that the resulting quantum expression is generic, that is, the object expr holds an abstract syntax tree (AST) representation of the Bell-state creation algorithm that can be synthesized to any of LibKet’s quantum backends. For the cloud-based Quantum-Inspire (QI) platform, this is accomplished by lines 15 and 18. In short, line 15 creates a device object that holds 6 qubits, into which the generic quantum expression expr is specialized (line 18) as common QASM (cQASM) v1.0 code [12], the programming language of the QI backend. The internally stored quantum kernel code as well as the quantum expression expr can be printed as illustrated in lines 21 and 12, respectively; see Listing 2. The probability amplitudes resulting from 1024 runs of the quantum algorithm are presented in the same diagram.


The actual execution of the quantum kernel is triggered in line 24, which starts an embedded Python interpreter as a sub-process to communicate with the cloud-based quantum simulator platform via the vendor-specific QI SDK. This call performs blocking execution and returns a JSON object upon successful completion, from which the result can be retrieved. More details on how to customize the execution process, run multiple quantum kernels concurrently, and perform non-blocking asynchronous kernel execution are given in Sect. 4.5.
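Since the listing itself is not reproduced here, the following hedged sketch recapitulates the steps described above; the header name, the QDeviceType enumerator, the printing mechanism, and the exact call signatures are assumptions that may differ from the actual LibKet API.

#include <iostream>
#include <LibKet.hpp>                  // header name assumed

using namespace LibKet::filters;       // sel<...>(), ...
using namespace LibKet::gates;         // init(), h(), cnot(), ...

int main() {
    // Generic quantum expression (AST): initialize all qubits, apply a Hadamard
    // gate to qubit 1, and use it as control of a CNOT targeting qubit 3.
    auto expr = cnot( h( sel<1>( init() ) ), sel<3>() );

    std::cout << expr << std::endl;    // print the AST (printing mechanism assumed)

    // Synthesize the generic expression for the Quantum-Inspire simulator
    // using a 6-qubit register (enumerator name and call syntax assumed).
    LibKet::QDevice<LibKet::QDeviceType::qi_simulator, 6> device;
    device(expr);

    std::cout << device << std::endl;  // print the generated cQASM kernel (assumed)

    auto result = device.eval();       // blocking execution; returns a JSON object
    std::cout << result << std::endl;  // number of shots (e.g. 1024) configured elsewhere
    return 0;
}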


4 Implementation Details

In what follows, we address the individual components, shed some light on their internal realization, and describe ways to extend them to support new backends.

4.1 Quantum Filter Chains

As stated before, LibKet’s quantum filters select subsets of qubits from the global quantum register to which subsequent quantum operations are applied, comparable to matrix views in the Eigen library [8].

Since today’s and near-future quantum processors have a very limited number of qubits, typically between 5 and 50, we consider the assumption of a single global quantum register and the absence of dynamic memory (de)allocation capabilities most practical. Moreover, quantum computing follows the in-memory computing paradigm, that is, data is stored and manipulated at fixed locations in memory. This is in contrast to the classical von Neumann computer architecture, where data is transported between the random-access main memory (RAM) and the central processing unit (CPU), with the latter performing the computations.

Table 1. LibKet’s quantum filters.

Table 1 lists all quantum filters supported by LibKet. All filtering operations are applied relative to the given input, which makes it possible to combine multiple filters into so-called filter chains. Consider, for instance, the filter chain qubit\(<2>\)(shift\(<2>\)(range\(<2,5>\)())), which selects the 6th qubit from the global register or, more precisely, from the pre-selected set of qubits passed as input.

Thanks to the use of C++ template meta-programming techniques, quantum filters are evaluated at compile time and, hence, even complex filter chains incur no run-time overhead. With the aid of gototag<Tag>() it is possible to restore a previously stored filter configuration that has been tagged by the tag<Tag>() function. It is generally recommended to safeguard quantum expressions that are to be used as building blocks in larger algorithms by tag-gototag pairs to prevent side effects from internal manipulation of the qubit selection; a usage sketch is given below.
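The following fragment sketches both mechanisms, assuming the LibKet headers are included and the filter and gate names behave as described above; the integer tag parameter and exact template signatures are assumptions.

using namespace LibKet::filters;
using namespace LibKet::gates;

// Filter chain from the example above: select qubits 2-5 of the incoming
// selection, shift the selection by two positions, and pick the qubit at
// position 2 of the result.
auto f = qubit<2>( shift<2>( range<2, 5>() ) );

// Safeguard a building block by a tag-gototag pair so that internal filter
// manipulations do not leak into the caller's qubit selection.
auto block = gototag<0>( h( sel<1>( tag<0>( init() ) ) ) );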

All components listed in Table 1 come in two flavours: a class whose instantiated objects span the abstract syntax tree (AST) of the expression, and a creator function that returns an object of the respective type. Classes are required to implement the overloaded call operator, operator(), for all expressions that should be supported; see Listing 3 for an example. Here and below, the universal-reference (forwarding-reference) variant of operator() is omitted due to space limitations, but it is implemented for all types to support C++11 move semantics.
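Listing 3 is not reproduced here; the following simplified, self-contained sketch (stand-in types, not LibKet's actual implementation) illustrates the class-plus-creator-function pattern with an overloaded operator():

#include <cstddef>

// Simplified stand-in for a filter class: an AST node representing the
// selection of one qubit from the expression it wraps.
template<std::size_t Index, typename Expr>
struct QFilterSel {
    Expr expr;   // sub-expression, stored by value

    // Overloaded call operator, required for all expression types that should
    // be supported; here it simply forwards the call to the wrapped sub-expression.
    template<typename Data>
    Data& operator()(Data& data) const { return expr(data); }
};

// Corresponding creator function returning an object of the respective type.
// A second overload taking Expr&& (a forwarding reference) would be added to
// support C++11 move semantics, as mentioned above.
template<std::size_t Index, typename Expr>
QFilterSel<Index, Expr> sel(Expr expr) { return {expr}; }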

Though not foreseen in the current implementation, the quantum filter mechanism just described can easily be extended to support a rudimentary stack memory based on a reserved region of the global quantum register. Together with LibKet’s just-in-time (JIT) capabilities (see below), even dynamic memory (de)allocation would be possible with the adopted concept once a sufficiently large number of qubits and circuit depths are reliably supported in quantum hardware to make this feature relevant for practical applications.

4.2 Quantum Gates

LibKet’s implementation of quantum gates follows the same programming paradigm (class with overloaded operator() and gate-creator function) as described above. Additionally, each gate class provides an overloaded apply(QData<...>& data) method, which is specialized for each supported backend type. Listing 4 illustrates how the application of the Hadamard gate appends QASM code to the data’s internal quantum kernel for the cQASMv1 backend; see lines 4–13. The static range() method is one of several filter utility functions that returns the actual list of selected qubits based on data’s concrete register size at compile time.

Invoking the Hadamard creator function (lines 16–19) returns a UnaryQGate object (see below) that internally stores the current sub-expression, the gate to be applied next, and the filter selection. The specialized overload in lines 21–25 ensures that an immediate double application of the Hadamard gate is eliminated. LibKet makes extensive use of this type of rule-based optimization to eliminate gate-level expressions of the form t(tdag(...)) as well as entire quantum circuits followed immediately by their inverse, e.g., qft(qftdag(...)).
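A simplified, self-contained sketch of this pattern (stand-in types, not LibKet's actual implementation): the gate's apply() appends backend code to a kernel string, the creator function wraps the sub-expression in a unary gate node, and a specialized overload removes an immediate double application of the self-inverse Hadamard gate.

#include <string>

// Stand-in for a backend-specific data object holding the quantum kernel code.
struct QDataCQASM { std::string kernel; };

struct QHadamard {
    // Backend-specific specialization: append cQASM code for the Hadamard gate
    // (applied to qubit 0 here for brevity; LibKet derives the qubit list from filters).
    static void apply(QDataCQASM& data) { data.kernel += "h q[0]\n"; }
};

// Unary gate node spanning the AST: stores the gate as a type and the
// sub-expression by value (filter selection omitted for brevity).
template<typename Gate, typename Expr>
struct UnaryQGate { Expr expr; };

// Creator function: wrap the sub-expression in a Hadamard node.
template<typename Expr>
UnaryQGate<QHadamard, Expr> h(Expr expr) { return {expr}; }

// Rule-based optimization: h(h(expr)) is the identity, so return the inner expression.
template<typename Expr>
Expr h(UnaryQGate<QHadamard, Expr> expr) { return expr.expr; }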


To orchestrate the interplay of expressions, filters, and gates, LibKet implements unary, binary, and ternary gate containers that hold the aforementioned information as types, except for the actual sub-expression, which is stored by value. Instantiations of these nearly stateless classes span the quantum expression’s AST (see Listing 2 (left)), whereby an overloaded apply() method dispatches between the different variants to apply quantum gates to expressions.

In addition to the set of quantum gates that are typically supported by most QPU backends, LibKet comes with a special hook gate that can be used to implement common quantum building blocks, e.g., the first Bell state from Listing 1.


4.3 Quantum Circuit

The main advantage of LibKet’s generic quantum-expression approach becomes visible for circuits, which represent compile-time parametrizable algorithms like the well-known quantum Fourier transform, invoked via the qft() function. The implementation follows the same programming paradigm (class with overloaded operator() and corresponding creator function with rule-based optimization) but, typically, with a generic apply() method, whose synthesis into device-specific instructions is handled by the gates. However, our approach also makes it possible to specialize full circuits for selected QPU backends, e.g., to use Qiskit’s [1] internal realization of the HHL solver [9] for the IBM Q platform.


To ease the development of generic quantum circuits, LibKet implements a static for-loop that accepts the loop body as a functor passed as a template argument, together with the loop bounds and step size, as illustrated in Listing 5.
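Listing 5 is not reproduced here; the following self-contained C++14 sketch illustrates the general technique of a compile-time for-loop that instantiates a functor for every index (LibKet's actual interface may differ):

#include <iostream>

// Compile-time for-loop: the loop body is a functor passed as a template
// argument together with the loop bounds and the step size.
template<int Begin, int End, int Step, typename Body, bool Done = (Begin >= End)>
struct static_for {
    static void run() {
        Body().template operator()<Begin>();               // body for the current index
        static_for<Begin + Step, End, Step, Body>::run();   // recurse at compile time
    }
};

// Recursion anchor: all indices processed.
template<int Begin, int End, int Step, typename Body>
struct static_for<Begin, End, Step, Body, true> {
    static void run() {}
};

// Example body; in a quantum circuit this could, e.g., apply a gate to qubit <Index>.
struct PrintIndex {
    template<int Index>
    void operator()() const { std::cout << "iteration " << Index << "\n"; }
};

int main() {
    static_for<0, 8, 2, PrintIndex>::run();   // prints: iteration 0, 2, 4, 6
    return 0;
}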

Moreover, LibKet comes with just-in-time (JIT) compilation capabilities, making it possible to generate quantum expressions dynamically from user input. Quantum expressions given in string format are JIT compiled into dynamically loaded libraries that are cached across multiple program runs.


4.4 Quantum Devices

The synthesis of generic quantum expressions into device-dependent quantum instructions that can be executed on a specific QPU is realized by the many specializations of the QDevice class, which brings together a particular backend type with device-specific details such as credentials and parameters for connecting to cloud-based services, the maximum number of qubits, the native gate set, and the lattice structure, which might require internal optimization passes.

Lines 15 and 18 of Listing 1 create a device instance for running the quantum algorithm remotely on the Quantum-Inspire simulator platform and populate its internal quantum kernel with the expression given in Eq. (1) for creating the first Bell state, respectively. Besides providing methods for executing the kernel as described in the next section, some device types support extra functionality such as the transpilation of the generic quantum circuit into device-optimized quantum instructions and the export of the resulting circuit for visualization. The quantum circuits depicted in Fig. 2 were produced with only a few lines of code using this export functionality; a hedged sketch of such a call sequence is given below.

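A minimal sketch, assuming expr is the Bell-state expression from Listing 1; the QDeviceType enumerators and the use of stream insertion for printing the transpiled circuits are hypothetical placeholders, not the actual LibKet API.

// Synthesize the same generic expression for two hardware-oriented backends and
// inspect the transpiled, device-optimized circuits (names are hypothetical).
LibKet::QDevice<LibKet::QDeviceType::ibmq_london,   5> ibm_device;
LibKet::QDevice<LibKet::QDeviceType::cirq_foxtail, 22> google_device;

ibm_device(expr);                          // transpile to IBM's native gate set
google_device(expr);                       // transpile to Google's native gate set

std::cout << ibm_device    << std::endl;   // print/export the device-optimized circuit
std::cout << google_device << std::endl;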

We consider this functionality helpful for getting a better understanding of the actual circuit that is executed on the device – possibly with extra swap gates added to enable two-qubit operations on non-neighboring qubits – rather than its idealized textbook version. The transpilation step can be bypassed by choosing generic simulators such as ibmq_qasm_simulator and cirq_simulator.

Fig. 2. Quantum circuits for producing the first Bell state, cf. Eq. (1), optimized for (a) IBM’s 5-qubit London chip and (b) Google’s 22-qubit Foxtail chip.

4.5 Quantum Kernel Execution

Once the generic expression has been synthesized into device-dependent instructions, it can be executed on the respective QPU device. As explained before, our aim is to ease the transition from GPU programming to QPU-accelerated computing. LibKet therefore adopts a CUDA-inspired, stream-based execution model, which enables concurrent quantum kernel execution on multiple QPU devices.

The device’s eval() method called in line 24 of Listing 1 accepts a so-called QStream<QJobType::Python> object as an optional parameter, as do the methods execute() and execute_async(); a usage sketch is given after the next paragraph.


While the eval() method waits until the execution has finished and returns the result as a JSON object, or throws an exception upon failure, the execute() method returns a pointer to a job object QJob<QJobType::Python> that supports query(), wait(), and get() operations. Its non-blocking counterpart execute_async() can be used to hide the latency stemming from the execution of the quantum kernel on remote QPUs, as well as the overhead of invoking the embedded Python interpreter, behind other computations on the CPU or other accelerator devices. It is even possible to execute multiple quantum algorithms concurrently on multiple QPUs by launching their kernels in different streams.
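A hedged sketch of concurrent, non-blocking kernel execution, assuming two already populated device objects device0 and device1; the stream construction, the exact signatures, and the classical placeholder function do_classical_work_on_cpu() are assumptions.

// Two streams for launching kernels on two (possibly different) QPU backends.
LibKet::QStream<LibKet::QJobType::Python> stream0, stream1;

// Non-blocking launches: both kernels are in flight concurrently.
auto* job0 = device0.execute_async(stream0);
auto* job1 = device1.execute_async(stream1);

do_classical_work_on_cpu();        // hide QPU latency behind classical computation

job0->wait();                      // synchronize with the first kernel
job1->wait();
auto result0 = job0->get();        // retrieve the JSON results
auto result1 = job1->get();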

Using an embedded Python interpreter as the interface between classical host code and quantum kernels has the advantage that the full potential of vendor-specific SDKs can be exploited for circuit optimization and other pre- and post-processing tasks, including possible validity checks on the host side, before the quantum kernel is communicated to the remote QPU device for execution.

Three additional parameters of these execution methods can be used to inject user-defined code preceding the import of Python modules and right before and after the execution of the quantum circuit, respectively. A possible application of this feature is the internal post-processing of measurement results with the functionality provided by a particular SDK, e.g., to visualize the measurement outcome as a histogram and write it to a graphics file.


While retrieving the outcome of a quantum experiment as a JSON object is most flexible, it requires backend-specific post-processing steps to extract the desired information. For widely used data such as the job identifier and duration, the histogram of results, and the state with the highest likelihood, each QDevice class specialization provides functionality to extract the information from the JSON object and convert it into LibKet-specific or intrinsic C++ types.


Let us finally remark that LibKet also supports the native execution of quantum kernels written in C++, e.g., for quantum simulators like QX [13] and QuEST [11], using the multithreading capabilities of C++11.

Fig. 3. Run times for the quantum Fourier transform executed with 1–12 qubits (per group, from left to right) on five different QPU simulator backends.

5 Demonstration

LibKet is a rather young project under continuous development. The correct functioning of the core framework described in this paper has been verified by extensive unit tests. A comprehensive presentation of computational examples is beyond the scope of this paper and not possible within the given page limit. We therefore restrict ourselves to a single test case, namely the quantum expression qft(init()), which we apply to quantum registers of 1–12 qubits as a first benchmark to measure the performance of different QPU backends.
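The benchmark driver can be sketched as follows; the device template parameter, shot configuration, and timing scaffolding are illustrative assumptions, not the actual benchmark code.

#include <chrono>

// Time the synthesis and execution of qft(init()) on a given backend type.
template<typename Device>
double benchmark_qft()
{
    using namespace LibKet::gates;

    auto expr = qft(init());              // generic QFT expression

    Device device;                        // backend-specific device (qubit count fixed by type)
    device(expr);                         // synthesize for this backend (call syntax assumed)

    auto t0 = std::chrono::steady_clock::now();
    device.eval();                        // blocking execution; 1024 shots assumed
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(t1 - t0).count();
}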

Figure 3 depicts the run times measured for the following QPU backends: Cirq [6] (v0.7.0, generic simulator), pyQuil [21] (v2.19.0, 9q-square-simulator), QI [13] (v1.1.0), Qiskit [1] (v0.17.0, qasm-simulator), and QuEST [11] (v3.1.1, CPU-OpenMP simulator). All runs were performed with 1024 shots on a dual-socket Intel Xeon E5-2687W Sandy Bridge EP system with 2 \(\times\) 8 cores running at 3.1 GHz and 128 GB of DDR3-1600 memory, except for the QI runs, which were executed on a remote system with unknown hardware specification.

For some backends, such as pyQuil and Qiskit, increasing the number of qubits and the circuit depth results in significantly longer run times, while others are less sensitive to these parameters. It should be noted that the run times measured for the pyQuil backend include the transformation of the quantum circuit into executable code by the Quil compiler, which might explain the higher values. The QuEST backend does not allow repeated evaluation of the circuit, so the measured run time might be dominated by overhead costs.

We would like to stress that the presented results are preliminary and should not be considered a comprehensive performance analysis of the QPU backends under consideration. Systematic benchmarking of many more simulator and hardware backends for quantum circuits of different depth and level of entanglement is underway and will be presented in a forthcoming publication.

6 Conclusion

In this paper we have introduced our novel cross-platform programming framework LibKet, which aims at facilitating the use of quantum computers (and their simulators) for accelerating the solution of scientific problems. Primarily addressing today’s GPU programmers as early adopters, our framework is largely inspired by Nvidia’s CUDA toolkit and offers a similar programming model based on quantum kernels that can be executed concurrently using multiple streams. As a unique feature, LibKet does not focus on one particular QPU backend but adopts C++ template meta-programming techniques to enable the development of quantum algorithms as generic expressions that can be synthesized to various QPU-backend types, following the write-once-run-anywhere principle.

Ongoing developments focus on the extension of the algorithm library (mid-level API; cf. Fig. 1), especially variants of the HHL solver [9] and its computational ingredients such as eigenvalue estimation. Another line of research addresses the implementation of basic arithmetic routines, which are also used inside the HHL algorithm to invert eigenvalues. Finally, the extension of the low-level API to support additional QPU backends and to reduce the computational overhead incurred by the use of the embedded Python interpreter and the conversion from JSON objects to C++ types is an ongoing effort.

Despite the early development stage of the framework, we would like to encourage the scientific computing community to report their experience with it and to communicate feature requests for forthcoming releases to the authors.