1 Introduction

This paper completes the work described in [27], in which a deterministic and physically accurate solver for Double-Gate Metal Oxide Semiconductor Field-Effect Transistors (DG MOSFETs) was implemented on a high-performance platform in order to alleviate the computational weight of such a high-dimensional model. Nanoscale DG MOSFETs are a key element in modern integrated circuits, and their modeling and simulation contribute to their downscaling following Moore’s law. Figure 1 sketches the geometry and spatial dimensions of the particular 2D DG-MOSFET device considered here.

Fig. 1

Geometry and spatial dimensions of the nanoscale 2D DG-MOSFET

The deterministic model consists of a set of collisional Boltzmann equations describing electron transport inside the structure, coupled with a 1D Schrödinger–2D Poisson block computing the eigenstates; in dimensionless form (after a Cartesian-to-ellipsoidal change of variables in the momentum space), the system reads:

$$\begin{aligned}&\frac{\partial \varPhi _{\nu ,p}}{\partial t} + \frac{\partial }{\partial x} \left[ a^1_{\nu } \, \varPhi _{\nu ,p} \right] + \frac{\partial }{\partial w} \left[ a^2_{\nu ,p} \, \varPhi _{\nu ,p} \right] + \frac{\partial }{\partial \phi } \left[ a^3_{\nu ,p} \, \varPhi _{\nu ,p} \right] = {\mathcal {Q}}_{\nu ,p}[\varPhi ] \, s_\nu (w) \end{aligned}$$
(1)
$$\begin{aligned}&\quad - \frac{1}{2} \frac{\textrm{d}}{\textrm{d}z} \left( \frac{1}{m_{z,\nu }} \frac{\textrm{d} \psi _{\nu ,p} }{\textrm{d}z} \right) - \left( V + V_c \right) \psi _{\nu ,p} = \epsilon _{\nu ,p} \, \psi _{\nu ,p} \end{aligned}$$
(2)
$$\begin{aligned}&\qquad \qquad - \nabla \cdot \left( \varepsilon _\textrm{R} \, \nabla V \right) = - \left( N - N_D \right) . \end{aligned}$$
(3)

where \(z \in [0,1]\) is the electron confinement dimension (transversal dimension) and \(x \in [0,1]\) is the electron transport dimension (longitudinal dimension), \(w\in [0,\infty [\) is a dimensionless energy, \(\phi \in [0, 2\pi [\) is the azimuthal angle, \(\nu \in \{0,1,2\}\) indexes the valley (we consider three valleys in the silicon band structure) and \(p \in \{0,\ldots ,5\}\) indexes the subband (energy level).

Here, \(\varPhi _{\nu ,p}(t,x,w,\phi )\) is the probability density of finding an electron of the \(\nu ^{\textrm{th}}\) valley and \(p^{\textrm{th}}\) subband at time t, at position x, with energy-angle coordinates \((w,\phi )\) in the 2D momentum space.

The presence of several valleys inside the Si band structure, together with the confinement due to the oxide layers, implies that there are as many Boltzmann Transport Equations (BTEs) (1) as \((\nu ,p)\)-pairs; in each BTE, the electrons are advected through the fluxes given by

$$\begin{aligned} a^1_{\nu }(w,\phi )&= \frac{\sqrt{2 w (1 + \alpha _\nu w)}\cos (\phi )}{\sqrt{m_{x,\nu }} (1 + 2 \alpha _{\nu } w)},\\ a^2_{\nu ,p}(x,w,\phi )&= - \frac{\partial \epsilon _{\nu ,p}}{\partial x}(x) \ a^1_{\nu }(w,\phi ),\\ a^3_{\nu ,p}(x,w,\phi )&= \frac{\partial \epsilon _{\nu ,p}}{\partial x}(x) \ \frac{1}{\sqrt{2w(1+\alpha _\nu w) }} \ \frac{ \sin (\phi ) }{ \sqrt{m_{x,\nu }}} \end{aligned}$$

where \(\epsilon _{\nu ,p}(x)\) are the energy levels, \(\alpha _{\nu }\) is the Kane’s non-parabolicity factor for the \(\nu ^{\textrm{th}}\) valley, and \(m_{x,\nu }\) is the electron effective mass along dimension x for the \(\nu ^{\textrm{th}}\) valley (see Appendix A for the details about \(\alpha _{\nu }\) and \(m_{x,\nu }\)).

The scattering operator \({{\mathcal {Q}}}_{\nu ,p}[\varPhi ]\) describes the electron–phonon interactions, and \(s_\nu (w)\) is a given function arising from the change of variables in the momentum space. We refer to [27, 38] for the details about these terms.

In the Schrödinger equations (2), which describe the confinement, \(\psi _{\nu ,p}(x,z)\) are the wave functions and V(x,z) is the electrostatic potential. Additionally, \(V_c(z)\) represents the MOSFET’s confinement potential and \(m_{z,\nu }\) is the electron effective mass along dimension z for the \(\nu ^{\textrm{th}}\) valley (see Appendix A for the details about \(V_c(z)\) and \(m_{z,\nu }\)). Since dimension x acts only as a parameter, we have to solve as many eigenproblems as the number of Si valleys times the number of discretization points along the x-dimension.

In the Poisson equation (3), the divergence and gradient operators act on both the transport and the confinement dimensions (i.e. on (x,z)). The surface densities \(\varrho _{\nu ,p}\) and the volume density N appearing in (3) are given by:

$$\begin{aligned} \varrho _{\nu ,p}(x)&= \int _{w'=0}^{+\infty } \int _{\phi '=0}^{2\pi } \varPhi _{\nu ,p}(w',\phi ') \, \textrm{d}\phi ' \, \textrm{d}w',\\ N(x,z)&= \sum _{\nu ,p} \varrho _{\nu ,p} \cdot \left| \psi _{\nu ,p} \right| ^2. \end{aligned}$$

In (3), \(\varepsilon _{\textrm{R}}\) represents the dielectric constant and \(N_D(x,z)\) is the doping profile which takes into account the injected impurities in the semiconductor lattice (see Appendix A for the details about \(\varepsilon _{\textrm{R}}\) and \(N_D(x,z)\)).

The numerical solver described in [27] fully ports onto the GPU the transport phase (called the BTE phase), in which the BTEs (1) are solved; the goal of the present paper is to describe how we fully port onto the GPU the phase corresponding to the solution of the Schrödinger-Poisson block (2)-(3) (called the iter phase). We hence achieve a twofold improvement:

  • to exploit the higher computational power of modern GPUs to accelerate this computational phase and

  • to definitively avoid the costly data transfers between the host and the device RAM in the heterogeneous platform.

In order to solve the Schrödinger-Poisson block (2)-(3), whose inputs are the surface densities \(\varrho _{\nu ,p}(x)\) and whose outputs are the energy levels \(\epsilon _{\nu ,p}(x)\), the wave functions \(\psi _{\nu ,p}(x,z)\) and the electrostatic potential V, a Newton-Raphson iterative algorithm is used, as in the previous works (we refer the reader to [27] and references therein for more details). An iteration of the Newton-Raphson algorithm consists of two main computational phases, which will be described separately in the following (see Fig. 2; a schematic sketch of the loop is given after this list):

  a) Updating of the guess for the potential V through a Poisson-like equation (unlike the Poisson equation (3), it contains an additional non-local term). The linear system deriving from the Poisson-like equation, whose solution is the updated guess on the potential V, is solved by means of a Scheduled Relaxation Jacobi (SRJ) scheme [2, 3, 39]: a sequence of relaxed Jacobi steps with different relaxation factors, constructed in such a way as to boost convergence to the solution.

  b) Updating of the eigenstates \(\{\epsilon _{\nu ,p}(x)\}\) and \(\{\psi _{\nu ,p}(x,z)\}\) through the Schrödinger equation (2). The computation of the energy levels \(\{\epsilon _{\nu ,p}(x)\}\), i.e. the eigenvalues of the Schrödinger matrix, is achieved by using a multi-section algorithm [24] in the initial time step and a Newton-Raphson iterative algorithm in the following steps. Once the energy levels have been computed, the wave functions \(\{\psi _{\nu ,p}(x,z)\}\), which are the eigenvectors of the Schrödinger matrix, are computed by means of the Inverse Power Iterative Method (IPIM) [16], which in turn exploits the Thomas algorithm [40] for the solution of the tridiagonal linear systems appearing at each iteration.
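
As a purely illustrative sketch, the outer structure of one call to this block can be organized as follows; all function names are hypothetical placeholders standing for the routines and CUDA kernels described in Sects. 3 and 4, not the actual implementation.

```cuda
// Hypothetical host-side sketch of one call to the Schrödinger-Poisson block.
// The four routines below are placeholders for the GPU phases of Sects. 3-4.
void build_linear_system(const double* rho, const double* eps,
                         const double* psi, const double* V);          // assembles L^(k), R^(k)
double solve_with_SRJ(double* V);                                       // SRJ cycles; returns ||V^(k+1)-V^(k)||_inf
void update_eigenvalues(const double* V, double* eps);                  // multi-section / Newton-Raphson
void update_eigenvectors(const double* V, const double* eps, double* psi); // inverse power iteration (IPIM)

void schroedinger_poisson_block(const double* rho,  // input: surface densities
                                double* eps,        // output: energy levels
                                double* psi,        // output: wave functions
                                double* V,          // in/out: electrostatic potential
                                double tol, int max_iter)
{
  for (int k = 0; k < max_iter; ++k) {
    // (a) update the guess on the potential through the Poisson-like system (8)
    build_linear_system(rho, eps, psi, V);
    const double dV = solve_with_SRJ(V);

    // (b) keep the eigenstates consistent with the refined potential (2)
    update_eigenvalues(V, eps);
    update_eigenvectors(V, eps, psi);

    if (dV < tol) break;   // convergence of the Newton-Raphson loop
  }
}
```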

The parallel implementation of the numerical solution of the Schrödinger-Poisson block for simulating semiconductor devices has been tackled with different approaches and programming technologies. Initially, numerical solvers for shared-memory parallel architectures were derived using OpenMP [10]. For instance, an OpenMP implementation of a numerical solver for a drift-diffusion-Schrödinger-Poisson model is described in [33], and a 2D multi-subband ensemble Monte Carlo simulator of 2D MOSFET devices which solves the Poisson-Schrödinger block is described in [37]. Subsequently, versions of solvers for the Poisson-Schrödinger block on distributed-memory machines were obtained using the Message Passing Interface (MPI) for the interprocessor communication. For example, the nanoelectronics modeling tool NEMO5 [35] includes a Schrödinger-Poisson simulation, and the parallelization of the simulations in NEMO5 is based on geometric partitioning techniques using MPI and several portable open-source packages. A parallel 1D Schrödinger-3D Poisson solver is implemented with a Gummel iterative method [17] using MPI and the PETSc library [5, 6] in [20]. In [22], a parallel implementation to simulate a metal-oxide-semiconductor (MOS) device, where a set of 1D Schrödinger-Poisson equations is solved, is described. In this implementation, a parallel divide-and-conquer algorithm is developed to solve the Schrödinger equation, while the Poisson equation is solved with a parallelization of a monotone iterative method. Additionally, an MPI implementation of a resolution scheme of 2D Schrödinger equation-based corrections compatible with an existing parallel drift-diffusion model was derived in [14] to simulate 3D semiconductor devices in the simulation framework VENDES [34].

The present work is also of interest for other kinds of solvers which require the solution of the Schrödinger-Poisson block but use a less accurate description of the carriers in nanoscale semiconductors. The Schrödinger-Poisson solver, seen as a black box, receives as input the surface electron densities and returns as output the eigenstates, and in particular the force field that drives the electrons along the device thanks to the applied voltage. Therefore, this machinery and its efficient implementation on CUDA-enabled platforms can be adapted to macroscopic models, which are in general preferred in industrial simulations because of their lower computational cost, such as drift-diffusion solvers [13, 19, 29, 33], Monte Carlo solvers [12, 32, 37], solvers based on the maximum-entropy-principle energy transport model [8, 26] and Spherical Harmonics Expansion (SHE) solvers [21].

The paper is organized as follows: in Sect. 2, we summarize the model and the equations on which we focus; in Sect. 3, we describe the solvers and the strategy implemented to achieve a solution of the Poisson-like equation on GPU; in Sect. 4, we describe the solvers and the procedure employed to compute the eigenstates on GPU; in Sect. 5, we show the numerical results we have obtained on a dual processor server equipped with powerful modern GPUs; in Sect. 6, we draw some conclusions and sketch the future work in this promising research line.

2 The Schrödinger-Poisson solver

From an algorithmic point of view, the Schrödinger-Poisson block (2)-(3) receives as input the surface densities \(\{\varrho _{\nu ,p}\}\) and returns as result the energy levels \(\left\{ \epsilon _{\nu ,p} \right\}\), the wave vectors \(\left\{ \psi _{\nu ,p} \right\}\) and the electrostatic potential V [7, 27, 38], as shown in Fig. 2. In this figure, \(\nu \in \{ 0,1,2 \}\) denotes the valley, \(p \in \{ 0, \dots , N_{\textrm{sbn}}-1 \}\) denotes the subband (we consider \(N_{\textrm{sbn}}=6\)), \(i=0,\dots ,N_x-1\) denotes the index of a discretization point in the longitudinal dimension (x) of the physical 2D device, where \(N_x\) is the number of discretization points in that dimension, \(j=0,\dots ,N_z-1\) denotes the index of a discretization point in the transversal (confined) dimension of the device (\(N_z\) is the number of discretization points in that dimension), and s denotes the particular stage (\(s=0,1,2\)) of the third-order Total-Variation Diminishing Runge–Kutta method [9] used for time integration.

From now on, we refer to the energy levels \(\left\{ \epsilon _{\nu ,p} \right\}\) as the eigenvalues (of the Schrödinger matrix) and the wave vectors \(\left\{ \psi _{\nu ,p} \right\}\) as the eigenvectors.

Fig. 2

Structure of the iterative solver for the Schrödinger-Poisson block. Two main phases appear: the update of the electrostatic potential V, and the diagonalization of the Schrödinger matrix to keep consistency with V

Equations (2)-(3) have to be seen as a block because:

  • The 1D steady-state Schrödinger equation (2) takes as input the potential \(\{V_{i,j}\}\) and returns as many eigenvalues \(\left\{ \epsilon _{\nu ,p} \right\}\) and corresponding eigenvectors \(\left\{ \psi _{\nu ,p} \right\}\) as needed for the sake of precision, and this must be done for each fixed position \(x_i\) and each fixed valley \(\nu \in \{ 0,1,2 \}\). As an example, with \(N_{x} = 65\), \(N_{\textrm{sbn}} = 6\) and three valleys, our solver has to compute \(3 \times 6 \times 65 = 1170\) eigenvalues and eigenvectors.

  • The 2D Poisson equation (3) receives as input the eigenvectors \(\{\psi _{\nu ,p,i,j}\}\) and provides as output the potential \(\{V_{i,j}\}\).

So, as can be seen, the output of (2) is the input of (3) and vice versa. In the following we describe the strategy to solve this block.

The idea is to restate (3) as seeking the zero of the functional

$$\begin{aligned} P[V] := - \nabla \cdot \left( \varepsilon _\textrm{R} \, \nabla V \right) + \sum _{\nu ,p} \varrho _{\nu ,p}(x) \cdot \left| \psi _{\nu ,p} \right| ^2 - N_D \end{aligned}$$
(4)

under the constraints of the Schrödinger equation (2) via a Newton-Raphson iterative scheme:

$$\begin{aligned}&V^{(0)} \hbox { is given} \nonumber \\&P\left[ V^{(k)} \right] + \textrm{d}P \left( V^{(k)} , V^{(k+1)} - V^{(k)} \right) = 0 \qquad \text{ for } k \ge 0. \end{aligned}$$
(5)

Obviously, stage \(k+1\) is a refinement of the previous stage k. The derivative is meant in a directional sense (Fréchet derivative). Details of the computations can be found in [7].

The scheme is sketched in Fig. 2: starting from an initial guess, we refine the guess on the potential, and keep consistency with the eigenstates.

From a computational point of view, this means that we have to alternate between the solution of the Schrödinger eigenproblem (2) and that of the linear system (5). The strategies to deal with this process are described in the following.

2.1 Schrödinger diagonalization

We can rewrite the steady-state Schrödinger equation in terms of the V-dependent linear operator \({\mathcal {L}}\):

$$\begin{aligned} S[V](\varPsi ) = - \frac{1}{2} \frac{\textrm{d}}{\textrm{d}z} \left( \frac{1}{m_{z,\nu }} \frac{\textrm{d} \varPsi }{\textrm{d}z} \right) - \left( V + V_c \right) \varPsi =: {\mathcal {L}} (\varPsi ). \end{aligned}$$

We wish to compute the first \(N_{\textrm{sbn}}\) eigenvalues and corresponding eigenvectors (we recall that they will equivalently be referred to as energy levels and wave functions).

In order to do this, we take into account the uniform grid described in [27] for the spatial dimensions (x and z) and discretize the operator using finite differences. As a result, a symmetric tridiagonal matrix of order \(\textrm{n}:= N_z-2\) is obtained:

$$\begin{aligned} {\mathcal {L}}_{\nu ,i} = \left( \begin{array}{ccccccc} d_0 & e_0 & & & & &\\ e_0 & d_1 & e_1 & & & &\\ & e_1 & d_2 & e_2 & & &\\ & & e_2 & d_3 & e_3 & &\\ & & & \ddots & \ddots & \ddots &\\ & & & & e_{\textrm{n}-3} & d_{\textrm{n}-2} & e_{\textrm{n}-2}\\ & & & & & e_{\textrm{n}-2} & d_{\textrm{n}-1} \end{array} \right) \end{aligned}$$
(6)

where, for \(j=1,\dots ,N_z-2\),

$$\begin{aligned} d_{j} := \left( \frac{\frac{1/4}{m_{\textrm{z},\nu ,i,j-1}} + \frac{1/2}{m_{\textrm{z},\nu ,i,j} } + \frac{1/4}{m_{\textrm{z},\nu ,i,j+1}}}{\varDelta z^2} - V_{i,j} \right) \end{aligned}$$

give the elements on the diagonal, and, for \(j=1,\dots ,N_{z}-3\),

$$\begin{aligned} e_{j} := \left( -\frac{ \frac{1/4}{m_{\textrm{z},\nu ,i,j}} + \frac{1/4}{m_{\textrm{z},\nu ,i,j+1}} }{\varDelta z^2} \right) \end{aligned}$$

give the elements on the sub-diagonal (and the super-diagonal).

The values of the effective masses \(m_{\textrm{z},\nu }\), for the particular case of the DG MOSFET device, depend on the material:

$$\begin{aligned} m_{\textrm{z},\nu ,i,j}= \left\{ \begin{array}{ll} 0.5 & \ \text{ if } (i,j) \text{ is in the } \textrm{SiO}_2 \text{ region } \\ 0.19 & \ \text{ if } \nu <2 \text{ and } (i,j) \text{ is in the Si region }\\ 0.98 & \ \text{ if } \nu =2 \text{ and } (i,j) \text{ is in the Si region. } \end{array} \right. \end{aligned}$$
(7)

From this matrix we extract, by the methods described in Sect. 4, the first (lowest) \(N_{\textrm{sbn}}\) eigenvalues \(\displaystyle \left\{ \epsilon _{\nu ,p,i} \right\} _{p \in \{ 0,\dots ,N_{\textrm{sbn}}-1 \}}\) and the corresponding eigenvectors \(\displaystyle \left\{ \psi _{\nu ,p,i,j} \right\} _{(p,j) \in \{ 0,\dots ,N_{\textrm{sbn}}-1 \} \times \{ 0,\dots , N_z-1 \}}\).

We take into account the boundary condition

$$\begin{aligned} \psi _{\nu ,p,i,0} = \psi _{\nu ,p,i,N_z-1} = 0 \end{aligned}$$

and the normalization of the eigenvectors

$$\begin{aligned} \left( \psi _{\nu ,p,i,j} \longleftarrow \frac{\psi _{\nu ,p,i,j}}{\sqrt{\varDelta z \sum _{j'=1}^{N_z-2} \left| \psi _{\nu ,p,i,j'} \right| ^2}} \right) _{j=1,\dots ,N_z-2}. \end{aligned}$$
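
For illustration only, the assembly of the coefficients (6)-(7) for one fixed pair \((\nu ,i)\) may be sketched as follows, under the assumption that the potential and the effective masses are available as flat arrays indexed by the grid point j; all names are placeholders.

```cuda
// Sketch: assembly of the symmetric tridiagonal Schrödinger matrix (6)
// for one fixed pair (nu, i). The interior grid points j = 1..Nz-2 map
// to the matrix rows 0..n-1, with n = Nz-2. m_z and V hold m_{z,nu,i,j}
// and V_{i,j} for j = 0..Nz-1; all names are illustrative placeholders.
__host__ __device__
void assemble_schroedinger_matrix(const double* m_z, const double* V,
                                  double dz, int Nz,
                                  double* d,   // diagonal, Nz-2 entries
                                  double* e)   // sub/super-diagonal, Nz-3 entries
{
  const double dz2 = dz * dz;
  for (int j = 1; j <= Nz - 2; ++j)            // diagonal entries (cf. d_j)
    d[j - 1] = (0.25 / m_z[j - 1] + 0.5 / m_z[j] + 0.25 / m_z[j + 1]) / dz2 - V[j];
  for (int j = 1; j <= Nz - 3; ++j)            // off-diagonal entries (cf. e_j)
    e[j - 1] = -(0.25 / m_z[j] + 0.25 / m_z[j + 1]) / dz2;
}
```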

2.2 Evaluation of the directional derivative and construction of the linear system

One stage of the Newton-Raphson scheme on (4) translates into solving (5) (more details about the derivation can be found in [7]). This scheme boils down to the linear system in \(V^{(k+1)}\)

$$\begin{aligned} L^{(k)} \, V^{(k+1)} = R^{(k)}, \end{aligned}$$
(8)

where

$$\begin{aligned} L^{(k)} \, V^{(k+1)} =&-\textrm{div} \left[ \varepsilon _\textrm{R} \, \nabla V^{(k+1)} \right] + \int {\mathcal {A}}^{(k)}(x,z,\zeta ) \, V^{(k+1)}(x,\zeta ) \, \textrm{d}\zeta \nonumber \\ R^{(k)} =&- N^{(k)}(x,z) + \int {\mathcal {A}}^{(k)}(x,z,\zeta ) \, V^{(k)}(x,\zeta ) \, \textrm{d}\zeta , \end{aligned}$$
(9)

where \({\mathcal {A}}^{(k)}(x,z,\zeta ):= {\mathcal {A}}[V^{(k)}](x,z,\zeta )\) is essentially the directional derivative of the density \(N^{(k)}:= N[V^{(k)}]\) [15].

2.2.1 Evaluation of the directional derivative (Fréchet derivative)

The evaluation of \({\mathcal {A}}^{(k)}(x,z,\zeta )\) at the grid points reads:

$$\begin{aligned} {\mathcal {A}}^{(k)}_{i,j,j'} = 2 \sum _{\nu ,p} \sum _{p' \ne p} \frac{\varrho ^{s+1}_{\nu ,p,i} - \varrho ^{s+1}_{\nu ,p',i}}{\epsilon ^{(k)}_{\nu ,p',i} - \epsilon ^{(k)}_{\nu ,p,i}} \times \psi ^{(k)}_{\nu ,p,i,j'} \, \psi ^{(k)}_{\nu ,p',i,j'} \, \psi ^{(k)}_{\nu ,p',i,j} \, \psi ^{(k)}_{\nu ,p,i,j}. \end{aligned}$$
(10)

We recall that, here, the surface densities \(\{\varrho ^{s+1}_{\nu ,p,i}\}\) are the input to the whole Schrödinger-Poisson block, seen as a black box, where s indexes the external Runge–Kutta stage governed by the time integrator, while the index k refers to the Newton-Raphson stage.
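
Purely as an illustration, a CUDA kernel evaluating (10) with one thread per triplet \((i,j,j')\) could be sketched as follows; the array layouts and names are assumptions and do not reproduce the actual cuda_compute_frechet kernel.

```cuda
// Sketch of a kernel evaluating (10): one thread per (i, j, jp).
// Degenerate subbands (equal energies) are not handled in this sketch.
__global__ void frechet_kernel(const double* rho,   // rho[(nu*Nsbn+p)*Nx + i]
                               const double* eps,   // eps[(nu*Nsbn+p)*Nx + i]
                               const double* psi,   // psi[((nu*Nsbn+p)*Nx + i)*Nz + j]
                               double* A,           // A[(i*Nz + j)*Nz + jp]
                               int Nx, int Nz, int Nsbn)
{
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= Nx * Nz * Nz) return;
  const int i  = idx / (Nz * Nz);
  const int j  = (idx / Nz) % Nz;
  const int jp = idx % Nz;

  double sum = 0.0;
  for (int nu = 0; nu < 3; ++nu)
    for (int p = 0; p < Nsbn; ++p)
      for (int pp = 0; pp < Nsbn; ++pp) {
        if (pp == p) continue;
        const int a = (nu * Nsbn + p ) * Nx + i;   // index of (nu, p , i)
        const int b = (nu * Nsbn + pp) * Nx + i;   // index of (nu, p', i)
        const double coef = (rho[a] - rho[b]) / (eps[b] - eps[a]);
        sum += coef * psi[a * Nz + jp] * psi[b * Nz + jp]
                    * psi[b * Nz + j ] * psi[a * Nz + j ];
      }
  A[idx] = 2.0 * sum;   // A^{(k)}_{i,j,j'} as in (10)
}
```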

2.2.2 Construction of the linear system

The divergence term in the linear operator (9) reads

$$\begin{aligned} \textrm{div} \left[ \varepsilon _\textrm{R} \, \nabla V^{(k+1)} \right] =&\frac{\partial }{\partial x} \left( \varepsilon _\textrm{R} \, \frac{\partial V^{(k+1)}}{\partial x} \right) +\frac{\partial }{\partial z} \left( \varepsilon _\textrm{R} \, \frac{\partial V^{(k+1)}}{\partial z} \right) \end{aligned}$$

and is discretized using the following finite-difference approximation:

$$\begin{aligned} \left( \textrm{div} \left[ \varepsilon _\textrm{R} \, \nabla V^{(k+1)} \right] \right) _{i,j} =&\left( \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i-1,j} + \frac{1}{2}(\varepsilon _\textrm{R})_{i,j}}{\varDelta x^2} \right) V^{(k+1)}_{i-1,j} + \left( \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i,j-1} + \frac{1}{2}(\varepsilon _\textrm{R})_{i,j}}{\varDelta z^2} \right) V^{(k+1)}_{i,j-1} \nonumber \\&- \left( \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i-1,j} + (\varepsilon _\textrm{R})_{i,j} + \frac{1}{2}(\varepsilon _\textrm{R})_{i+1,j}}{\varDelta x^2} + \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i,j-1} + (\varepsilon _\textrm{R})_{i,j} + \frac{1}{2}(\varepsilon _\textrm{R})_{i,j+1}}{\varDelta z^2} \right) V^{(k+1)}_{i,j} \nonumber \\&+ \left( \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i,j} + \frac{1}{2}(\varepsilon _\textrm{R})_{i,j+1}}{\varDelta z^2} \right) V^{(k+1)}_{i,j+1} + \left( \frac{\frac{1}{2}(\varepsilon _\textrm{R})_{i,j} + \frac{1}{2}(\varepsilon _\textrm{R})_{i+1,j}}{\varDelta x^2} \right) V^{(k+1)}_{i+1,j}. \end{aligned}$$
(11)

The integral is discretized by means of the trapezoidal rule

$$\begin{aligned} \left( \int {\mathcal {A}}^{(k)}(x,z,\zeta ) \, V^{(k+1)}(x,\zeta ) \, \textrm{d}\zeta \right) _{i,j} = \frac{\varDelta z}{2} \left[ \sum _{j'=0}^{N_z-2} {\mathcal {A}}^{(k)}_{i,j,j'} \, V^{(k+1)}_{i,j'} + \sum _{j'=1}^{N_z-1} {\mathcal {A}}^{(k)}_{i,j,j'} \, V^{(k+1)}_{i,j'} \right] . \end{aligned}$$
(12)

For the right hand side \(R^{(k)}\), the integral is computed in a similar way to (12), and the density is simply

$$\begin{aligned} N^{(k)}_{i,j} = 2 \sum _{\nu ,p} \varrho ^{s+1}_{\nu ,p,i} \left| \psi ^{(k)}_{\nu ,p,i,j} \right| ^2. \end{aligned}$$

As for the boundary conditions, Dirichlet conditions are imposed at the metallic contacts (source, drain and the two gates), while homogeneous Neumann conditions are taken elsewhere.

As a remark, the Dirichlet conditions at the source and drain contacts represent the potential applied across the device, while the Dirichlet conditions applied at the gates represent the control on the opening and closing of the channel, thus switching the device between the on and the off states.

3 Highly-parallel methods for the linear system

The matrix \(L^{(k)}\) representing the linear system (8) is of order \(N_x N_z\), and contains \(N_x\) square blocks of size \(N_z\) on the diagonal.

One approach to solving this linear system is to employ strategies that significantly accelerate the convergence of the Jacobi method without losing its simplicity and locality [2, 30, 39]. Following this approach, in this work we have implemented on GPU a Scheduled Relaxation Jacobi (SRJ) method [2, 39] to solve this type of system efficiently. SRJ methods extend the Jacobi method for linear systems resulting from elliptic PDEs and present several important advantages for our particular case:

  • they exhibit excellent convergence behaviour while preserving the simplicity and the straightforward parallelization of Jacobi method,

  • they are particularly suitable for linear systems which result from discretizing Poisson-like PDEs and

  • they do not require advanced preconditioning (we can use the inverse of the diagonal, as in the Jacobi iteration).

An alternative approach would be to use Krylov subspace iterative methods such as the Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES) methods [31]. However, these methods have a more complex implementation than the Jacobi method and require effective preconditioners to ensure fast convergence; such preconditioners usually increase the computational cost notably and may involve a significant parallelization effort. Moreover, in [30] it is shown that approaches based on accelerating the Jacobi iteration can be an efficient alternative to the Krylov subspace methods.

3.1 The Scheduled Relaxation Jacobi (SRJ) method

The Jacobi method for the solution of a linear system provides a poor convergence rate but exhibits a high degree of concurrency, as each component of the solution vector can be updated completely independently of all the other components.

Suppose we have to solve the system \({\varvec{A}} {\varvec{u}} ={\varvec{b}}\), where \({\varvec{A}}=\left( a_{ij}\right) _{N\times N}\) (\(i=0,...,N-1, j=0,...,N-1\)) is a square matrix of order N, \({\varvec{b}}\) is a vector of size N and \({\varvec{D}}\) is the diagonal part of \({\varvec{A}}\) (\({\varvec{D}} =\textrm{diag} \left( a_{00}, a_{11}, \dots , a_{N-1 N-1} \right)\)).

A classical Jacobi iteration can be rewritten in vector form [1] in order to exploit the matrix–vector product operation:

$$\begin{aligned}&\hbox {take} \ {\varvec{u}} \ \hbox {as initial guess}\\&\hbox {repeat} \ {\varvec{u}} \longleftarrow {\varvec{u}} + {\varvec{D}}^{-1}({\varvec{b}}-{\varvec{A}} \, {\varvec{u}}) \ \hbox {until convergence.} \end{aligned}$$

A significant acceleration of the Jacobi algorithm can be obtained by applying the Scheduled Relaxation Jacobi (SRJ) method. The SRJ method extends the classical Jacobi method by introducing P different relaxation factors \(\omega _i > 0, i=1,\ldots ,P\). In the SRJ method, one relaxed Jacobi step with parameter \(\omega _i\) has the following form:

$$\begin{aligned} {\varvec{u}} \longleftarrow {\varvec{u}} + \omega _i \, {\varvec{D}}^{-1}({\varvec{b}}-{\varvec{A}} \, {\varvec{u}}). \end{aligned}$$
(13)

In SRJ, we perform several cycles until reaching convergence. In each cycle, we carry out M relaxed Jacobi steps of the form (13), where

$$\begin{aligned} M=\sum _{i=1}^P q_i, \end{aligned}$$

and \(q_i\) is the number of times the parameter \(\omega _i\) is applied within the cycle.

Therefore, an SRJ cycle consists of a sequence of M relaxed Jacobi steps. In our experiments, we have obtained good results with \(P=7\) relaxation factors and cycles of \(M=93\) steps, using the following relaxation parameters:

$$\begin{aligned}&(\omega _1, q_1) = (370.035,1)\quad (\omega _2, q_2) = (167.331,2)\\&(\omega _3, q_3) = (51.1952,3)\quad (\omega _4, q_4) = (13.9321,7) \\&(\omega _5, q_5) = (3.80777, 13)\quad (\omega _6, q_6) = (1.18727,26)\quad (\omega _7, q_7) = (0.556551,41). \end{aligned}$$

In [2], one can find optimal values of the parameters \(\omega _i, i=1,\ldots ,P\) for several values of both the number of relaxation levels P and the number of grid points (for a discretization using 2nd-order central differences of a 2D Laplace equation on a uniform grid). In particular, we have used the values for the case of \(P=7\) levels and \(N=32\) points (N must be less than \(\max (N_x,N_z)\)), for which we have experimentally obtained very good convergence results. The parameters \(q_i, i=1,\ldots ,P\) and M are easily inferred from the parameters \(\beta _i, i=1,\ldots ,P\) (also given in [2]), which describe the proportion of iterations in which a given weight \(\omega _i\) is applied over the total number of iterations of each cycle.
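
As an illustrative sketch, one SRJ cycle with this schedule can be driven from the host roughly as follows; srj_step is a hypothetical placeholder for the CUDA kernels described in Sect. 3.2, and, for brevity, the over-relaxed steps are applied in descending order instead of being evenly interleaved as discussed below.

```cuda
// Host-side sketch of one SRJ cycle with the schedule used in this work.
// srj_step stands for the pair of kernels computing x = b - A*u and
// u <- u + omega*x; the real code interleaves the over-relaxed steps
// evenly across the cycle (see Sect. 3.2).
void srj_step(double* u, const double* b, double omega);   // placeholder

void srj_cycle(double* u, const double* b)
{
  const double omega[7] = {370.035, 167.331, 51.1952, 13.9321,
                           3.80777, 1.18727, 0.556551};
  const int    q[7]     = {1, 2, 3, 7, 13, 26, 41};   // sums to M = 93
  for (int i = 0; i < 7; ++i)
    for (int r = 0; r < q[i]; ++r)
      srj_step(u, b, omega[i]);   // one relaxed Jacobi step (13)
}
```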

Algorithm 1 CPU-GPU implementation of the SRJ method

3.2 Implementation details

Algorithm 1 describes the CPU-GPU implementation of the SRJ method. As initial value for vector \(\varvec{u_0}\), we use the last known value for the potential vector V (obtained in the previous Newton-Raphson iteration or in the previous Runge-Kutta stage).

The selection of the next \(\omega _i\) within an SRJ cycle does not follow the natural order in which \(\omega _1\) is applied \(q_1\) times, then \(\omega _2\), and so on; instead, the over-relaxed Jacobi steps (with \(\omega _i>1\)) are evenly spaced over the SRJ cycle to avoid overflow in the numerical experiments (see [39] for more details).

To implement each SRJ step (13) in CUDA we need, among others, a CUDA kernel to perform the sparse matrix–vector product

$$\begin{aligned} {\varvec{x}}={\varvec{A}} \cdot \varvec{u_0}. \end{aligned}$$
(14)

This CUDA kernel uses one-dimensional CUDA blocks and takes into account the narrow-banded structure of the sparse matrix \({\varvec{A}}\). In this kernel, each element of the vector \({\varvec{x}}\) (the dot product of the corresponding row of the sparse matrix \({\varvec{A}}\) with the vector \(\varvec{u_0}\)) is computed by a different CUDA warp (see Fig. 3). We store the matrix \({\varvec{A}}\) in global memory as a rectangular array whose row dimension is equal to the bandwidth of \({\varvec{A}}\). We use one-dimensional CUDA blocks where each CUDA block computes \(\frac{B}{32}\) elements of \({\varvec{x}}\), where B is the block size. Initially, all the warps in a CUDA block cooperate to read, in a coalesced way, the required values of \(\varvec{u_0}\) and load them into a shared-memory array \(\varvec{s\_u}\). Then, the j-th warp in the k-th CUDA block reads the corresponding non-zero values in the row \(t=\frac{k B}{32}+j\) of \({\varvec{A}}\) and the affected values of \(\varvec{s\_u}\) in order to compute the t-th element of \({\varvec{x}}\). For this, each thread in the j-th warp computes one partial value, and all the threads in the warp cooperate, following a reduction algorithm based on a warp shuffle operation [23], to add their previously computed values efficiently. In particular, we have used the operation __shfl_xor_sync (we assume a compute capability of at least 3.0) to perform the addition at warp level.

The components of the vector \({\varvec{x}}\) obtained by each block are stored in a shared-memory array \(\varvec{s\_x}\) and then written, in a coalesced way, to the global memory vector \({\varvec{x}}\).
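
A simplified sketch of such a warp-per-row product is given below; the band storage (a dense rows-by-bandwidth array plus the first column index of each row) and the omission of the shared-memory staging of \(\varvec{u_0}\) and \({\varvec{x}}\) are simplifying assumptions with respect to the actual kernel.

```cuda
// Simplified sketch of the banded matrix-vector product x = A*u0.
// Each warp computes one element of x; the real kernel additionally
// stages u0 and x through shared memory. Layout and names are assumptions.
__global__ void banded_matvec(const double* __restrict__ A,      // Nrows x bw, row-wise
                              const int*    __restrict__ first,  // first column of each row
                              const double* __restrict__ u0,
                              double*       __restrict__ x,
                              int Nrows, int bw)
{
  const int lane = threadIdx.x & 31;                  // lane id within the warp
  const int warp = threadIdx.x >> 5;                  // warp id within the block
  const int row  = blockIdx.x * (blockDim.x >> 5) + warp;
  if (row >= Nrows) return;

  // each lane accumulates a partial dot product over the band
  double partial = 0.0;
  for (int k = lane; k < bw; k += 32) {
    const int col = first[row] + k;
    if (col >= 0 && col < Nrows)
      partial += A[row * bw + k] * u0[col];
  }

  // warp-level reduction with shuffle-xor: every lane ends up holding
  // the full dot product of row `row` with u0
  for (int mask = 16; mask > 0; mask >>= 1)
    partial += __shfl_xor_sync(0xffffffffu, partial, mask);

  if (lane == 0) x[row] = partial;
}
```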

In our implementation of SRJ, the system is preconditioned by left-multiplying by \({\varvec{D}}^{-1}\), so that the matrix of the linear system contains only ones on the diagonal.

We use another CUDA kernel, which also uses one-dimensional CUDA blocks, to complete the SRJ step by computing the residual vector

$$\begin{aligned} {\varvec{x}}={\varvec{b}}-{\varvec{A}} \, \varvec{u_0} \end{aligned}$$
(15)

and updating the next approximation to the solution

$$\begin{aligned} \varvec{u_1}=\varvec{u_0}+\omega _i \, {\varvec{x}}. \end{aligned}$$
(16)

In order to control the convergence after completing an SRJ cycle, we implement an efficient CUDA reduction algorithm based on [18, 25] to jointly compute the infinity norms of two vectors: the residual vector (\(|{\varvec{b}}-{\varvec{A}} \, \varvec{u_0}|_{\infty }\)) and the new approximation (\(|\varvec{u_0}|_{\infty }\)). In the reduction CUDA kernel, one half of the CUDA block processes a chunk of the residual vector and the other half processes the corresponding chunk of the other vector.
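
A compact sketch of such a reduction kernel is shown below; it assumes a power-of-two block size and dynamic shared memory of blockDim.x doubles, returns per-block partial maxima (the final reduction over blocks is omitted), and is only meant to illustrate the half-block splitting, not to reproduce the actual kernel based on [18, 25].

```cuda
// Sketch: per-block infinity norms of the residual r = b - A*u0 and of u0.
// The first half of the block scans r, the second half scans u0.
__global__ void block_inf_norms(const double* r, const double* u0, int N,
                                double* block_max_r, double* block_max_u)
{
  extern __shared__ double s[];                   // blockDim.x doubles
  const int tid   = threadIdx.x;
  const int half  = blockDim.x / 2;
  const bool doR  = (tid < half);
  const double* v = doR ? r : u0;
  const int lane  = doR ? tid : tid - half;

  // grid-stride scan of the assigned vector
  double m = 0.0;
  for (int i = blockIdx.x * half + lane; i < N; i += gridDim.x * half)
    m = fmax(m, fabs(v[i]));
  s[tid] = m;
  __syncthreads();

  // tree reduction performed independently within each half of the block
  for (int stride = half / 2; stride > 0; stride >>= 1) {
    if (lane < stride) s[tid] = fmax(s[tid], s[tid + stride]);
    __syncthreads();
  }
  if (tid == 0)    block_max_r[blockIdx.x] = s[0];
  if (tid == half) block_max_u[blockIdx.x] = s[half];
}
```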

Fig. 3

Matrix–vector product: \({\varvec{x}}={\varvec{A}} \cdot \varvec{u_0}\). Each CUDA warp computes one element of the output vector \({\varvec{x}}\)

4 Implementation strategies: Diagonalization of the Schrödinger matrix

We need to compute the lowest \(N_{\textrm{sbn}}\) eigenvalues and relative eigenvectors of matrix \({\mathcal {L}}_{\nu ,i}\) in (6). It is known that for a tridiagonal symmetric matrix like \({\mathcal {L}}_{\nu ,i}\), the characteristic polynomial p(X) can be computed via a recursive sequence of polynomials [36]:

$$\begin{aligned} p_0(X)&= 1&\nonumber \\ p_1(X)&= \left( d_{0} - X \right)&\nonumber \\ p_j(X)&= \left( d_{j-1}-X \right) p_{j-1}(X) - e_{j-2}^2 \, p_{j-2}(X)&\text{ for } 2 \le j \le \textrm{n}, \end{aligned}$$
(17)

such that \(p(X) = p_{\textrm{n}}(X)\). In order to seek the zeros of this polynomial, we employ two strategies: either a multi-section iterative algorithm (a generalization of the bisection algorithm) or a Newton-Raphson iterative algorithm. The first is extremely robust and can unconditionally provide selected eigenvalues, but it is costly, whilst the second is faster but needs proper seeding. Therefore, the strategy is the following: at the first step of the time evolution we use the multi-section algorithm; after that, we switch to Newton-Raphson.

4.1 The multi-section algorithm for eigenvalues

The bisection algorithm is a well-known tool for computing eigenvalues, described, for instance, in [11, 36].

In our case, instead of using bisection, we can divide the interval into an arbitrary number of sub-intervals, which we shall call \(N_{\textrm{multi}}\) in the following. If we think of it in a sequential way, the algorithm is less efficient than usual bisection (\(N_{\textrm{multi}}=2\)); nevertheless, this approach could be advantageous on a GPU platform because it better exploits parallelism: we can compute concurrently the \(\sigma\) function

$$\begin{aligned} \sigma (\xi ) := \text{number of sign changes in } \left( p_{\textrm{n}}(\xi ), p_{\textrm{n}-1}(\xi ), \dots , p_{1}(\xi ), p_{0}(\xi ) \right) \end{aligned}$$

at all the intermediate points, and hence use fewer iterations to converge to the desired accuracy. We recall that the polynomials \(\{p_j\}_{j=0,\dots ,n}\) represent the (reversed, backward-indexed) Sturm chain (17), for which the following result holds: let \(\alpha\) be a real number; then the number of zeros in the interval \(]-\infty ,\alpha [\) is given by \(\sigma (\alpha )\). Suppose that the eigenvalues are ordered \(\epsilon _{0}< \epsilon _{1}< \epsilon _{2}< \dots < \epsilon _{\textrm{n}-1}\). As eigenvalue \(\epsilon _{p}\) corresponds to the \((p+1)^{\textrm{th}}\) zero of the characteristic polynomial, then

$$\begin{aligned} \epsilon _{p} < \xi \Longrightarrow \sigma (\xi ) \ge p+1 \qquad \hbox {and} \qquad \epsilon _{p} > \xi \Longrightarrow \sigma (\xi ) \le p. \end{aligned}$$
(18)

The situation is sketched in Fig. 4.
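
As an illustration, \(\sigma (\xi )\) can be evaluated directly from the recursion (17) by tracking only the signs of the sequence; the sketch below uses the simplifying convention that a zero value inherits the previous sign, and omits the rescaling against overflow that a production kernel would need.

```cuda
// Sketch: number of sign changes sigma(xi) in the Sturm sequence
// (p_0(xi), ..., p_n(xi)) built from the recursion (17).
// d has n entries, e has n-1 entries (off-diagonal never vanishes here).
__host__ __device__
int sturm_sign_changes(const double* d, const double* e, int n, double xi)
{
  int    changes = 0;
  double pm2 = 1.0;            // p_0(xi)
  double pm1 = d[0] - xi;      // p_1(xi)
  if (pm1 < 0.0) ++changes;    // sign change between p_0 (> 0) and p_1

  for (int j = 2; j <= n; ++j) {
    const double pj = (d[j - 1] - xi) * pm1 - e[j - 2] * e[j - 2] * pm2;
    // a zero p_{j-1} inherits the previous sign (simplifying convention)
    const double prev = (pm1 != 0.0) ? pm1 : pm2;
    if ((pj < 0.0 && prev > 0.0) || (pj > 0.0 && prev < 0.0)) ++changes;
    pm2 = pm1;
    pm1 = pj;
  }
  return changes;              // sigma(xi): eigenvalues strictly below xi
}
```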

Fig. 4

Multi-section algorithm for eigenvalues. The discontinuity points of function \(\sigma\) identify the eigenvalues

In order to implement the multi-section algorithm for \(N_{\textrm{multi}}\) sub-intervals, we shall use the following magnitudes (all indices start from zero):

  • Interval \(\left[ Y_{\min }, Z_{\max } \right]\) is such that it contains all the eigenvalues, and \(L:= Z_{\max } - Y_{\min }\). This interval can easily be obtained via the Gershgorin circle theorem.

  • Integer \(n \in {\mathbb {N}} \setminus \{0\}\) indexes the iteration of the multi-section algorithm.

  • Array \(\epsilon _{\nu ,p,i}^{\text{ inf }}\) of size \(N_{\textrm{valleys}} \times N_{\textrm{sbn}} \times N_{x}\) represents a left-approximation of eigenvalue \(\epsilon _{\nu ,p,i}\), in the sense that

    $$\begin{aligned} \epsilon _{\nu ,p,i} \in \left] \epsilon _{\nu ,p,i}^{\text{ inf }} , \epsilon _{\nu ,p,i}^{\text{ inf }} + \frac{L}{(N_{\textrm{multi}})^{n+1}} \right[. \end{aligned}$$
  • Array \(\sigma _{\nu ,p,i,k}\) of size \(N_{\textrm{valleys}} \times N_{\textrm{sbn}} \times N_{x} \times (N_{\textrm{multi}}-1)\) represents the number of sign changes at point

    $$\begin{aligned} \xi _{\nu ,p,i,k} := \epsilon _{\nu ,p,i}^{\text{ inf }} + (k+1) \frac{L}{(N_{\textrm{multi}})^{n+1}}. \end{aligned}$$

So, the general view of the methods is:

$$\begin{aligned} \text{init}&\left\{ \begin{array}{ll} 1 \quad & \text{Compute the Gershgorin circles } \left[ Y_{\nu ,i}, Z_{\nu ,i} \right] \text{ on the GPU}\\ 2 \quad & \text{Compute the minimum } Y_{\min } \text{ and the maximum } Z_{\max } \text{ and let } L = Z_{\max } - Y_{\min }\\ 3 \quad & \text{Initialize } \epsilon _{\nu ,p,i}^{\textrm{inf}} = Y_{\min }\\ 4 \quad & \text{Compute the number of iterations } \displaystyle n_{\textrm{iters}} := \left\lfloor \frac{\ln \left( \frac{L}{\varepsilon _{\textrm{tol}}} \right) }{\ln \left( N_{\textrm{multi}} \right) } \right\rfloor + 1\\ \end{array} \right. \nonumber \\ \text{loop}&\left\{ \begin{array}{ll} 5 \quad & \text{Loop: for } \left( n=0 ;\, n<n_{\textrm{iters}} ;\, n \leftarrow n+1 \right) \\ 6 \quad & \quad \text{Compute } \sigma _{\nu ,p,i,k} \text{ on the GPU}\\ 7 \quad & \quad \text{Update } \epsilon _{\nu ,p,i}^{\textrm{inf}} \text{ on the GPU}\\ \end{array} \right. \end{aligned}$$
(19)

The last instruction inside the loop part, i.e. instruction 7 of (19), requires a reduction, as we need to compute

$$\begin{aligned} {\tilde{k}} := \max \left\{ k \in \{-1,...,N_{\textrm{multi}}-2\} \text{ such } \text{ that } \sigma _{\nu ,p,i,k} \le p \right\} \end{aligned}$$

to finally update

$$\begin{aligned} \epsilon _{\nu ,p,i}^{\text{ inf }} \longleftarrow \epsilon _{\nu ,p,i}^{\text{ inf }} + \left( {\tilde{k}}+1\right) \frac{L}{(N_{\textrm{multi}})^{n+1}}. \end{aligned}$$
(20)

4.1.1 Implementation details

We use multi-section with 32 intermediate points, i.e. with \(N_{\textrm{multi}} = 33\). This choice lets each warp take care of updating one value of \(\epsilon _{\nu ,p,i}\). As \(N_{\textrm{sbn}} = 6\), it seems reasonable to use either 1, 2, 3 or 6 warps per block, so as to load only one matrix \({\mathcal {L}}_{\nu ,i}\) per block. Blocks, therefore, will be of size \(\{32, 64, 96, 192\}\). Let \(N_\textrm{w}\) be the number of warps per block; the block then has size \(32 \times N_\textrm{w}\). As the dimensions are ordered \(i> \nu > p\), the \(32 \times N_\textrm{w}\) threads will take care of computing (for fixed \((\nu ,i)\))

$$\begin{aligned} \underbrace{\left\{ \sigma _{\nu ,p,i,k} \right\} _{k=0}^{31}}_{0}, \quad \underbrace{\left\{ \sigma _{\nu ,p+1,i,k} \right\} _{k=0}^{31}}_{1}, \quad ..., \quad \underbrace{\left\{ \sigma _{\nu ,p+N_{\textrm{w}}-1,i,k} \right\} _{k=0}^{31}}_{N_{\textrm{w}}-1}. \end{aligned}$$

By using a device of Compute Capability (CC) 3.0 or higher, we can exploit warp shuffle functions to perform the reduction (19)-7 at warp level. In particular, we use __shfl_xor_sync to compute the maximum \({\tilde{k}}\) of a vector \(\varSigma _{\nu ,p,i,\cdot }\) stored in shared memory and containing

$$\begin{aligned} \varSigma _{\nu ,p,i,k} = \left\{ \begin{array}{ll} k & \text{ if } \sigma _{\nu ,p,i,k} \le p\\ -1 & \text{ otherwise } \end{array}\right. \end{aligned}$$

in such a way that we can update \(\epsilon _{\nu ,p,i}^{\text{ inf }}\) following (20).
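
For illustration, the warp-level computation of \({\tilde{k}}\) may be written along the following lines; the function and argument names are placeholders.

```cuda
// Sketch: each of the 32 lanes of a warp holds Sigma_{nu,p,i,k} for its
// own k (= k if sigma_{nu,p,i,k} <= p, -1 otherwise); the shuffle-xor
// reduction leaves k~ = max over the warp in every lane.
__device__ int warp_max_ktilde(int sigma_k, int k, int p)
{
  int val = (sigma_k <= p) ? k : -1;          // Sigma_{nu,p,i,k}
  for (int mask = 16; mask > 0; mask >>= 1)
    val = max(val, __shfl_xor_sync(0xffffffffu, val, mask));
  return val;                                  // k~ (>= -1), same in all lanes
}
```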

In order to perform a coalesced reading of matrix \({\mathcal {L}}_{\nu ,i}\) from global memory, whose entries are used several times by each thread, we use shared memory. Matrix \({\mathcal {L}}\) is stored as described in Fig. 5, so that each block loads SCHROED_MATRIX_ROW elements, i.e. 128 doubles with our standard parameters, out of which only 125 are really useful and 3 are just used for padding with zeros.

Fig. 5

Schrödinger matrices. Storage format of matrices \({\mathcal {L}}\)

4.2 Newton-Raphson iterative method for eigenvalues

The Newton-Raphson algorithm can also be found in the classical book [36]. In our implementation the iteration is controlled by the CPU, and each call to a kernel updates the guess for the eigenvalues. We use one CUDA thread per eigenvalue. The implementation does not need any sophisticated technique; therefore, we do not give further details here.
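
For completeness, a minimal sketch of one Newton-Raphson update is given below; it evaluates \(p_{\textrm{n}}\) through (17) and its derivative through the recursion obtained by differentiating (17), and omits the safeguards (bracketing, rescaling against overflow) that a robust implementation would include.

```cuda
// Sketch: one Newton-Raphson update of an eigenvalue guess X.
// p_n and its derivative are evaluated together from (17):
//   p_j' = -p_{j-1} + (d_{j-1}-X) p_{j-1}' - e_{j-2}^2 p_{j-2}',  p_0'=0, p_1'=-1.
__device__ double newton_eigenvalue_step(const double* d, const double* e,
                                         int n, double X)
{
  double p0  = 1.0,  p1  = d[0] - X;   // p_0, p_1
  double dp0 = 0.0,  dp1 = -1.0;       // p_0', p_1'
  for (int j = 2; j <= n; ++j) {
    const double c  = d[j - 1] - X;
    const double e2 = e[j - 2] * e[j - 2];
    const double p2  = c * p1 - e2 * p0;
    const double dp2 = -p1 + c * dp1 - e2 * dp0;
    p0 = p1;   p1 = p2;
    dp0 = dp1; dp1 = dp2;
  }
  return X - p1 / dp1;                 // refined guess for the eigenvalue
}
```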

4.3 Inverse Power Iterative Method (IPIM) for the approximation of the eigenvectors

Once the eigenvalues have been computed, we turn to the corresponding eigenvectors. For this, we have used the IPIM (also known as the inverse iteration algorithm) [16] to approximate the eigenvector \(\psi _{\nu ,p,i}\) of \({\mathcal {L}}_{\nu ,i}\) using the previously obtained eigenvalue \(\epsilon _{\nu ,p,i}\). The algorithm is described in Algorithm 2.

Algorithm 2 Inverse Power Iterative Method (IPIM) for the approximation of the eigenvectors

We have to approximate \(N_{\textrm{valleys}} \times N_{\textrm{sbn}} \times N_{x}\) eigenvectors (with \(N_z\) elements) using this algorithm. We have used a different CUDA thread \({\mathcal {T}}_{\nu ,p,i}\) to approximate the eigenvector \(\psi _{\nu ,p,i}\). Each thread solves locally the tridiagonal linear system using the Thomas algorithm [40]. Since each tridiagonal coefficient matrix \({\mathcal {L}}_{\nu ,i}\) is symmetric, it is represented using two vectors with \(N_z-2\) double precision elements. Several CUDA threads work with the same coefficient matrix (threads \({\mathcal {T}}_{\nu ,p,i}\) with \(p \in \{0,\ldots ,5\}\), \(\nu =\gamma\) and \(i=\delta\) use the matrix \({\mathcal {L}}_{\gamma ,\delta }\)).
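
A minimal per-thread sketch of the Thomas solve for the shifted system \(({\mathcal {L}}_{\nu ,i} - \epsilon _{\nu ,p,i} I)\,y = x\) used inside the inverse iteration is given below; the array names and the per-thread work arrays are assumptions, and the two kernel variants described next differ only in where the matrix vectors reside.

```cuda
// Sketch: Thomas algorithm for the symmetric tridiagonal system
// (L - shift*I) y = x solved by a single thread inside the inverse
// iteration. d (size n) and e (size n-1) store the diagonal and the
// off-diagonal of L; c and y are per-thread work/output arrays.
__device__ void thomas_solve(const double* d, const double* e, double shift,
                             const double* x, double* c, double* y, int n)
{
  // forward elimination
  c[0] = e[0] / (d[0] - shift);
  y[0] = x[0] / (d[0] - shift);
  for (int j = 1; j < n; ++j) {
    const double denom = (d[j] - shift) - e[j - 1] * c[j - 1];
    if (j < n - 1) c[j] = e[j] / denom;
    y[j] = (x[j] - e[j - 1] * y[j - 1]) / denom;
  }
  // back substitution
  for (int j = n - 2; j >= 0; --j)
    y[j] -= c[j] * y[j + 1];
}
```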

We have developed two different CUDA kernels to implement this algorithm on GPU:

  • Kernel A, where these vectors are read from global memory for each value of k in Algorithm 2, and

  • Kernel B, which stores these vectors in shared memory. In this version, all the vectors needed by the threads in a CUDA block are loaded from global memory in a coalesced way.

In both cases, we use one-dimensional CUDA blocks with 32 threads to avoid excessive register spilling in the multiprocessors.

Table 1 shows the average runtime (measured in seconds) spent by both CUDA kernels in a time step for several values of \(N_z\), using grid \(N_{x} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\) (\(N_{dim}\) (\(dim\in \{x,z,w,\phi \}\)) is the number of discretization points for dimension dim in the grid), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V. We can see that both kernels lead to very similar execution times. However, since kernel A achieves better times in all cases except for \(N_z=129\), we have opted for this version.

Table 1 Average runtimes (seconds) spent by both CUDA kernels implementing IPIM scheme for one time step

5 Numerical results

We have analyzed the performance and accuracy of the parallel solver, focusing on the GPU implementation of the Schrödinger-Poisson block (herein called the iter phase).

5.1 Description of the platform and solvers

The numerical experiments have been performed on a computing server with dual Intel Xeon Silver 4210 CPUs (in total, 20 physical cores with a base frequency of 2.2 GHz and 40 logical processors), 96 GB of RAM and a 4 TB solid-state drive. The system includes an NVIDIA Tesla V100 GPU (5120 CUDA cores, 7 TFLOPS of double-precision peak performance and 32 GB of HBM2 memory) with CUDA Compute Capability (CC) 7.0 and an NVIDIA GeForce RTX 3090 GPU (5248 cores, 556 GFLOPS of double-precision peak performance and 24 GB of GDDR6X memory) with CUDA CC 8.6. The operating system is Linux Debian 10.9 with GCC version 10.2.1 and the CUDA 11.2 runtime.

We have developed two implementations of the solver:

  • OpenMP solver: This solver only exploits the cores of the CPUs in the platform by using OpenMP directives and functions (see [38] for additional details). In the experiments, this solver is run using 40 threads (two per physical core). To compile the OpenMP solver, we have used the GNU compiler g++ version 10.2.1 using the switches -fopenmp -O3 -m64 -use_fast_math.

  • CUDA solver: This heterogeneous code performs all the relevant computing phases on one of the available GPUs (Tesla V100 or RTX 3090) under the control of a CPU thread which invokes the corresponding CUDA kernels. In the compilation with nvcc, we have used the switches -O3 -m64 -use_fast_math and the options necessary to generate PTX code and object code optimized to the particular GPU architecture.

In the OpenMP solver, we use exactly the same numerical methods as in the CUDA solver.

5.2 Experimental validation of convergence

The convergence of the Boltzmann-Schrödinger-Poisson solver has been experimentally validated by studying the results obtained with different grids at \(t=0.1\) picoseconds using a CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V. In order to avoid excessive complexity, two macroscopic magnitudes that capture characteristics of the solution at a time point are analyzed: the total current density j(x) and the total surface density \(\varrho (x)\). These magnitudes are computed as follows:

$$\begin{aligned} j(x) = 2 \sum _{\nu ,p} \int _{w'=0}^{+\infty } \int _{\phi '=0}^{2\pi } a^1_{\nu }(w',\phi ') \varPhi _{\nu ,p}(w',\phi ') \, \textrm{d}\phi ' \, \textrm{d}w', \qquad \qquad \varrho (x) = 2 \sum _{\nu ,p} \varrho _{\nu ,p}(x). \end{aligned}$$

As reference solutions for these magnitudes, we use the numerical results obtained by the solver on a very fine grid, given by \(N_x=129\), \(N_z=129\), \(N_w=600\) and \(N_{\phi }=96\), where \(N_{dim}\) (\(dim\in \{x,z,w,\phi \}\)) is the number of discretization points for dimension dim of the grid.

For each magnitude, the reference solution is compared with the numerical solutions obtained on several coarser grids, which have fewer discretization points in all dimensions. Figure 6 shows how the numerical solutions of both quantities vary as the number of points in all grid dimensions increases. It is evident that, as finer grids are used, the computed solution approaches the reference solution for both quantities.

Fig. 6

Convergence to the reference solution (\(129\times 129\times 600 \times 96\)). Numerical solutions at \(t=0.1\) ps for the total surface density and the total current density obtained with several grids

5.3 General view

In Fig. 7, we draw the average runtime cost for one time step of both computational phases (BTE and iter) and also show the speedup obtained with both solvers (for the full simulation of one time step) with respect to the sequential version (for only one thread) of the OpenMP solver. These results have been obtained by averaging the execution time of 10 time steps using grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V.

For the OpenMP solver, the bottleneck is the integration of the Boltzmann Transport Equations (BTE phase). The port to GPU of this phase has already been described in [27], where the iter phase was solved on the multiprocessor host platform by using OpenMP. In the following, we shall analyze the impact of the CUDA port of this phase.

Table 2 shows the speedup obtained with both solvers with respect to the sequential version in the main computing phases (BTE and iter). We can observe that the speedup obtained on the Tesla V100 GPU in the BTE phase is significantly higher than the one obtained on the RTX 3090 GPU (416.4 on the Tesla V100 and 57.5 on the RTX 3090). Conversely, for the iter phase, the CUDA solver achieves closer speedups on the two GPUs (129.6 on the Tesla V100 and 93.2 on the RTX 3090). We claim that this is because the BTE phase is much more intense in double-precision arithmetic than the iter phase and exhibits a higher degree of data parallelism (see Sect. 5.4.1).

Fig. 7

Phases. Comparison of the cost of both computational phases between the OpenMP solver and the CUDA solver. The speedup is obtained for the full simulation of one time step using \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Table 2 Speedup obtained in the main computing phases with a typical grid (\(65\times 65\times 300 \times 48\)), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

5.4 The iter phase

In Fig. 8, we sketch the cost of each computational section inside the iter phase and show the speedup obtained with respect to the sequential version of the OpenMP solver.

The dominant part in all cases is the solution of the linear system (8) (section iter.solvelinsys). The evaluation of the directional derivative (10) (section iter.Frechet) is the second costliest section in the OpenMP solver, but not in the CUDA solver, because it scales better than the implementations of the other sections. In the CUDA solver, the computation of the eigenstates (2) (section iter.eigen) also plays a dominant role in the runtime. This section does not yield large runtime improvements on GPU because it does not exhibit a high arithmetic intensity: the CUDA kernels for this section spend a long time accessing global memory and a short time computing with those data (see the information about the kernels cuda_tridiag_Thomas and cuda_eigenvalues_NR in Table 5). Finally, the construction of the linear system (section iter.constrlinsys) is clearly the least expensive part of the iter phase.

Figure 8 also shows that the runtimes obtained on both GPUs are similar in all the sections of the iter phase, except for the evaluation of the directional derivative where the Tesla V100 GPU achieves considerably shorter runtimes.

Fig. 8

Iter. Comparison of the cost of the computational sections inside the iter phase between OpenMP and a full GPU execution, using the grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

In Fig. 9, we sketch the speedups achieved by the CUDA solver (on both GPUs) and the OpenMP solver with respect to a sequential version of the OpenMP solver for each of these four main sections (inside the iter phase). Table 3 lists the data plotted in Fig. 9. These data confirm that the CUDA kernel for the evaluation of the directional derivative efficiently exploits the double-precision computational power of the Tesla V100 GPU. For the other sections, the exploitation of the Tesla V100 power is not as efficient because of their much lower double-precision arithmetic intensity.

Fig. 9

Speedups. Speedup of the main sections inside the iter phase with respect to a sequential version of the OpenMP solver, using the grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Table 3 Speedups for the main sections inside the iter phase with a typical grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

5.4.1 Behavior of the CUDA kernels

Table 4 shows the average runtime (measured in microseconds) spent by the main CUDA kernels in the iter phase per time step.

Table 4 Total runtimes (microseconds) spent by the main CUDA kernels for one time step, using the grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

In more detail, we analyze the behavior of six kernels, which play a role in four computational sections:

  • the phase computing the eigenstates (eigenvalues and eigenvectors) of the Schrödinger matrices, labeled iter.eigen, involves the CUDA kernels

    • cuda_eigenvalues_NR for the computations detailed in Sect. 4.2;

    • cuda_tridiag_Thomas, implementing Algorithm 2;

  • the phase computing the directional derivative, labeled iter.dirderiv, involves the CUDA kernel

    • cuda_compute_frechet for the computation of (10);

  • the phase constructing the linear system, labeled iter.constrlinsys, involves the CUDA kernel

    • cuda_constrlinsys for the computation of (11)-(12);

  • the phase solving the linear system, labeled iter.solvelinsys, involves the CUDA kernels:

    • cuda_matvec_product for the computation of (14);

    • cuda_update_x for the computation of (15)-(16).

Additionally, a comparison of the most relevant CUDA kernels in the solver has been made, taking into account the throughput achieved in the CUDA multiprocessors and in the memory accesses. For this purpose, we have used the NVIDIA Nsight Compute tools [28] to collect data about the following metrics:

  1. gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed: measures the throughput of internal activity within caches and DRAM (as a percentage of the peak throughput).

  2. sm__throughput.avg.pct_of_peak_sustained_elapsed: measures the multiprocessor throughput assuming ideal load balancing across the multiprocessors of the GPUs (as a percentage of the peak throughput).

Table 5 shows the values obtained for these metrics in the most relevant CUDA kernels of the phases iter and BTE. Table 6 shows the averaged values (taking into account the runtime of each CUDA kernel) of these metrics for each phase (BTE and iter) and GPU (RTX-3090 and Tesla V100). We can see that, for the Tesla V100 GPU, while the memory throughput (metric 1) is similar for both phases, the multiprocessor throughput (metric 2) is considerably higher for the BTE phase than for the iter phase. This shows that the BTE phase performs a much higher number of arithmetic operations in double precision per data read than the iter phase.

Table 5 Metrics provided by Nsight profiler. We have used: (1) = gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed (2) = sm__throughput.avg.pct_of_peak_sustained_elapsed
Table 6 Averaged values for the metrics (1) and (2) at each phase (BTE and iter)

5.5 Scaling with different grids

In this subsection, we analyze how changing the number of discretization points along a particular dimension affects the runtime performance of the solvers. The main goal is to determine the role played by the different dimensions. Obviously, some dimensions affect the performance more than others, because the algorithmic complexity of most of the numerical schemes depends strongly on them.

In Figs. 10, 11, 12 and 13, we double the points along dimensions x, z, w and \(\phi\) and observe how this modifies the speedup of the iter phase. We observe that the iter phase does not really depend on w and depends only weakly on \(\phi\): the speedups obtained are very similar to the ones in Fig. 8. The same applies to x, which acts as a parameter for this computational phase.

On the contrary, we do observe a larger speedup when we add points along the z-dimension: in Fig. 11, a more significant speedup is obtained, because the GPU multiprocessors are better exploited by feeding them with a larger amount of computations. The number of discretization points in the z-dimension affects the computational cost more strongly because increasing it directly increases the cost of the numerical methods related to confinement.

Fig. 10

Scaling with different grids. Doubling the points along x-dimension. Grid \(N_{x} = 129\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Fig. 11

Scaling with different grids. Doubling the points along z-dimension. Grid \(N_{x} = 65\), \(N_{z} = 129\), \(N_{w} = 300\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Fig. 12

Scaling with different grids. Doubling the points along w-dimension. Grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 600\), \(N_{\phi } = 48\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Fig. 13

Scaling with different grids. Doubling the points along \(\phi\)-dimension. Grid \(N_{x} = 65\), \(N_{z} = 65\), \(N_{w} = 300\), \(N_{\phi } = 96\), CFL condition 0.6, a source-drain voltage of 0.1 V and a source-gate voltage of 0.5 V

Table 7 shows the speedup obtained in the iter phase by both the OpenMP solver (using different number of threads) and the CUDA solver (on both GPUs) with respect to a sequential version of the OpenMP solver (for only one thread) when using different grids for the simulation of one time step. In this table, we can see how the speedup increases as the number of points in the grid increases, which shows the trends of scalability for the different solvers.

Table 7 Speedup obtained in the iter phase with different grids

6 Conclusions and perspectives

In this work, we have described a simulator of nanoscale DG MOSFETs which solves the Boltzmann-Schrödinger-Poisson system performing all the computing phases on an NVIDIA GPU. All the computing phases of the simulator can now be fully performed on the GPU and show good performance and reasonable computational times, taking into account the huge computational cost of this deterministic solver.

The port to GPU of the iterative section, solving the Schrödinger-Poisson block, has required adapting to GPU many techniques and methods such as the Scheduled Relaxation Jacobi method, the multi-section algorithm and the inverse power iteration.

This CUDA implementation of the Schrödinger-Poisson block provides satisfactory results, as it significantly reduces the execution times obtained on a modern dual-processor server with 40 logical cores. As a result, we obtain a speedup of one order of magnitude with the full GPU version on a Tesla V100 GPU, and a very close speedup is also obtained on an RTX 3090 GPU, which is much less powerful for double-precision computing.

Regarding future extensions of this exploratory research, several topics can be explored. Firstly, it would be of interest to test the techniques described here in other kinds of solvers, in particular in a macroscopic solver, which is a goal of great interest for the semiconductor industry as it could provide significant improvements for commercial TCAD simulators. Secondly, to the best of our knowledge, no Monte Carlo solver for the Boltzmann-Schrödinger-Poisson system has been ported to GPU so far; it would be interesting to see how the performance of such numerical methods improves. Thirdly, on a broader scale, we are working on improving the description of the MOSFET device at the physical level, for example by introducing other scattering phenomena into the collisional operator, in particular surface roughness and the Coulomb interaction. Additionally, devices composed of different materials and heterostructures can be simulated; moreover, when the semiconductor device must be simulated in the 3D physical space, the high number of points of the resulting mesh suggests deriving an implementation for multiple GPUs.