1 Introduction

Silicon device architectures have developed rapidly during the last few decades, the FinFET being the current standard for mass production. However, IRDS predictions [1] point toward a change in the leading architectures for the next technology nodes, with promising candidates such as the GAA NW FET and the GAA nanosheet. This shift in the industry standard calls for new simulation tools able to predict the performance and reliability of the new device architectures. At current transistor dimensions, the traditional trial-and-error process of developing and optimizing devices during the manufacturing stage is unviable. Consequently, Technology Computer Aided Design (TCAD) has become an indispensable tool both to predict future device architectures and to optimize the present ones [2,3,4,5,6].

Traditionally, when devices were in the micron scale, solving classical models, such as the drift-diffusion (DD) method, was the most popular choice to predict device behavior. Currently, device dimensions have been scaled down to the nanometer regime, which requires the use of more complicated and more time-consuming simulation methods. Some of these are the particle-based semiclassical Monte Carlo (MC) [7, 8] or the purely quantum non-equilibrium Green's function (NEGF) [9,10,11]. A common alternative, as it is a good compromise between simulation time and accurate results, is to include quantum corrections in classical models, such as those based on the solution of the density gradient (DG) [12,13,14] or the Schrödinger (SCH) [15,16,17,18] equations.

As billions of transistors coexist in a die, hundreds or thousands of devices need to be simulated to obtain realistic results in variability studies [4, 19,20,21], making this a very computationally demanding job regardless of the simulation method. Therefore, the optimization and parallelization of simulation frameworks play a key role in current device design, enabling us to obtain valid physical results at a fraction of the computational cost.

In this paper, we propose and evaluate a resolution scheme for SCH equation-based corrections compatible with the highly parallel DD model described in [22], which operates on 3D finite element (FE) meshes. By implementing these corrections in a scalable manner, we are able to obtain accurate results without penalizing the simulation time. In the literature, several simulation toolboxes also implement the resolution of the SCH equation to include quantum corrections in classical simulators, following a similar methodology; these simulators solve the SCH equation either in 2D [18] or in 1D [23]. The main advantage of our proposal is that the 3D unstructured mesh can be divided into separate domains and executed in parallel with MPI on distributed systems, decreasing simulation times considerably. As a benchmark device, we have used a 12-nm gate-all-around nanowire FET, one of the main contenders to replace FinFETs as the preferred device architecture for mass production [24,25,26,27]. The rest of this paper is organized as follows. In Sect. 2, we describe the simulation software and the benchmark device used in this study. In Sect. 3, we present the 2D SCH scheme and the implemented parallelization. In Sect. 4, different simulation results and an efficiency discussion are presented. Finally, in Sect. 5 we draw the main conclusions.

2 Simulation framework and benchmark device

VENDES [22] is a semiconductor simulation framework which includes several tools for modeling complex ultrascaled semiconductor devices such as FinFETs [8], GAA NW FETs [25] or vertically stacked nanosheets [28].

Fig. 1

VENDES flowchart. It shows the different steps needed to run a simulation, from the geometry input and FE mesh generation to the calculations performed to solve the Poisson equation and the transport models

An overall view of the VENDES framework workflow and its capabilities is shown in Fig. 1. The first step in the simulation process consists of generating a 3D finite element mesh made of tetrahedra that represents the structure of the device. The modeling of the device has been done with the open-source software Gmsh [29]. The geometry is constructed using parametric design according to the device dimensions (see Fig. 2a). Then, using a Delaunay triangulation technique, the 3D FE mesh is generated and optimized as shown in Fig. 2b. Every node of the 3D FE mesh is given a set of physical properties needed for the simulation, such as material permittivities, doping profiles or carrier mobilities. Note that an FE scheme has been chosen, as opposed to the finite difference method, since it is able to describe complex 3D irregular geometries. At this point, the user can choose to apply to the device several built-in variability models that either modify the properties of the nodes, such as random discrete dopants and metal grain granularity, or modify the geometry of the device, like line edge roughness or gate edge roughness.
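To illustrate this first step, the following sketch uses the Gmsh Python API to build a simplified cylindrical nanowire surrounded by an oxide shell and to generate a 3D tetrahedral mesh with the Delaunay algorithm. The geometry, dimensions, tags and file name are illustrative placeholders, not the actual device input used in this work.

```python
# Minimal Gmsh sketch (illustrative, not the actual device input): a simplified
# nanowire-like geometry meshed with 3D tetrahedra using the Delaunay algorithm.
import gmsh

gmsh.initialize()
gmsh.model.add("gaa_nw_sketch")

# Hypothetical dimensions in arbitrary length units (interpreted here as nm).
L, r, tox = 32.0, 3.0, 1.0          # wire length, wire radius, oxide thickness

# Silicon channel: a cylinder along the x (transport) direction.
si = gmsh.model.occ.addCylinder(0, 0, 0, L, 0, 0, r)
# Surrounding oxide shell: a larger cylinder minus the silicon core.
# (A real input deck would fragment the volumes to get a conformal interface.)
outer = gmsh.model.occ.addCylinder(0, 0, 0, L, 0, 0, r + tox)
oxide, _ = gmsh.model.occ.cut([(3, outer)], [(3, si)], removeTool=False)
gmsh.model.occ.synchronize()

# Physical groups so that every mesh node can later inherit material properties.
gmsh.model.addPhysicalGroup(3, [si], tag=1)
gmsh.model.setPhysicalName(3, 1, "silicon")
gmsh.model.addPhysicalGroup(3, [t for (_, t) in oxide], tag=2)
gmsh.model.setPhysicalName(3, 2, "oxide")

# Delaunay-based 3D meshing with a maximum element size.
gmsh.option.setNumber("Mesh.Algorithm3D", 1)     # 1 = Delaunay
gmsh.option.setNumber("Mesh.MeshSizeMax", 1.0)
gmsh.model.mesh.generate(3)
gmsh.write("gaa_nw_sketch.msh")
gmsh.finalize()
```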

Fig. 2

The first step in the device modeling process is to design the geometry in Gmsh [29] (a) and generate the FE mesh (b). Using the spatial coordinates of the FE nodes, 2D slices are extracted using user-fixed criteria to obtain a 2D mesh (c). One of the FE 2D slices is presented in (d)

The device that has been used in this work is a state-of-the-art 12-nm Si gate-all-around (GAA) nanowire (NW) FET based on experimental data from [30]. The main dimensions and doping profile characteristics are given in Table 1. The device channel has been uniformly doped, whereas the source/drain (S/D) regions have a Gaussian doping. These Gaussian doping profiles, reverse-engineered from experimental data [7], are characterized by the S/D doping lateral straggle (\(\sigma _{x}\)), which describes the slope of the Gaussian profile, and the S/D doping lateral peak (\(x_{\mathrm {p}}\)), which indicates the position where the Gaussian decay starts, measured from the middle of the channel (see Table 1).

Table 1 Device dimensions for the 12-nm-gate-length GAA NW FET

After the device geometry and properties have been defined, at the beginning of every simulation an initial solution is calculated using only the Poisson equation at equilibrium, at \({\mathrm {V}_{\mathrm {G}}}=0.0\) V and \({\mathrm {V}_{\mathrm {D}}}=0.0\) V. The output of this initial routine is a purely classical electrostatic potential. However, a solution that seeks a compromise between sound results and short simulation times is to include quantum corrections [18, 31, 32]. Some of the most commonly used quantum corrections are the DG corrections, which require calibration against a quantum mechanical simulation as explained in [12], and the SCH corrections, based on solving the FE 2D SCH equation, which do not demand any fitting parameters.

The calibration of the \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) characteristics is shown in Fig. 3, comparing the classical DD simulations against SCH quantum-corrected DD and MC simulations. Note that the inclusion of quantum corrections is essential to properly characterize the device behavior. The quantum-corrected DD simulation matches the MC+SCH results in the ON region remarkably well, thanks to the fitting of the saturation velocity and the mobility. Finally, it is worth mentioning that at very low gate biases the MC+SCH results are too noisy to obtain the correct off-current.

In VENDES, the simulation of the transport inside the channel of the device can be done with two different simulation schemes. First, the DD approach is a widely used scheme to calculate carrier transport in semiconductor device simulations [12, 33, 34]. This simulation method describes the relation between the electrostatic potential and the density of carriers in the device. Second, the ensemble MC technique solves the Boltzmann transport equation (BTE) using the charge distribution throughout the device to calculate both the electrostatic potential and the electric field. A more detailed explanation of the implementation and solution of the DD model can be found in [12], and for the ensemble MC in [7].

Fig. 3

\({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) characteristics comparing 3D drift-diffusion (DD) simulations against: i) SCH quantum-corrected drift-diffusion (DD+SCH) and ii) Monte Carlo quantum-corrected simulations (MC+SCH)

3 Parallel Schrödinger-based quantum corrections

3.1 2D Schrödinger workflow

After the first Poisson iteration has been completed and an initial solution for the potential has been found, the SCH quantum corrections routine can be selected as shown in Fig. 4. Even though the device mesh is three-dimensional, the SCH algorithm is based on solving the 2D time-independent form of the equation to ensure a good compromise between accuracy and speed: by removing one dimension from the SCH equation, the complexity and resolution time drastically decrease. The 2D SCH equation is solved in YZ planes perpendicular to the transport direction. To create these planes, several X coordinates along the transport direction are defined. Since a 3D unstructured FE mesh is used in this study, the GAA NW FET has been designed in Gmsh by extruding the mesh specifically at the x-coordinates where the 2D SCH equation will be solved. Depending on the device and the concentration profiles, the number of 2D slices can vary in order to properly capture the quantum confinement. This method makes it possible to solve the SCH equation in the 2D slice nodes and interpolate the results onto the 3D nodes placed in between the 2D slices at a fraction of the time needed to solve the 3D SCH equation.

Then, all 3D nodes with the same x-coordinate are grouped into a single 2D slice. Figure 2c shows the full 2D mesh of a GAA NW FET containing 41 cuts transverse to the transport direction. Figure 2d shows the node structure of a single slice, i.e., the 2D triangular mesh where the SCH equation is solved.
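A minimal sketch of this grouping step, assuming the node coordinates are available as a NumPy array; the array names and the tolerance are illustrative, not the actual VENDES data structures:

```python
import numpy as np

def extract_slices(coords, x_cuts, tol=1e-6):
    """Group 3D mesh nodes into 2D slices by their x (transport) coordinate.

    coords : (N, 3) array of node positions (x, y, z)
    x_cuts : user-defined cut positions along the transport direction
    Returns a dict mapping each cut position to the indices of its nodes.
    """
    slices = {}
    for xc in x_cuts:
        # The mesh is extruded exactly at these coordinates, so nodes are
        # matched to their cut within a small tolerance.
        slices[xc] = np.where(np.abs(coords[:, 0] - xc) < tol)[0]
    return slices

# Illustrative usage with random nodes and 41 equally spaced cuts.
slices = extract_slices(np.random.rand(1000, 3), np.linspace(0.0, 1.0, 41))
```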

Fig. 4

SCH subroutine flowchart. In this scheme, the different processes of the SCH methodology are presented. First, the 3D information from the FE mesh is adapted to a 2D mesh for less complex computations. After the solution has been obtained, the results are used to interpolate the quantum corrections to the nodes in the FE mesh

Prior to the resolution of the 2D time-independent SCH equation, it is necessary to generate the mass and stiffness matrices. This is done using the Galerkin method [35, 36]. However, since the SCH equation is solved repeatedly throughout the simulation, the mass lumping technique was implemented to diagonalize the mass matrix, which makes the numerical resolution faster [35, 37, 38]. Once the FEM system has been generated, the eigenproblem is solved using EB13 [39], a library routine based on Arnoldi's iterative resolution method. The results of the generalized eigenproblem are both the eigenstates of the 2D SCH equation \(\psi _{i}(y,z)\) and the eigenenergies \(E_{i}\). The eigenstates are obtained at every node of the 2D slice and, together with the eigenenergies, are used to calculate the quantum mechanical electron density \(n_{sc}(y,z)\) following the Boltzmann approximation:

$$\begin{aligned} n_{sc}(y,z)=g \frac{\sqrt{2\pi m^{*} k_{B}T}}{\pi \hbar } \sum _{i}\vert \psi _{i}(y,z)\vert ^{2} \exp \left[ \frac{E_{F_{n}}-E_{i}}{k_{B}T}\right] \end{aligned}$$
(1)

where g is the degeneracy factor, \(k_{B}\) is the Boltzmann constant, T is the temperature, \(m^*\) is the electron effective transport mass and \(E_{F_{n}}\) is the quasi-Fermi level as shown in [40].
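To illustrate this step, the sketch below assembles a P1 finite element Hamiltonian with a lumped (diagonal) mass matrix on a triangulated 2D slice and solves the generalized eigenproblem with SciPy's ARPACK wrapper, used here only as a stand-in for the EB13 routine; the quantum density then follows Eq. (1). The mesh, potential, effective mass, degeneracy and quasi-Fermi level are illustrative assumptions, and hard-wall boundary conditions are omitted for brevity.

```python
# Sketch of the 2D FE Schrödinger step on one slice: Galerkin P1 assembly with
# a lumped (diagonal) mass matrix, an ARPACK eigensolve (stand-in for EB13) and
# the quantum density of Eq. (1). All inputs below are illustrative assumptions.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

HBAR, KB = 1.054571817e-34, 1.380649e-23      # J*s, J/K

def assemble_schrodinger(nodes, tris, V, m_eff):
    """P1 assembly of the Hamiltonian H (kinetic + lumped potential) and the
    lumped mass matrix M on a triangulated slice. nodes: (N,2) [m], V: [J]."""
    N = len(nodes)
    H = sp.lil_matrix((N, N))
    mlump = np.zeros(N)
    for tri in tris:
        p = nodes[tri]
        b = np.array([p[1,1]-p[2,1], p[2,1]-p[0,1], p[0,1]-p[1,1]])
        c = np.array([p[2,0]-p[1,0], p[0,0]-p[2,0], p[1,0]-p[0,0]])
        area = 0.5*abs(b[0]*c[1] - b[1]*c[0])
        K = (np.outer(b, b) + np.outer(c, c))/(4.0*area)   # grad-grad integrals
        for i in range(3):
            mlump[tri[i]] += area/3.0
            H[tri[i], tri[i]] += V[tri[i]]*area/3.0        # lumped potential
            for j in range(3):
                H[tri[i], tri[j]] += (HBAR**2/(2.0*m_eff))*K[i, j]
    return H.tocsr(), sp.diags(mlump).tocsc()

def quantum_density(H, M, m_eff, T=300.0, g=2.0, E_F=0.0, n_states=6):
    """Lowest eigenpairs of H psi = E M psi and the density of Eq. (1)."""
    E, psi = eigsh(H, k=n_states, M=M, which='SA')
    for i in range(n_states):                              # M-normalize states
        psi[:, i] /= np.sqrt(psi[:, i] @ (M @ psi[:, i]))
    pref = g*np.sqrt(2.0*np.pi*m_eff*KB*T)/(np.pi*HBAR)
    return pref*(psi**2 @ np.exp((E_F - E)/(KB*T)))        # n_sc per node [m^-3]

# Illustrative usage: a crude structured triangulation of a 10 nm x 10 nm slice
# with a flat potential (natural boundary conditions; hard walls omitted).
m = 15
xs = np.linspace(0.0, 10e-9, m)
X, Y = np.meshgrid(xs, xs)
nodes = np.column_stack([X.ravel(), Y.ravel()])
tris = []
for i in range(m - 1):
    for j in range(m - 1):
        k0 = i*m + j
        tris += [[k0, k0 + 1, k0 + m], [k0 + 1, k0 + m + 1, k0 + m]]
m_eff = 0.19*9.10938e-31                                   # assumed mass [kg]
H, M = assemble_schrodinger(nodes, np.array(tris), np.zeros(len(nodes)), m_eff)
n_sc = quantum_density(H, M, m_eff)
```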

Now that the quantum-corrected electron concentration has been calculated at the nodes of the 2D slices, it is used to correct the electron concentration at the nodes of the 3D mesh. This process is done using high-order Lagrange polynomials. When the value of the quantum concentration is needed at a 3D node that is not contained in a 2D slice, the position of the node is projected onto the closest slices, four in the case of cubic interpolation. Then, using the shape functions from the finite element method, the quantum concentration is calculated at the triangular element of the 2D cut where the 3D node projection lies. Once the quantum concentration is known at the node projections, the concentration profile at those coordinates can be obtained using the interpolation polynomials, and the quantum concentration at the original 3D point is calculated. A visual depiction of the 3D interpolation process is shown in Fig. 5.
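The sketch below illustrates the interpolation along the transport direction only, assuming the quantum concentration has already been evaluated at the node projections on the nearest slices via the FE shape functions; the function name and the values are illustrative:

```python
def lagrange_interp(x, x_slices, n_slices):
    """Evaluate the Lagrange polynomial built from the quantum concentration at
    the nearest slices (two for linear, three for quadratic, four for cubic) at
    the x coordinate of a 3D node.

    x        : x coordinate of the 3D node
    x_slices : x positions of the nearest 2D slices
    n_slices : n_sc evaluated at the node projection on each of those slices
    """
    result = 0.0
    for i, (xi, ni) in enumerate(zip(x_slices, n_slices)):
        li = 1.0                          # i-th Lagrange basis evaluated at x
        for j, xj in enumerate(x_slices):
            if j != i:
                li *= (x - xj)/(xi - xj)
        result += ni*li
    return result

# Cubic example: a node at x = 5.4 nm between four slices (made-up values).
n_sc = lagrange_interp(5.4, [4.0, 5.0, 6.0, 7.0], [1.0e25, 1.3e25, 1.2e25, 0.9e25])
```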

Fig. 5

Lagrange interpolation of the 3D quantum concentration. For a cubic interpolation, four slices are taken and the 2D quantum concentration in each one of them is calculated in the node projection using the FEM shape functions

The end result is a quantum-corrected potential assigned to every node of the full 3D mesh (\(V_{node} (\mathbf {r})\)) that is calculated using the following formula:

$$\begin{aligned} V_{node} (\mathbf {r}) = V_{sc}(\mathbf {r}) + V_{cl}(\mathbf {r}) = \frac{k_{B}T}{q} \log \left( n_{sc}(\mathbf {r})/n_i(\mathbf {r})\right) \end{aligned}$$
(2)

where \(n_i(\mathbf {r})\) is the effective intrinsic carrier concentration of electrons and holes, \(V_{sc}(\mathbf {r})\) is the 3D SCH quantum correction to the potential and \(V_{cl}(\mathbf {r})\) is the classical potential obtained as the solution of the Poisson equation.
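A direct transcription of Eq. (2), assuming \(n_{sc}\) has already been interpolated to the 3D nodes and the effective intrinsic concentration is known per node (array names are illustrative):

```python
import numpy as np

KB, Q = 1.380649e-23, 1.602176634e-19     # J/K, C

def corrected_potential(n_sc, n_i, T=300.0):
    """Quantum-corrected potential of Eq. (2) at every 3D node, in volts."""
    return (KB*T/Q)*np.log(n_sc/n_i)
```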

A full in-depth description of the SCH quantum corrections inclusion in the main DD scheme can be found in [13].

3.2 2D Schrödinger parallel implementation

Prior to this work, most of the VENDES routines had already been parallelized with MPI directives and could be executed in a distributed manner, speeding up the simulation process; the SCH routines were the exception. In VENDES, the Poisson and electron continuity equations are first decoupled using the Gummel method and then linearized using Newton's algorithm. The resulting linear system is solved using a domain decomposition technique [12]. The problem domain is partitioned into several subdomains that can be solved in parallel. In order to do this, the linear system is rearranged as shown for a two-domain example in Fig. 6. The whole domain is divided into internal nodes, whose solution is only stored locally in every process; internal boundary nodes, whose solution is obtained locally and then shared with adjacent processes; and external boundary nodes, whose solution is calculated locally in a different process and then shared by this process with its neighbors. The solution obtained at the internal and external boundary nodes is interchanged among adjacent domains after every iteration to ensure the consistency of the information stored in all the processes (see Fig. 6).

Fig. 6

a Example of a two-domain partitioned mesh. After the different domains have been determined, the node indexing is rearranged according to the local domain. Note that a halo of nodes is kept so that after every iteration of the solver, the results are communicated between domains for self-consistency. b Image showing a mesh divided into four domains. Note that the rough boundaries are made of the duplicated external boundary nodes between two adjacent processes

This node scheme is implemented not only as a way of distributing the problem domain; it has also been shown that placing the external boundary nodes at the end of the linear system matrices improves performance [41].
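A minimal sketch of the per-iteration boundary exchange described above, written with mpi4py as a stand-in for the MPI calls used in VENDES; the node counts, buffer names and one-dimensional neighbor topology are illustrative assumptions:

```python
# Sketch of the per-iteration exchange of boundary node values between adjacent
# domains in a 1D block partition (mpi4py stand-in for the VENDES MPI calls).
# Run with, e.g., "mpiexec -n 4 python halo_sketch.py"; with a single process
# the exchanges become no-ops thanks to MPI.PROC_NULL.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_b = 50                                   # illustrative boundary-node count
own_left = np.full(n_b, float(rank))       # internal boundary nodes (left side)
own_right = np.full(n_b, float(rank))      # internal boundary nodes (right side)
halo_left = np.empty(n_b)                  # external boundary copies from rank-1
halo_right = np.empty(n_b)                 # external boundary copies from rank+1

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# After every solver iteration: send the locally owned boundary values to each
# neighbor and receive its values into the corresponding halo buffer.
comm.Sendrecv(own_right, dest=right, recvbuf=halo_right, source=right)
comm.Sendrecv(own_left, dest=left, recvbuf=halo_left, source=left)
```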

One of the first issues that arises when parallelizing the SCH algorithm is that the SCH quantum concentration is obtained in 2D slices at several coordinates along the transport direction, some of which can be located exactly at a domain boundary and therefore be split between two adjacent processes. For a GAA NW FET with approximately 100k nodes, the SCH routine is solved 164 times in a single full IV curve simulation, and the additional synchronization after every iteration of the routine would have too much impact on the resolution performance. To avoid having a 2D slice divided between two processors, a block partitioning scheme was implemented (see Fig. 7).

Fig. 7

Block partitioning scheme for the benchmark GAA NW FET. By using this decomposition method, the 2D slices at the domain boundaries are not divided between adjacent processes and additional synchronization is avoided

After the 3D mesh is divided among the desired number of processes, the first step is to generate the 2D SCH mesh. As mentioned earlier, the user defines a set of coordinates along the x-axis where the 2D cuts will be performed. During this process, the boundary slices are duplicated so that both adjacent processes contain the slice with the external boundary nodes and do not have to share any information after every iteration of the SCH solution. After the duplication of the external boundary slice, all the 3D nodes of the mesh lie in between slices, which makes it possible to interpolate the quantum-corrected concentration after solving the SCH equation. A visual representation of the parallel resolution is shown in Fig. 8.
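A sketch of this block partitioning, assuming the 2D slices are identified by their x coordinates: contiguous blocks of slices are assigned to each process and the slice at every block boundary is duplicated into the neighboring block (illustrative code, not the VENDES partitioner):

```python
import numpy as np

def partition_slices(x_cuts, n_procs):
    """Split an ordered list of slice positions into contiguous blocks, one per
    process, duplicating every boundary slice into the preceding block so that
    no communication is needed during the SCH solution or the interpolation."""
    blocks = [list(b) for b in np.array_split(np.asarray(x_cuts), n_procs)]
    for p in range(1, n_procs):
        # The first slice of block p is also given to block p-1, so every 3D
        # node of block p-1 lies between two slices that it owns locally.
        blocks[p - 1].append(blocks[p][0])
    return blocks

# Example: 41 slice positions distributed among 4 processes.
blocks = partition_slices(np.linspace(-20.0, 20.0, 41), 4)
```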

Fig. 8

Parallel Schrödinger (SCH) scheme. a In the parallel resolution of the SCH algorithm, the boundary slice between two domains is duplicated. This was done because the MPI communication between domains after every iteration was more time-consuming than the extra computing expense of an extra slice per domain. b After the quantum density (\(n_{SC}\)) is calculated in every node of the 2D slices, a set of high-order Lagrange interpolation polynomials is generated to evaluate \(n_{SC}\) at every node of the 3D mesh

Once all the slices are distributed, each process solves all the 2D slices it contains sequentially. Next, every node of the 2D cuts is assigned a value of the quantum concentration, which is then interpolated to all the nodes of the local 3D mesh. With the coupled Poisson-SCH equation solver, using the quantum-corrected value of the concentration in the 3D mesh, the quantum-corrected potential is calculated in all the internal nodes of every domain. Then, the continuity equation is solved on the 3D mesh using the Newton method until convergence is reached.

The linear solver algorithm is based on the additive Schwarz method, which first solves the linear system locally. After every local iteration, the global linear system needs to be updated. This means that after every iteration all the processes exchange information, and the higher the number of processes, the more information has to be synchronized. Therefore, the scalability of the code depends on a compromise between the speed of the node communication network and the efficiency of the linear system solver, so that additional synchronization overheads are avoided.
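As a serial illustration of the additive Schwarz idea (not the distributed VENDES implementation), every subdomain solves its local block and the corrections are summed; here the method is applied as a preconditioner for a Krylov iteration on a small made-up system, which is a common way of employing it:

```python
# Serial illustration of the additive Schwarz idea on a small made-up SPD
# system (not the distributed VENDES solver): each subdomain solves its local
# block and the corrections are summed, here used as a Krylov preconditioner.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, cg, LinearOperator

n = 40
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
subdomains = [np.arange(0, 22), np.arange(18, n)]         # 4-node overlap

# Factorize each local block once (the "local solve" of every process).
local_lu = [splu(A[idx][:, idx].tocsc()) for idx in subdomains]

def schwarz_apply(r):
    """Additive Schwarz preconditioner: z = sum_i R_i^T A_i^{-1} R_i r. In the
    parallel code each term is computed by a different process and only the
    interface (halo) values are exchanged afterwards."""
    z = np.zeros_like(r)
    for idx, lu in zip(subdomains, local_lu):
        z[idx] += lu.solve(r[idx])
    return z

M = LinearOperator(A.shape, matvec=schwarz_apply, dtype=float)
x, info = cg(A, b, M=M)       # Krylov iteration with the Schwarz preconditioner
```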

4 Results and model efficiency

In this section, we analyze the performance of the proposed parallel SCH methodology. The first step to validate the effectiveness of this scheme is to check that it generates \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) characteristics consistent with the sequential execution. This is shown in Fig. 9, where the \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curves simulated with a number of processes ranging from 1 to 32 are almost identical.

Fig. 9

Simulated \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) characteristics for the GAA NW FET, comparing Schrödinger quantum-corrected drift-diffusion (DD+SCH) results obtained with 1, 2, 4, 8, 16 and 32 processes at \(\mathrm {V_{D,sat}}\) = 0.70 V

The small disagreement between the different output currents is due to discrepancies in the quantum density (\(\mathrm {n}_{sc}\)). Once the quantum density is obtained, it is used to calculate the quantum potential (\(\mathrm {V}_{sc}\)), which can be seen in Fig. 10. The \(\mathrm {n}_{sc}\) variable is calculated in each one of the 2D slices, and its value is then obtained at all the 3D nodes of the FE mesh of the device using a high-order Lagrange interpolation method. Depending on the number of slices per process, the interpolation can be linear, quadratic or cubic. For instance, if a process contains only two or three slices, the interpolation polynomials must be of first or second order, whereas if a process contains four or more slices, a cubic interpolation can be used. For the GAA NW FET, 40 x-coordinates, approximately every 1 nm, are defined where the 2D slices are extracted during the simulation. In the case of 32 running processes, each process usually computes up to 2 slices and therefore the interpolation can only be linear. Consequently, using more processes to decrease the simulation time induces a loss in accuracy because a lower order of interpolation has to be used.
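The rule described above can be summarized compactly (illustrative helper, not part of VENDES):

```python
def interpolation_order(n_local_slices):
    """Lagrange interpolation order available to a process holding a given
    number of 2D slices: 2 -> linear, 3 -> quadratic, 4 or more -> cubic."""
    return min(n_local_slices, 4) - 1

# With 40 slices on 32 processes most ranks hold only two slices -> order 1.
```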

Figure 11 shows the current percentage error with respect to the sequential execution at \({\mathrm {V}_{\mathrm {D,sat}}}\) = 0.70 V versus the gate voltage for different numbers of processes. The disagreement between the sequential and parallel \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) values reaches its maximum for the 32-process case, with up to a 10% difference in the extracted current at \({\mathrm {V}_{\mathrm {G}}}=\) 0.0 V. At this gate bias, the potential barrier is at its maximum and the interpolation mechanism is not able to capture the steep slopes of the potential profile. When \({\mathrm {V}_{\mathrm {G}}}\) increases, the potential barrier decreases, smoothing the potential profile and minimizing the interpolation error. Note that for high values of \({\mathrm {V}_{\mathrm {G}}}\), the percentage error of the current decreases and becomes less than 1% in the range from 0.6 to 1.0 V for all cases.


Fig. 10

Interpolated quantum potential profile (\(\mathrm {V}_{quan}\)) in the transport direction with 1, 2, 4, 8, 16 and 32 processes. The middle of the gate corresponds to the zero value in the x-axis

Fig. 11

Current percentage error with respect to the sequential case versus the gate voltage. Note that at low gate voltages a maximum error of up to 10% is produced for 32 processes, whereas as the gate voltage is increased the percentage error vanishes

When using 16 or fewer processors, the current error percentage is always lower than 2%; therefore, for the rest of this work only cases from 1 to 16 processes have been simulated. The use of more processes was not considered, as the decrease in accuracy due to the interpolation would not be negligible. Moreover, several simulations have been performed increasing the number of 2D slices while keeping the number of processors constant in order to obtain the optimum number of slices. With a low number of slices, the Schrödinger solution does not capture the quantum confinement of the device correctly. When using 40 or more slices, the same drain current results are obtained.

4.1 Simulation times

After checking the integrity and validity of the new parallel SCH routines, their performance has been tested. The simulation results obtained for the S, M and L meshes (40k, 100k and 150k nodes, respectively) can be seen in Table 2. The time measurements correspond to the execution times needed to obtain a current point at \({\mathrm {V}_{\mathrm {D}}}=0.05\) V and \({\mathrm {V}_{\mathrm {G}}}=0.0\) V. These data have been obtained using an HP ProLiant BL685c G7 @ 3.40 GHz with 64 cores and 256 GB of RAM.

Table 2 Execution time measurements of parallel VENDES using 1, 2, 4, 8 and 16 processes and the S, M and L meshes with 40k, 100k and 150k nodes, respectively, needed to obtain a current point at \({\mathrm {V}_{\mathrm {D}}}=0.05\) V and \({\mathrm {V}_{\mathrm {G}}}=0.0\) V

The measured simulation times show that the code is highly parallel, with time reductions of 97.1%, 97.9% and 98.1% with respect to the sequential times for the S, M and L meshes when using 16 processes. Similar results were found for the measurements of a full \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve, with times up to 98.2% faster using 16 processes, as shown in Fig. 12.

Fig. 12

Simulation times for a full \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve at \({\mathrm {V}_{\mathrm {D}}}=0.7\) V with a 150k-node mesh executed with 1, 2, 4, 8 and 16 processes on a parallel system

The increase in performance has two main reasons. First, the parallelization of the SCH routines, which are run several hundreds of times during a full \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve, reduces the overall simulation times. Second, the VENDES framework was already parallelized with MPI; therefore, with the proposed parallelization, the resolution of the linear systems, which is the most computationally expensive part of the main scheme, can now be executed in a distributed manner, decreasing simulation times drastically.

4.2 Efficiency of the SCH equation resolution

In order to calculate the efficiency results, the standard definition has been followed:

$$\begin{aligned} E(p) = \frac{t_{1}}{t_{p} \cdot p} \end{aligned}$$
(3)

where \(t_{1}\) is the sequential execution time, p is the number of processes and \(t_{p}\) is the execution time using p processes. The parallel efficiency results for one point of the \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve are shown in Fig. 13. These data have been obtained with and without activating the resolution of the SCH and continuity equation routines, so that the influence on the efficiency of the different routines that solve the Poisson, SCH and continuity equations can be estimated. First, if only the Poisson equation is solved to reach a solution for the electrostatic potential in equilibrium (i.e., \({\mathrm {V}_{\mathrm {D}}},{\mathrm {V}_{\mathrm {G}}}=0.0\) V), the parallel efficiency increases with the number of processors up to approximately 2.8 for the 16-process case. In this scenario, VENDES reaches super-linear efficiency values (higher than 1). As explained in [41], this behavior appears because, when the number of processors increases, the size of the local subdomains decreases, and the resulting reduction in the computational time of the linear solver outweighs the increase in communication time. If instead both the Poisson and continuity equations are solved to calculate the current flow at \({\mathrm {V}_{\mathrm {D}}},{\mathrm {V}_{\mathrm {G}}}=0.0\) V, the efficiency decreases to a maximum of 2.6, which indicates that the routines responsible for solving the continuity equation scale worse than the Poisson equation solver. Moreover, if the routines that solve the Poisson and SCH equations are used, the parallel efficiency drops to 1.9, meaning that the SCH implementation is not as efficient as the Poisson one. Finally, when the three functions that solve the Poisson, SCH and continuity equations are executed at the same time, a parallel efficiency of 2.2 is obtained. The addition of the resolution of the continuity equation increases the efficiency with respect to the case of only using Poisson and SCH, which indicates that the routines used to solve the SCH equation are the least efficient of the three simulation methods.
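For reference, Eq. (3) applied to a list of measured times can be written as follows; the numbers in the example are placeholders, not the measured VENDES data:

```python
def parallel_efficiency(t_seq, times):
    """Parallel efficiency E(p) = t_1 / (t_p * p), Eq. (3), for each p."""
    return {p: t_seq/(t_p*p) for p, t_p in times.items()}

# Hypothetical example: 1000 s sequential time and four parallel runs.
print(parallel_efficiency(1000.0, {2: 400.0, 4: 150.0, 8: 60.0, 16: 25.0}))
```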

Fig. 13

Efficiency results using a 40k node mesh considering the different types of simulations VENDES has available: only solving the Poisson equation, solving both Poisson and continuity equations, solving Poisson plus SCH equations and solving Poisson, SCH and continuity equations

We have also found that the results can vary depending on the number of subdomains the device is split into. For example, it has been seen that for four domains the boundaries between these domains are placed close to material interfaces where the physical behavior changes, such as the gate-source or gate-drain boundaries. This domain-interface coincidence forces additional process synchronizations, increasing the time needed for the linear system resolution to converge.

Moreover, when more than 8 processes are used, the efficiency begins to saturate. After every iteration of the Poisson and SCH solvers, the results at the boundary nodes of adjacent domains are synchronized. When using a large number of processes, the overhead of this communication can take a toll on the efficiency.

4.3 Simulation scalability

A similar analysis of the parallel efficiency of the code has been performed changing both the number of running processes and the number of mesh nodes to show how scalable VENDES is. The results are shown in Fig. 14. The efficiency increases with the number of processes, particularly for the higher node density meshes. The larger the simulated mesh, the more benefit is obtained from executing the code in parallel. This is why the maximum efficiency for the 40k mesh is 2.2, whereas for the 100k and 150k meshes the efficiency reaches 2.9 and 3.2, respectively. As explained before, beyond 8 processes the synchronization between processors after every iteration of the solver produces a significant overhead and starts to decrease the scalability of the code.

Fig. 14

Efficiency results using three different size meshes (40k, 100k and 150k). These data have been obtained by solving the Poisson, SCH and the continuity equations for a single point of the \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve

Finally, to assess the impact of this parallelization on the performance, we consider a typical variability study, where 300 simulations are performed. Considering that the sequential execution of a full IV curve takes close to 11 hours, obtaining the results for 300 curves would require more than 137 days of computational time. With the parallel software, since the results of one simulation can be obtained within approximately 12 minutes using 16 processes, the time it takes to simulate the 300 devices for the L mesh drops to 2.5 days.

5 Conclusions

A fully parallel implementation of 3D finite element (FE) Schrödinger (SCH) quantum corrections was developed for the physical modeling of ultrascaled semiconductor devices such as GAA NW FETs. The incorporation of the SCH equation-based quantum corrections into the simulation framework increases the overall accuracy of the results, and its execution on distributed systems keeps the simulation time within a reasonable time frame.

The new routines were tested on a state-of-the-art 12-nm Si GAA NW FET, showing an almost perfect agreement of the \({\mathrm {I}_{\mathrm {D}}}\)-\({\mathrm {V}_{\mathrm {G}}}\) curve with respect to the sequential simulation for up to 16 processes. Also, in order to reduce synchronization stalls in the current nested multilevel domain decomposition scheme, the boundary nodes have been duplicated and an additional one or two 2D SCH slices are computed per processor. This extra computation avoids a considerable overhead that would otherwise arise because, after every iteration, the SCH solution would need to be shared between adjacent processors to ensure consistency. Finally, a comparison of simulation times and code efficiency was made. The new parallel code was tested on three different meshes with 40k, 100k and 150k nodes (S, M and L, respectively). The execution times dropped by up to 98.2% for the L mesh using 16 processors with respect to the sequential case. The efficiency is maximum when the simulator runs on a distributed system with 16 processes, reaching a super-linear value of 3.2.

Overall, the new implementation allows for the rapid simulation of ultrascaled devices, a much desired characteristic in research studies such as variability analyses, where hundreds or thousands of non-ideal devices need to be simulated. For instance, a typical variability study with 300 device simulations, each one running for approximately 11 hours, may take over 137 days of computational time. With the proposed parallel software, the total elapsed time drops to 2.5 days using 16 processes for a mesh of 150k nodes.