1 Introduction

The advent of the SPICE simulator half a century ago marked a key advance in the design cycle of analogue circuits. Since then, different versions of this simulator have been developed to analyse the behaviour of analogue circuits using techniques based on modified nodal analysis. This technique relies on implicit integration schemes combined with the Newton–Raphson method to solve the circuit equations at each time step. Although such methods are numerically stable and reliable, they are computationally expensive and require long CPU times when large circuits are simulated, because the Jacobian matrix of the analogue circuit must be built and factorised multiple times at each time step. The computational workload can be reduced if explicit integration methods are used instead. However, these techniques need to limit the step size to ensure numerical stability. In the general case of a non-linear analogue system, whose equations are stiff due to a large disparity of time constants, this limitation can be very severe, and in such cases it is preferable to use implicit numerical methods. Thus, the choice of one method or the other ultimately depends on the nature of the analogue circuit to be evaluated. Different works have shown that state-space equations combined with explicit integration methods are a suitable technique to speed up transient simulations of linear analogue circuits whose equations are not stiff [13], or [12], which employs estimates of the maximum allowed step size to obtain faster simulation of mixed-nature analogue systems. However, new techniques are required to speed up transient simulations of analogue circuits of increasing complexity. In recent years, many efforts have been made to fit analogue integration methods onto parallel computer architectures.

Alongside this, the Compute Unified Device Architecture (CUDA) [17] has provided design engineers with software tools to use Graphics Processing Units (GPUs) as relatively cheap and accessible parallel computers to perform fast simulations of many types of scientific and engineering problems. In the case of analogue circuits, the literature contains several proposals to speed up simulations using GPUs, as summarised in Table 1. [6, 19] describe methods that increase simulation speed by using GPUs to perform fast device model evaluation. Other works have focused on LU factorisation matrix solvers [1, 7, 8, 15], achieving solid speedups compared with traditional parallel sparse solvers such as PARDISO [20] or KLU [2]. In [16], a sparse matrix solver intended for GPUs and many-core processors in general is proposed and tested on an Intel Xeon Phi architecture. Another sparse matrix solver, cuSolver [18], has been released by NVIDIA, although its LU factorisation is still performed on the CPU rather than on the GPU.

Table 1 Proposals to speed up simulations using GPUs

All these works have focused on the traditional implicit integration methods used in SPICE-like simulators. Regarding explicit numerical methods, an explicit integration method for constant coefficients, parallelisable over a many-core processor, has recently been proposed [3]. This method combines state-space equations with a variable-step explicit scheme based on the Adams–Bashforth integration formula to speed up the simulation of passive RC circuits. In this paper, that method is explored further. On the one hand, the fast estimate of the maximum allowed step size is applied to an RLC interconnect. This circuit poses a harder challenge on the step-size requirements than the RC interconnect considered in [3]. On the other hand, the technique is extended to operate with variable coefficients. This yields a system of nonlinear equations which must subsequently be linearised in order to be solved using the proposed integration technique. Moreover, this adds new couplings between the equations to be solved, which increases the synchronisation effort of the parallel processes being executed on the GPU. The method is then applied to a CMOS image sensor circuit described in [4], using variable voltages instead of the simpler model described in that work. The paper is organised as follows. After this introduction, Section 2 discusses the stability analysis for fixed- and variable-step Adams–Bashforth methods and describes the fast step-size estimation technique. Section 3 demonstrates the technique with the two mentioned examples of analogue passive circuits. Finally, conclusions are discussed in Section 4.

2 Stability Analysis

The state equation of a nonlinear, passive dynamic system is defined by the continuous function (1):

$$\begin{aligned} \dot{x}(t)=f\big (x(t),t\big );\quad x(0)=x_0 \end{aligned}$$
(1)

The linearised form of (1) at time point \(t_k\), \(k = 0, 1,\ldots \), is given by:

$$\begin{aligned} \dot{x}(t_k)=A_kx(t_k)+Be_x(t_k) \end{aligned}$$
(2)

where \(x\) is the vector of N state variables, \(e_x\) is a vector of excitations, \(A_k\) is the Jacobian of the linearised model at the time point \(t_k\) and B is a coefficient matrix. Explicit methods can be applied to this equation to provide a fast integration process, given that the state equation (1) describes a passive system and thus the eigenvalues of the Jacobian \(A_k\) are expected to have negative real parts. However, explicit integration methods require the step size to be limited, not only to control the accuracy of the numerical solution but, more importantly, to ensure the stability of the method itself. The maximum allowed step size is obtained from the spectral radius of \(A_k\), whose computation involves time-consuming operations such as matrix multiplications and eigenvalue calculations. This drawback can be overcome using the fast method to calculate approximate step-size bounds for stability [12]. Such approximate techniques yield step sizes which are smaller than the maximum allowed step sizes obtained from the exact values of the Jacobian's eigenvalues, but this is compensated by the speed with which these estimates are obtained.

Figure 1 shows the well-known stability plots of explicit fixed-step (h) Adams–Bashforth methods of different orders. This family of methods uses the previously computed values at time steps \(t_{k-n}\) to \(t_{k-1}\) to interpolate the value at \(t_k\) [9]. In the figure, the values of maximum \(\lambda h\) which guarantee stability are plotted in the complex plane, where \(\lambda \) denotes the eigenvalues of the Jacobian matrix A. It shows that the maximum acceptable absolute value \(|\lambda h| = 2\) is achieved for the first-order method, and that this value decreases as the order increases, representing a balance between the accuracy of higher-order methods and the stability of lower-order ones.

Fig. 1
figure 1

Stability of fixed-step Adams–Bashforth methods
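The curves of Fig. 1 can be reproduced with the standard boundary-locus technique: substituting \(z = e^{i\theta }\) into the characteristic equation of each method and solving for \(\lambda h\) traces the boundary of the stability region. The following minimal host-side sketch (plain C++, given here only as an illustration of the textbook technique, not code from the original implementation) prints the loci for the first three orders:

```cpp
// Boundary locus of the Adams-Bashforth stability regions (Fig. 1):
// set z = exp(i*theta) in the characteristic equation of each method
// and solve for lambda*h to trace the stability boundary.
#include <complex>
#include <cstdio>

int main() {
    const double PI = 3.14159265358979323846;
    for (int d = 0; d <= 360; ++d) {
        std::complex<double> z = std::polar(1.0, d * PI / 180.0);
        std::complex<double> ab1 = z - 1.0;                        // AB1
        std::complex<double> ab2 = (z * z - z) / (1.5 * z - 0.5);  // AB2
        std::complex<double> ab3 = 12.0 * (z * z * z - z * z) /
                                   (23.0 * z * z - 16.0 * z + 5.0);// AB3
        std::printf("%g %g  %g %g  %g %g\n", ab1.real(), ab1.imag(),
                    ab2.real(), ab2.imag(), ab3.real(), ab3.imag());
    }
}
```

At \(\theta = \pi \) the first-order locus gives \(\lambda h = -2\), recovering the \(|\lambda h| = 2\) limit quoted above.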

2.1 Stability of Variable-Step Methods

Stability becomes harder to achieve when variable-step integration is used. Figure 2 shows a finite-difference grid for a q-order Adams–Bashforth method, where \(t_k\) is the current time point and \(t_{k+1}\) the next one, \(x_k\) is the value of x at time step \(t_k\), \(P_q(t)\) is the interpolation polynomial of order q, and \(\Delta x\) is the unknown in the integration problem.

Fig. 2
figure 2

Finite difference grid for the qth-order Adams–Bashforth method

The time step between two consecutive time points \(t_k\) and \(t_{k+1}\) is defined as \(h_k=t_{k+1}-t_k\). In a variable-step method, its value changes with time, so, in order to obtain the general expressions for the variable-step method, the divided-difference polynomial approximation between the current variable value \(x_k \equiv x(t_k)\) and the predicted one \(x_{k+1} \equiv x(t_{k+1})\) must be integrated as follows [9]:

$$\begin{aligned} \begin{aligned} I=&\int _{x_{k}}^{x_{k+1}}dx =\int _{t_k}^{t_{k+1}}P_q(t)\,dt\Rightarrow \Delta x = x_{k+1}-x_{k} \\ =&\int _{t_k}^{t_{k+1}} \Big (f_k+(t-t_k)f_k^{(1)}+\cdots +(t-t_k)\cdots (t-t_{k-q+1})f_k^{(q)}\Big )dt + \zeta \end{aligned} \end{aligned}$$
(3)

where \(f_k^{(q)}\) is the \(q\)th divided difference of the function f at time point \(t_k\) (with \(f_k=f_k^{(0)}\)), and \(\zeta \) is the truncation error. Taking the second-order method as an example and solving the integral yields:

$$\begin{aligned} x_{k+1}-x_{k}= f_k(t_{k+1}-t_{k})+\bigg [\frac{t^2}{2}-t_kt\bigg ]_{t_k}^{t_{k+1}} \frac{f_k-f_{k-1}}{t_k-t_{k-1}} \end{aligned}$$
(4)

and replacing \(t_{k+1}-t_k\) with \(h_0\) and \(t_k-t_{k-1}\) with \(h_1\), the following variable-step integration formula is obtained:

$$\begin{aligned} x_{k+1}-x_{k}=f_kh_0\left( 1+\frac{h_0}{2h_1}\right) -f_{k-1}h_0\left( \frac{h_0}{2h_1}\right) \end{aligned}$$
(5)

Similar equations can be obtained for different polynomial orders q. Figure 3 shows the stability plots of the second-order method (5) and of the third-order method for different ratios between the integration step sizes \(h_i\). The plots show how the variation of the ratio \(r=h_i/h_{i+1}\) affects the stability of both the second- and the third-order Adams–Bashforth methods. The stable values of the integration step \(h_i\) decrease as the ratio is increased, and are clearly smaller for the third-order method. Hence, in order to manage larger values of \(h_i\), the second-order variable-step integration method has been used in this work.
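Formula (5) is compact enough to state directly in code. A minimal scalar sketch of one variable-step second-order update follows (the function name is ours):

```cpp
// One variable-step AB2 update following (5):
// x_{k+1} = x_k + f_k*h0*(1 + h0/(2*h1)) - f_{k-1}*h0*h0/(2*h1),
// with h0 = t_{k+1} - t_k and h1 = t_k - t_{k-1}.
double ab2_step(double xk, double fk, double fkm1, double h0, double h1) {
    const double w = h0 / (2.0 * h1);  // variable-step weight
    return xk + fk * h0 * (1.0 + w) - fkm1 * h0 * w;
}
```

For \(h_0 = h_1\) the weights reduce to the classical fixed-step coefficients 3/2 and \(-1/2\), which provides a quick sanity check.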

Fig. 3
figure 3

Stability regions for the second and third order AB methods

2.2 Fast Approximation of Maximum Step Size

In [12], a fast numerical integration of the state equations representing many passive systems is presented. Given a set of linear ordinary differential equations (ODEs):

$$\begin{aligned} \dot{X}(t_k)=A_kX(t_k) \end{aligned}$$
(6)

where \(A_k\) is negative definite and diagonally dominant, the integration method is numerically stable if the integration step size h satisfies

$$\begin{aligned} h \le \frac{1}{\underset{r=1,\ldots ,N}{\max }\big (\beta _{max}|a_{r,r}|\big )} \end{aligned}$$
(7)

where \(a_{r,r}\) are the diagonal elements of A and \(\beta _{max} =\max (|\beta _0|,\ldots ,|\beta _p|)\) is the maximum modulus of the coefficients of the \(p\)th-order Adams–Bashforth formula.

However, in case the Jacobian A is not negative definite, the following estimate can be used instead [4]. The Frobenius norm of an \(m \times m\) matrix, \(\Vert A \Vert _F\), defined as:

$$\begin{aligned} \Vert A \Vert _F =\sqrt{\sum _{i=1}^{m}\sum _{j=1}^{m}|a_{i,j}|^2} \end{aligned}$$
(8)

is such that \(\Vert A \Vert _F \ge \Vert A \Vert \). Then the step size value which guarantees stability is bounded by:

$$\begin{aligned} h\le \frac{L_r}{\sqrt{\sum _{i=1}^{m}\sum _{j=1}^{m}|a_{i,j}|^2}} \end{aligned}$$
(9)

where \(L_r\) is the absolute value of the intersection of the stability plot in Fig. 3 with the negative real axis for a given r. In both cases, the proposed fast estimate of the step size h guarantees stability, but there is a trade-off: the step sizes obtained are expected to be smaller than the maximum allowed step sizes that would result from the exact calculation of the Jacobian's eigenvalues. However, the overall CPU simulation time is expected to be shorter.
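Both bounds are deliberately cheap: (7) visits only the diagonal of the Jacobian and (9) visits each entry once, with no eigenvalue computation involved. A host-side sketch of the two estimates, assuming dense row-major storage (the function names are ours):

```cpp
#include <cmath>

// Estimate (7): h <= 1 / max_r(beta_max * |a_rr|), valid when the
// N x N Jacobian A is negative definite and diagonally dominant.
// beta_max is the largest |coefficient| of the chosen AB formula.
double h_bound_diag(const double* A, int N, double beta_max) {
    double worst = 0.0;
    for (int r = 0; r < N; ++r)
        worst = std::fmax(worst, beta_max * std::fabs(A[r * N + r]));
    return 1.0 / worst;
}

// Estimate (9): h <= L_r / ||A||_F, where L_r is the intersection of
// the stability boundary with the negative real axis for the step
// ratio r in use (read off Fig. 3).
double h_bound_frobenius(const double* A, int N, double L_r) {
    double s = 0.0;
    for (int k = 0; k < N * N; ++k) s += A[k] * A[k];
    return L_r / std::sqrt(s);
}
```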

2.3 Parallel Implementation of the Explicit Integration Method

According to the programming model described in CUDA, a GPU can be defined as a computing device with its own memory, capable of running many threads in parallel. Following GPU terminology, a program running on a GPU is referred to as a kernel, whose instructions can be executed in parallel over different streaming multiprocessors. There are different levels of parallelism inside a GPU: the threads launched by a kernel are grouped into thread blocks, and inside every block, threads are grouped into warps, each one containing 32 threads. In a typical GPU, a thread block may contain up to 1024 threads. All threads in the same block can access a common shared memory, while threads from different blocks can only communicate through global memory. This hierarchy of memories and levels of parallelism offers different possibilities to program the same algorithm, although some considerations must be taken into account in order to achieve the best performance of high-performance parallel algorithms [17]. It is preferable to make extensive use of the shared memory, as far as the algorithm allows it, because it is faster than the global one. Moreover, each thread can access its own registers for local variables, which are on-chip and are the fastest among the GPU memories; however, they are very limited in size. On the other hand, threads inside the same warp run following a single-instruction-multiple-threads (SIMT) pattern. In the case of instruction divergence inside a warp, threads corresponding to different instructions are executed serially, which decreases the overall efficiency. Finally, the interaction between the GPU and the CPU, required to launch the kernels, consumes a significant portion of the overall computational resources, so data movements between CPU and GPU should be minimised.
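As a concrete reference for this terminology, a minimal kernel is sketched below (a generic illustration, not part of the proposed simulator): each thread derives a global index from its block and thread identifiers and processes one element.

```cuda
// Minimal CUDA kernel: the kernel is launched over a grid of thread
// blocks and each thread computes one vector element, located through
// its block and thread indices.
__global__ void axpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)                                      // guard stray threads
        y[i] = a * x[i] + y[i];
}

// Host-side launch with 1024 threads per block (a typical maximum):
//   axpy<<<(n + 1023) / 1024, 1024>>>(n, 2.0f, d_x, d_y);
```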

The advantage of explicit methods is that the linearised system of equations (2) can be computed on a parallel architecture at each time point \(t_k\), since each state variable x can be worked out at each time point independently of the rest of the state variables.

Fig. 4
figure 4

Distribution over multiple threads of multiply and accumulation operations to compute a state variable at a given time

Figure 4 shows the implementation of the integration algorithm proposed in this work on a GPU. The figure is a simplified schema of the device architecture. Threads are grouped into thread blocks, each containing a shared memory, while all the blocks have access to a common global memory. In the figure, blocks 2 to M have the same architecture as that shown for block 1. The integration scheme for each single state variable x runs on a single thread. This allows all the threads of a block to access the same shared memory and achieve a higher efficiency. This distribution allows the processing of up to 1024 variables of (2) in the same thread block. Larger matrices require additional thread blocks, which decreases the memory bandwidth because the slower global memory needs to be accessed more frequently. The figure also shows the interaction between the CPU and the GPU. The CPU executes a loop until the time point \(t_k\) reaches the final simulation time. There are three calls to the GPU inside each loop iteration. The first one computes the values of the Jacobian and coefficient matrices of (2). The second one computes the incremental values of the state variables and compares these values with the predefined simulation tolerances. Finally, the third GPU call checks whether the simulation tolerances have been violated and updates the values of both the state variables and the time steps h. These two last consecutive calls provide a synchronisation barrier before evaluating whether there is a tolerance violation in any of the GPU threads.
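A schematic of this CPU-side loop is sketched below. Kernel names, signatures and the time bookkeeping are illustrative assumptions, not the original code; the relevant point is that kernels issued to the default CUDA stream execute in order, which realises the synchronisation barrier between the second and third calls.

```cuda
#include <cuda_runtime.h>

// Illustrative kernels standing in for the three GPU calls (bodies omitted).
__global__ void build_matrices(double* A, double* B, const double* x,
                               double t) { /* fill A_k and B of (2) */ }
__global__ void compute_increments(const double* A, const double* B,
                                   const double* x, double* dx,
                                   int* ctrl) { /* dx for three steps,
                                                   tolerance checks */ }
__global__ void update_state(double* x, const double* dx, const int* ctrl,
                             double* h) { /* apply control decisions */ }

// Host-side loop: one iteration per accepted time point.
void simulate(int M, int T, double t_final, double* d_A, double* d_B,
              double* d_x, double* d_dx, int* d_ctrl, double* d_h) {
    double t_k = 0.0, h = 0.0;
    while (t_k < t_final) {
        build_matrices<<<M, T>>>(d_A, d_B, d_x, t_k);              // 1st call
        compute_increments<<<M, T>>>(d_A, d_B, d_x, d_dx, d_ctrl); // 2nd call
        update_state<<<M, T>>>(d_x, d_dx, d_ctrl, d_h);            // 3rd call
        // fetch the step actually taken to advance the host-side clock
        cudaMemcpy(&h, d_h, sizeof(double), cudaMemcpyDeviceToHost);
        t_k += h;
    }
}
```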

Fig. 5
figure 5

Variable step parallelised integration method

In order to compare the incremental values of the variables at each time step (\(\Delta x_i\)) with the allowed tolerance, a step-control routine is run inside the second parallel GPU execution call. Three values of \(\Delta x_i\) are computed: \(\Delta x_{i,1}\) for the current step size, \(\Delta x_{i,0}\) for a lower step size, and \(\Delta x_{i,2}\) for a larger one. The three values of \(\Delta x_i\) are then compared with the predefined tolerance values to determine which increment must be added to the state variables at that time step. The results of the comparisons are annotated in a control register. In the third GPU call, if the value of \(\Delta x_{i,1}\) is within the limits of the specified tolerance, the three step values are increased by a ratio r and the state variable \(x_i\) is updated by an amount equal to \(\Delta x_{i,1}\) in every GPU parallel process. However, if the value of \(\Delta x_{i,0}\) is larger than the specified tolerance, a lower step h is required. In that case, the value of \(\Delta x_{i,2}\) is assessed to determine the rate of decrease of the integration step: r for a slow decrease or \(r^2\) for a faster one. If the value of \(\Delta x_{i,2}\) is lower than the tolerance, it can still be used to update the state variables. Otherwise, the values of \(\Delta x_{i}\) computed in the current iteration step cannot be used. This provides a fast mechanism to adapt the rate of decrease of h to the rate of change of the system variables and to reduce the number of parallel executions.
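The decision tree just described can be condensed into the following device-side sketch for one state variable. Variable names, the control codes and the handling of cases not stated above are our own assumptions, and in the actual method the comparisons and the update belong to two separate kernel calls; the sketch only mirrors the logic.

```cuda
// Control-register codes (illustrative).
enum StepAction { GROW = 0, SHRINK_SLOW = 1, SHRINK_FAST = 2 };

// Step-control decision for state variable i. dx0, dx1, dx2 are the
// increments for the lower, current and larger candidate step sizes.
__device__ void step_control(int i, double dx0, double dx1, double dx2,
                             double tol, double* x, int* ctrl) {
    if (fabs(dx1) <= tol) {
        x[i] += dx1;               // current-step increment accepted,
        ctrl[i] = GROW;            // candidate steps later scaled by r
    } else if (fabs(dx0) > tol) {  // even the lower step violates tol
        if (fabs(dx2) <= tol) {
            x[i] += dx2;           // dx2 is still usable
            ctrl[i] = SHRINK_SLOW; // step divided by r
        } else {
            ctrl[i] = SHRINK_FAST; // step divided by r^2; increments of
        }                          // this iteration are discarded
    }
}
```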

3 Examples

3.1 Interconnect with Inductance

In this first example, an interconnect modelled as a series of finite RLC segments is shown in Fig. 6. Both the voltages across the capacitors and the currents through the inductors are state variables; thus, the total number of state variables is twice that needed for an RC interconnect model. The matrix formulation of the RLC transmission line is given by (10):

$$\begin{aligned} \frac{d}{dt} \begin{bmatrix} i_1 \\ v_1 \\ i_2 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} = \begin{bmatrix} \frac{-R_1}{L_1} &{} \frac{-1}{L_1} &{} 0 &{} 0 &{} \cdots &{} 0 \\ \frac{1}{C_1} &{} \frac{-1}{C_1G_1} &{} \frac{-1}{C_2} &{} 0 &{} \cdots &{} 0 \\ 0 &{} \frac{1}{L_2} &{} \frac{-R_2}{L_2} &{} \frac{-1}{L_2} &{} \cdots &{} 0 \\ \vdots &{} &{} &{} &{} &{}\vdots \\ 0 &{} 0 &{} 0 &{} 0 &{} \cdots &{} \frac{-1}{C_nG_n} \\ \end{bmatrix} \begin{bmatrix} i_1 \\ v_1 \\ i_2 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} + \begin{bmatrix} \frac{1}{L_1} \\ \vdots \\ 0 \\ \end{bmatrix} v_i \end{aligned}$$
(10)

The equation has been obtained through nodal analysis and manual transformation, although the method detailed in [11] may be useful for more complex circuits. Simulations of RLC interconnects of different lengths have been performed in order to compare the proposed method with CUSPICE [14], a parallelised version of SPICE for GPUs. The following component values per discrete section have been used: C = 1 fF, L = 100 pH, \(R = 10\,\Omega \), \(G = 400\,\Omega ^{-1}\), with a 1 V step as excitation. The GPU used has been a general-purpose NVIDIA GeForce GTX 1080 GPU with 3584 cores, a 1531 MHz clock and 11 GB of RAM.
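As a reference for how (10) is assembled, the following sketch fills the \(2n \times 2n\) state matrix for a uniform line. Dense row-major storage and the function name are our illustrative choices; an actual implementation would exploit the banded structure.

```cpp
// Fills the 2n x 2n state matrix of (10) for a uniform RLC line.
// State ordering: i1, v1, i2, v2, ..., in, vn.
void build_rlc_matrix(double* A, int n, double R, double L,
                      double C, double G) {
    const int N = 2 * n;
    for (int k = 0; k < N * N; ++k) A[k] = 0.0;
    for (int m = 0; m < n; ++m) {
        const int ri = 2 * m;      // row of the segment current
        const int rv = 2 * m + 1;  // row of the segment voltage
        A[ri * N + ri] = -R / L;                       // -R/L on own current
        A[ri * N + rv] = -1.0 / L;                     // -1/L on own voltage
        if (m > 0) A[ri * N + ri - 1] = 1.0 / L;       // +1/L on v of previous node
        A[rv * N + ri] = 1.0 / C;                      // +1/C on own current
        A[rv * N + rv] = -1.0 / (C * G);               // as written in (10)
        if (m < n - 1) A[rv * N + ri + 2] = -1.0 / C;  // -1/C on next current
    }
    // The 1 V step input enters through B: B[0] = 1/L (row of i1).
}
```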

Fig. 6
figure 6

Analog model of a coupled interconnect segment

The step size in the explicit integration was \(10^{-13}\) s, and CUSPICE simulations were performed using two different step-size limits: \(10^{-10}\) s and \(10^{-11}\) s. Figure 7 shows the transient step response at the 100th node of the RLC interconnect line when a 1 V signal is applied to its input. The red plot shows the result obtained using the proposed method, while the blue one displays the results obtained from CUSPICE. As shown in the figure, spurious numerical ringing is noticeable in the CUSPICE results, while it is negligible, after the initial transient period, in the method proposed in this work, which converges very quickly to the expected 1 V final value. Table 2 shows the CPU times for both methods, the proposed method with a step size of \(10^{-13}\) s and CUSPICE with a step size of \(10^{-11}\) s, given that the latter offers less ringing. CUSPICE, despite using a step size two orders of magnitude larger than that of our method, is significantly slower, with the proposed method reaching a speedup of almost an order of magnitude for 10,000 RLC segments.

Fig. 7
figure 7

Simulation results for the 100th node of the RLC interconnect line

Table 2 GPU simulation time for proposed method and CUSPICE

3.2 MOS Vision Chip

The second example focuses on a MOS-C network aimed at performing Gaussian filtering on a CMOS image sensor [5]. Such a sensor is composed of an array of \(m \times n\) pixels, as shown in Fig. 8a, where each pixel typically contains a photodiode and a reduced number of transistors. A simplified model of an \(m \times n\) CMOS image sensor is shown in Fig. 8b.

Fig. 8
figure 8

Analog implementation of a Gaussian filtering MOS-C network

In the figure, each capacitor models the storage of the light intensity value captured by a single pixel in a real circuit, while the MOS transistors replace the resistors of a conventional RC network. The MOS devices work as voltage-controlled resistors to perform an analogue, time-controlled Gaussian filtering on the image captured by the image sensor. While the image is being captured by the sensor circuitry, the gate voltage \(V_G\) is null, keeping the devices in the cutoff region. Once the capture is done, a given \(V_G\) is applied to the MOS gates to make them operate in the triode region. The variation of the voltage along time is a Gaussian function with \(\sigma = (2t_{ON}/RC)^{0.5}\) [10], where \(t_{ON}\) is the time during which the MOS devices are driving current. The current flowing into a pixel (i,j) is given by the contribution of the currents of the four neighbouring devices following:

$$\begin{aligned} \begin{aligned} C\frac{dV_{i,j}}{dt}=\frac{V_{i,j+1}-V_{i,j}}{R_{i,j+1}}+\frac{V_{i,j-1}-V_{i,j}}{R_{i,j-1}} +\frac{V_{i+1,j}-V_{i,j}}{R_{i+1,j}}+\frac{V_{i-1,j}-V_{i,j}}{R_{i-1,j}} \end{aligned} \end{aligned}$$
(11)

where the node voltages \(V_{i,j}\) are the state variables in this example and \(R_{i,j+1}\equiv R_{i,j+1}(v)\) is the voltage-dependent MOS channel resistance between nodes (i,j) and (i,j+1), given by:

$$\begin{aligned} R_{i,j+1}= \frac{L}{KW\Big (V_G-V_{th}-0.5(V_{i,j}+V_{i,j+1} )\Big )} \end{aligned}$$
(12)

A simplified version of this problem is solved in [4] using a model in which the initial values of \(V_{i,j}\) and \(V_{i,j+1}\) are used in (12), turning \(R_{i,j+1}\) into a constant and thereby allowing a fixed integration step. In this work, a more complex problem is addressed, where variable values of \(R_{i,j+1}\) are used. This leads to a Jacobian matrix with variable coefficients, as well as to a variable integration step needed to adapt the simulation speed to the voltage values at each time point. These characteristics require additional computations for the step-size estimates. Thus, replacing (12) in (11), and defining \(V_{GT}=V_G-V_{th}\) and \(K_R=L/KW\), equation (11) can be rewritten as:

$$\begin{aligned} \begin{aligned} K_R C \frac{dV_{i,j}}{dt} =&-\Big [4V_{GT} V_{i,j}-V_{GT} (V_{i,j+1} +V_{i,j-1} +V_{i+1,j} +V_{i-1,j})\\ \quad&-2V_{i,j}^2 +0.5(V_{i,j+1}^2 + V_{i,j-1}^2 +V_{i+1,j}^2 +V_{i-1,j}^2) \Big ] \end{aligned} \end{aligned}$$
(13)

where the derivative of the voltage \(V_{i,j}\) with respect to time is a function of linear and quadratic terms of the node's own voltage and of the adjacent node voltages. The quadratic terms are then replaced by their linearised form around the points \(V_{u,v}(0)\) using a first-order Taylor approximation, for \(u=i-1,i,i+1\) and \(v=j-1,j,j+1\), to obtain:

$$\begin{aligned} \begin{aligned} K_R C \frac{dV_{i,j}}{dt}=&-4\big (V_{GT}-V_{i,j} (0)\big ) V_{i,j}\\ \quad&+\big (V_{GT}-V_{i,j+1} (0)\big ) V_{i,j+1}+\big (V_{GT}-V_{i,j-1} (0)\big ) V_{i,j-1}\\ \quad&+\big (V_{GT}-V_{i+1,j} (0)\big ) V_{i+1,j}+\big (V_{GT}-V_{i-1,j} (0)\big ) V_{i-1,j}\\ \quad&-2V_{i,j} (0)^2+0.5\big (V_{i,j+1} (0)^2+V_{i,j-1} (0)^2+V_{i+1,j} (0)^2+V_{i-1,j} (0)^2\big ) \end{aligned} \end{aligned}$$
(14)

The matrix formulation of (14) applied to the whole \(m \times n\) image sensor is:

$$\begin{aligned} \frac{d}{dt} \begin{bmatrix} V_{1,1} \\ V_{1,2} \\ \vdots \\ V_{m,n} \\ \end{bmatrix} = \begin{bmatrix} A_{1,1} &{} A_{1,2} &{} 0 &{} \cdots &{} 0\\ A_{2,1} &{} A_{2,2} &{} A_{2,3} &{} \cdots &{} 0\\ 0 &{} A_{3,2} &{} A_{3,3} &{} \cdots &{} 0\\ \vdots &{} &{} &{} &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} A_{m,m}\\ \end{bmatrix} \cdot \begin{bmatrix} V_{1,1} \\ V_{1,2} \\ \vdots \\ V_{m,n} \\ \end{bmatrix} +B \end{aligned}$$
(15)

where \(A_{i,j}\) are \(n \times n\) submatrices and B is a column vector with \(m \cdot n\) entries containing the independent terms. The submatrices \(A_{r,r}\) \((r=2,\ldots ,m-1)\) on the diagonal of A are defined as:

$$\begin{aligned} A_{r,r}= \begin{bmatrix} -3\big (V_{GT}-V_{r,1}(0)\big ) &{} 0 &{} \cdots &{} 0\\ V_{GT}-V_{r,1}(0) &{} -4\big (V_{GT}-V_{r,2}(0)\big ) &{} \cdots &{} 0\\ 0 &{} V_{GT}-V_{r,2}(0) &{} \cdots &{} 0\\ \vdots &{} &{} &{} \vdots \\ 0 &{} 0 &{} \cdots &{} V_{GT}-V_{r,n-1}(0)\\ 0 &{} 0 &{} \cdots &{} -3\big (V_{GT}-V_{r,n}(0)\big )\\ \end{bmatrix} \end{aligned}$$
(16)

Submatrices \(A_{r,r}\) \((r=1,m)\) refer to pixels placed in the first and last rows of the image sensor. In that case the matrices are similar to the previous one, but with the constants \(-3\) and \(-4\) in the main diagonal replaced by \(-2\) and \(-3\), respectively. The submatrices \(A_{r,r\pm 1}\) are diagonal matrices defined as:

$$\begin{aligned} A_{r,r\pm 1}= \begin{bmatrix} V_{GT}-V_{r\pm 1,1}(0) &{} 0 &{} \cdots &{} 0 \\ 0 &{} V_{GT}-V_{r\pm 1,2}(0) &{} \cdots &{} 0 \\ \vdots &{} &{} &{} \vdots \\ 0 &{} 0 &{} \cdots &{} V_{GT}-V_{r\pm 1,n}(0)\\ \end{bmatrix} \end{aligned}$$
(17)

Finally, the column vector B, which includes the current linearisation values \(V_{u,v}(0)\), is obtained as:

$$\begin{aligned} B=\begin{bmatrix} V_{1,1}(0)^2+0.5\big (V_{1,2}(0)^2+V_{2,1}(0)^2\big ) \\ 1.5V_{1,2}(0)^2+0.5\big (V_{1,1}(0)^2+V_{1,3}(0)^2+V_{2,2}(0)^2\big ) \\ 1.5V_{1,3}(0)^2+0.5\big (V_{1,2}(0)^2+V_{1,4}(0)^2+V_{2,3}(0)^2\big ) \\ \vdots \\ V_{m,n}(0)^2+0.5\big (V_{m-1,n}(0)^2+V_{m,n-1}(0)^2\big ) \\ \end{bmatrix} \end{aligned}$$
(18)
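In the parallel implementation, the matrices (16) to (18) need not be assembled explicitly: since the derivative of each pixel voltage in (14) involves only the pixel itself and its four neighbours, one GPU thread per pixel can evaluate the right-hand side directly. A sketch of such a kernel follows (names are ours; edge pixels simply skip the missing neighbour terms, which reproduces the \(-3\) and \(-2\) diagonal factors of (16) and the corner entries of (18)):

```cuda
// One thread per pixel: evaluates the linearised right-hand side (14)
// without assembling (15). V0 holds the linearisation points V(0);
// KRC = K_R * C. Follows the sign convention of (14).
__global__ void mosc_rhs(int m, int n, double VGT, double KRC,
                         const double* V, const double* V0, double* dV) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    if (i >= m || j >= n) return;
    int p = i * n + j;
    int nb[4] = { (j + 1 < n) ? p + 1 : -1, (j > 0) ? p - 1 : -1,
                  (i + 1 < m) ? p + n : -1, (i > 0) ? p - n : -1 };
    double acc = 0.0;
    int deg = 0;                   // number of existing neighbours
    for (int k = 0; k < 4; ++k) {
        if (nb[k] < 0) continue;   // edge pixel: neighbour missing
        ++deg;
        acc += (VGT - V0[nb[k]]) * V[nb[k]] + 0.5 * V0[nb[k]] * V0[nb[k]];
    }
    // -deg*(VGT - V(0))*V is the own-node linear term and
    // -0.5*deg*V(0)^2 its constant term (-2*V(0)^2 when deg = 4).
    dV[p] = (-deg * (VGT - V0[p]) * V[p]
             - 0.5 * deg * V0[p] * V0[p] + acc) / KRC;
}
```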

The simulation technique proposed in this work has been applied to simulate the system described in equations (15) to (18) using the same GPU as in the previous RLC example. The image sizes used in this example are those of small CMOS smart image sensors, whose dimensions are restricted in order to include additional electronics close to the pixels. The simulations have been done modelling the capacitance of the CMOS sensor pixels as \(C=10\) pF and using the following device parameters: \(K=50\) \(\mu A/V^2\), \(W/L=1\), \(V_{th}=0.5\) V and \(V_G=3\) V. The processor times required for each transient simulation are detailed in Table 3, where they are compared with those required by Spectre and CUSPICE. It can be noted how the throughput of the three methods scales differently as the complexity of the problem increases. For the explicit method, when the image size grows from 8 \(\times \) 8 to an image approximately 300 times larger, the processing time increases by a factor of 2.27. However, for the implicit methods this time increases 128 times using the Spectre simulator and 104 times using CUSPICE. Thus, although CUSPICE and Spectre are faster for small images, for images larger than 64 \(\times \) 64 pixels the proposed explicit method offers shorter simulation times, and the speedup grows with the image size.

As an example of the image filtering simulation described above, Fig. 9 shows the result when a MOS conduction time of 25 ns is applied to an image extracted from the CIFAR-10 image dataset. The intensity value of each pixel is coded between 0 and 1 V. Compared with the simulation using CUSPICE, the explicit one presents an absolute root mean square error (RMSE) of \(1.96\cdot 10^{-3}\) units.

Table 3 GPU simulation time for proposed method and CUSPICE
Fig. 9
figure 9

a Original and b filtered images obtained through the simulation of the MOS-C image Gaussian filter applied to a 32 \(\times \) 32 pixels image for t=25 ns

4 Conclusion

This paper has presented a numerical method based on a variable-step explicit integration technique that speeds up transient simulations of complex analogue circuits on GPUs. The method has shown promising results when solving large VLSI interconnect models, whose inclusion in digital VLSI simulations is increasingly important as clock frequencies reach 10 GHz and beyond. Given that interconnect may constitute a large part of an analogue or mixed-signal VLSI system, this method can be a useful approach in the development of modern VLSI design tools. The method has also been used to simulate a MOS-C network aimed at performing Gaussian filtering on small and medium-size images, a common operation in computer vision applications. Given the strict requirements in terms of speed and power dissipation of embedded vision systems, mixed-signal smart vision chips combining processing in both the analogue and digital domains are expected to increase their market share. Thus, the proposed method can contribute to shortening the time to market of such systems. The results of these examples illustrate the potential of parallelised explicit integration methods to solve the vast numbers of equations representing passive analogue circuits.