1 Introduction

The advent of the SPICE simulator half a century ago marked a key advance in the design cycle of analogue circuits. Since then, different versions of this simulator have been developed to analyse the behaviour of analogue circuits using techniques based on modified nodal analysis. This technique relies on implicit integration schemes combined with the Newton–Raphson method to solve the circuit equations at each time step. Although such methods are numerically stable and reliable, they are computationally expensive and require long CPU times when large circuits are simulated, because the Jacobian matrix of the analogue circuit must be built and factorised multiple times at each time step. The computational workload can be reduced if explicit integration methods are used instead. However, these techniques need to limit the step size to ensure numerical stability. In the general case of a non-linear analogue system, whose equations are stiff due to a large disparity of time constants, this limitation can be very severe, and in such cases it is preferable to use implicit numerical methods. Thus, the choice of one method or the other ultimately depends on the nature of the analogue circuit to be evaluated. Different works have shown that state-space equations combined with explicit integration methods are a suitable technique to speed up transient simulations of linear analogue circuits whose equations are not stiff [13], or [12], which employs estimates of the maximum allowed step size to obtain faster simulation of mixed-nature analogue systems. However, new techniques are required to speed up transient simulations of analogue circuits of increasing complexity. In recent years, many efforts have been made to fit analogue integration methods onto parallel computer architectures.

Alongside this, the Compute Unified Device Architecture (CUDA) [17] has provided design engineers with software tools to use Graphics Processing Units (GPUs) as relatively cheap and accessible parallel computers to perform fast simulations of many types of scientific and engineering problems. In the case of analogue circuits, the literature contains several proposals to speed up simulations using GPUs, as summarised in Table 1. [6, 19] describe methods that increase simulation speed by using GPUs to perform fast device model evaluation. Other works have focused on LU factorisation matrix solvers [1, 7, 8, 15], achieving solid speedups compared with traditional parallel sparse solvers such as PARDISO [20] or KLU [2]. In [16], a sparse matrix solver intended for GPUs and many-core processors in general is proposed and tested on an Intel Xeon Phi architecture. Another sparse matrix solver, cuSolver [18], has been released by NVIDIA, although its LU factorisation is still performed on the CPU rather than on the GPU.

Table 1 Proposals to speed up simulations using GPUs

All these works have focused on the traditional implicit integration methods used in SPICE-like simulators. Regarding explicit numerical methods, an explicit integration method for constant coefficients, parallelisable over a many-core processor, has recently been proposed [3]. This method combines state-space equations with a variable-step explicit scheme based on the Adams–Bashforth integration formula to speed up the simulation of passive RC circuits. In this paper, that method is explored further. On the one hand, the fast estimate of the maximum allowed step size is applied to an RLC interconnect. This circuit poses a harder challenge on the step-size requirements than the RC interconnect considered in [3]. On the other hand, the technique is extended to operate with variable coefficients. This yields a system of nonlinear equations which must subsequently be linearised in order to be solved using the proposed integration technique. Moreover, this adds new couplings between the equations to be solved, which increases the synchronisation effort of the parallel processes being executed on the GPU. The method is then applied to a CMOS image sensor circuit described in [4], using variable voltages instead of the simpler model described in that work. The paper is organised as follows. After this introduction, Section 2 discusses the stability analysis for fixed- and variable-step Adams–Bashforth methods and describes the fast step-size estimation technique. Section 3 demonstrates the technique with the two mentioned examples of analogue passive circuits. Finally, conclusions are discussed in Section 4.

2 Stability Analysis

The state equation of a nonlinear, passive dynamic system is defined by the continuous function (1):

$$\begin{aligned} \dot{x}(t)=f\big (x(t),t\big );\quad x(0)=x_0 \end{aligned}$$
(1)

The linearised form of (1) at time point \(t_k\), \(k = 0, 1,\ldots \), is given by:

$$\begin{aligned} \dot{x}(t_k)=A_kx(t_k)+Be_x(t_k) \end{aligned}$$
(2)

where \(x\) is the vector of N state variables, \(e_x\) is a vector of excitations, \(A_k\) is the Jacobian of the linearised model at the time point \(t_k\) and B is a coefficient matrix. Explicit methods can be applied to this equation to provide a fast integration process, given that the state equation (1) describes a passive system and thus the eigenvalues of the Jacobian \(A_k\) are expected to have negative real parts. However, explicit integration methods require the step size to be limited, not only to control the accuracy of the numerical solution but, more importantly, to ensure the stability of the method itself. The maximum allowed step size is obtained from the spectral radius of \(A_k\), whose computation involves time-consuming operations such as matrix multiplications and eigenvalue calculations. This drawback can be overcome using the fast method to calculate approximate step-size bounds for stability [12]. Such approximate techniques yield step sizes which are smaller than the maximum allowed step sizes obtained from the exact values of the Jacobian's eigenvalues, but this is compensated by the speed with which these estimates are obtained.

Figure 1 shows the well-known stability plots of explicit fixed-step (h) Adams–Bashforth methods of different orders. This family of methods uses the previously computed values at time steps \(t_{k-n}\) to \(t_{k-1}\) to interpolate the value at \(t_k\) [9]. In the figure, the values of maximum \(\lambda h\) which guarantee stability are plotted in the complex plane, where \(\lambda \) denotes the eigenvalues of the Jacobian matrix A. It shows that the maximum acceptable absolute value \(|\lambda h| = 2\) is achieved for the first-order method, and that this value decreases as the order increases, representing a balance between the accuracy of higher-order methods and the stability of lower-order ones.

Fig. 1
figure 1

Stability of fixed-step Adams–Bashforth methods
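The curves of Fig. 1 can be reproduced with the standard boundary-locus technique: substituting \(z = e^{i\theta }\) into the characteristic equation of each method and solving for \(\lambda h\) traces the boundary of the stability region. The following minimal host-side sketch (plain C++, given here only as an illustration of the textbook technique, not code from the original implementation) prints the loci for the first three orders:

```cpp
// Boundary locus of the Adams-Bashforth stability regions (Fig. 1):
// set z = exp(i*theta) in the characteristic equation of each method
// and solve for lambda*h to trace the stability boundary.
#include <complex>
#include <cstdio>

int main() {
    const double PI = 3.14159265358979323846;
    for (int d = 0; d <= 360; ++d) {
        std::complex<double> z = std::polar(1.0, d * PI / 180.0);
        std::complex<double> ab1 = z - 1.0;                        // AB1
        std::complex<double> ab2 = (z * z - z) / (1.5 * z - 0.5);  // AB2
        std::complex<double> ab3 = 12.0 * (z * z * z - z * z) /
                                   (23.0 * z * z - 16.0 * z + 5.0);// AB3
        std::printf("%g %g  %g %g  %g %g\n", ab1.real(), ab1.imag(),
                    ab2.real(), ab2.imag(), ab3.real(), ab3.imag());
    }
}
```

At \(\theta = \pi \) the first-order locus gives \(\lambda h = -2\), recovering the \(|\lambda h| = 2\) limit quoted above.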

2.1 Stability of Variable-Step Methods

Stability becomes harder to achieve when variable-step integration is used. Figure 2 shows a finite-difference grid for a q-order Adams–Bashforth method, where \(t_k\) is the current time point and \(t_{k+1}\) the next one, \(x_k\) is the value of x at time step \(t_k\), \(P_q(t)\) is the interpolation polynomial of order q, and \(\Delta x\) is the unknown in the integration problem.

Fig. 2
figure 2

Finite difference grid for the qth-order Adams–Bashforth method

The time step between two consecutive time points \(t_k\) and \(t_{k+1}\) is defined as \(h_k=t_{k+1}-t_k\). In a variable-step method, its value changes with time, so, in order to obtain the general expressions for the variable-step method, the divided-difference polynomial approximation between the current variable value \(x_k \equiv x(t_k)\) and the predicted one \(x_{k+1} \equiv x(t_{k+1})\) must be integrated as follows [9]:

$$\begin{aligned} \begin{aligned} I=&\int _{x_{k}}^{x_{k+1}}dx =\int _{t_k}^{t_{k+1}}P_q(t)\,dt\Rightarrow \Delta x = x_{k+1}-x_{k} \\ =&\int _{t_k}^{t_{k+1}} \Big (f_k+(t-t_k)f_k^{(1)}+\cdots +(t-t_k)\cdots (t-t_{k-q+1})f_k^{(q)}\Big )dt + \zeta \end{aligned} \end{aligned}$$
(3)

where \(f_k^{(q)}\) is the \(q\)th divided difference of the function f at time point \(t_k\) (with \(f_k=f_k^{(0)}\)), and \(\zeta \) is the truncation error. Taking the second-order method as an example and solving the integral yields:

$$\begin{aligned} x_{k+1}-x_{k}= f_k(t_{k+1}-t_{k})+\bigg [\frac{t^2}{2}-t_kt\bigg ]_{t_k}^{t_{k+1}} \frac{f_k-f_{k-1}}{t_k-t_{k-1}} \end{aligned}$$
(4)

and replacing \(t_{k+1}-t_k\) with \(h_0\) and \(t_k-t_{k-1}\) with \(h_1\), the following variable-step integration formula is obtained:

$$\begin{aligned} x_{k+1}-x_{k}=f_kh_0\left( 1+\frac{h_0}{2h_1}\right) -f_{k-1}h_0\left( \frac{h_0}{2h_1}\right) \end{aligned}$$
(5)

Similar equations can be obtained for different polynomial orders q. Figure 3 shows the stability plots of the second-order method (5) and of the third-order method for different ratios between the integration step sizes \(h_i\). The plots show how the variation of the ratio \(r=h_i/h_{i+1}\) affects the stability of both the second- and the third-order Adams–Bashforth methods. The stable values of the integration step \(h_i\) decrease as the ratio is increased, and are clearly smaller for the third-order method. Hence, in order to manage larger values of \(h_i\), the second-order variable-step integration method has been used in this work.
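Formula (5) is compact enough to state directly in code. A minimal scalar sketch of one variable-step second-order update follows (the function name is ours):

```cpp
// One variable-step AB2 update following (5):
// x_{k+1} = x_k + f_k*h0*(1 + h0/(2*h1)) - f_{k-1}*h0*h0/(2*h1),
// with h0 = t_{k+1} - t_k and h1 = t_k - t_{k-1}.
double ab2_step(double xk, double fk, double fkm1, double h0, double h1) {
    const double w = h0 / (2.0 * h1);  // variable-step weight
    return xk + fk * h0 * (1.0 + w) - fkm1 * h0 * w;
}
```

For \(h_0 = h_1\) the weights reduce to the classical fixed-step coefficients 3/2 and \(-1/2\), which provides a quick sanity check.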

Fig. 3
figure 3

Stability regions for the second and third order AB methods

2.2 Fast Approximation of Maximum Step Size

In [12], a fast numerical integration of the state equations representing many passive systems is presented. Given a set of linear ordinary differential equations (ODEs):

$$\begin{aligned} \dot{X}(t_k)=A_kX(t_k) \end{aligned}$$
(6)

where \(A_k\) is negative definite and diagonally dominant, the integration method is numerically stable if the integration step size h satisfies

$$\begin{aligned} h \le \frac{1}{\underset{r=1,\ldots ,N}{\max }\big (\beta _{max}|a_{r,r}|\big )} \end{aligned}$$
(7)

where \(a_{r,r}\) are the diagonal elements of A and \(\beta _{max} =\max (|\beta _0|,\ldots ,|\beta _p|)\) is the maximum modulus of the coefficients of the \(p\)th-order Adams–Bashforth formula.

However, in case the Jacobian A is not negative definite, the following estimate can be used instead [4]. The Frobenius norm of an \(m \times m\) matrix, \(\Vert A \Vert _F\), defined as:

$$\begin{aligned} \Vert A \Vert _F =\sqrt{\sum _{i=1}^{m}\sum _{j=1}^{m}|a_{i,j}|^2} \end{aligned}$$
(8)

is such that \(\Vert A \Vert _F \ge \Vert A \Vert \). Then the step size value which guarantees stability is bounded by:

$$\begin{aligned} h\le \frac{L_r}{\sqrt{\sum _{i=1}^{m}\sum _{j=1}^{m}|a_{i,j}|^2}} \end{aligned}$$
(9)

where \(L_r\) is the absolute value of the intersection of the stability plot in Fig. 3 with the negative real axis for a given r. In both cases, the proposed fast estimate of the step size h guarantees stability, but there is a trade-off: the step sizes obtained are expected to be smaller than the maximum allowed step sizes that would result from the exact calculation of the Jacobian's eigenvalues. However, the overall CPU simulation time is expected to be shorter.
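Both bounds are deliberately cheap: (7) visits only the diagonal of the Jacobian and (9) visits each entry once, with no eigenvalue computation involved. A host-side sketch of the two estimates, assuming dense row-major storage (the function names are ours):

```cpp
#include <cmath>

// Estimate (7): h <= 1 / max_r(beta_max * |a_rr|), valid when the
// N x N Jacobian A is negative definite and diagonally dominant.
// beta_max is the largest |coefficient| of the chosen AB formula.
double h_bound_diag(const double* A, int N, double beta_max) {
    double worst = 0.0;
    for (int r = 0; r < N; ++r)
        worst = std::fmax(worst, beta_max * std::fabs(A[r * N + r]));
    return 1.0 / worst;
}

// Estimate (9): h <= L_r / ||A||_F, where L_r is the intersection of
// the stability boundary with the negative real axis for the step
// ratio r in use (read off Fig. 3).
double h_bound_frobenius(const double* A, int N, double L_r) {
    double s = 0.0;
    for (int k = 0; k < N * N; ++k) s += A[k] * A[k];
    return L_r / std::sqrt(s);
}
```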

2.3 Parallel Implementation of the Explicit Integration Method

According to the programming model described in CUDA, a GPU can be defined as a computing device with its own memory, capable of running many threads in parallel. Following GPU terminology, a program running on a GPU is referred to as a kernel, whose instructions can be executed in parallel over different streaming multiprocessors. There are different levels of parallelism inside a GPU: the threads launched by a kernel are grouped into thread blocks, and inside every block, threads are grouped into warps, each one containing 32 threads. In a typical GPU, a thread block may contain up to 1024 threads. All threads in the same block can access a common shared memory, while threads from different blocks can only communicate through global memory. This hierarchy of memories and levels of parallelism offers different possibilities to program the same algorithm, although some considerations must be taken into account in order to achieve the best performance of high-performance parallel algorithms [17]. It is preferable to make extensive use of the shared memory, as far as the algorithm allows it, because it is faster than the global one. Moreover, each thread can access its own registers for local variables, which are on-chip and are the fastest among the GPU memories; however, they are very limited in size. On the other hand, threads inside the same warp run following a single-instruction-multiple-threads (SIMT) pattern. In the case of instruction divergence inside a warp, threads corresponding to different instructions are executed serially, which decreases the overall efficiency. Finally, the interaction between the GPU and the CPU, required to launch the kernels, consumes a significant portion of the overall computational resources, so data movements between CPU and GPU should be minimised.
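As a concrete reference for this terminology, a minimal kernel is sketched below (a generic illustration, not part of the proposed simulator): each thread derives a global index from its block and thread identifiers and processes one element.

```cuda
// Minimal CUDA kernel: the kernel is launched over a grid of thread
// blocks and each thread computes one vector element, located through
// its block and thread indices.
__global__ void axpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)                                      // guard stray threads
        y[i] = a * x[i] + y[i];
}

// Host-side launch with 1024 threads per block (a typical maximum):
//   axpy<<<(n + 1023) / 1024, 1024>>>(n, 2.0f, d_x, d_y);
```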

The advantage of explicit methods is that the linearised system of equations (2) can be computed on a parallel architecture at each time point \(t_k\), since each state variable x can be worked out at each time point independently of the rest of the state variables.

Fig. 4
figure 4

Distribution over multiple threads of multiply and accumulation operations to compute a state variable at a given time

Figure 4 shows the implementation of the integration algorithm proposed in this work on a GPU. The figure is a simplified schema of the device architecture. Threads are grouped into thread blocks, each containing a shared memory, while all the blocks have access to a common global memory. In the figure, blocks 2 to M have the same architecture as that shown for block 1. The integration scheme for each single state variable x runs on a single thread. This allows all the threads of a block to access the same shared memory and achieve a higher efficiency. This distribution allows the processing of up to 1024 variables of (2) in the same thread block. Larger matrices require additional thread blocks, which decreases the memory bandwidth because the slower global memory needs to be accessed more frequently. The figure also shows the interaction between the CPU and the GPU. The CPU executes a loop until the time point \(t_k\) reaches the final simulation time. There are three calls to the GPU inside each loop iteration. The first one computes the values of the Jacobian and coefficient matrices of (2). The second one computes the incremental values of the state variables and compares these values with the predefined simulation tolerances. Finally, the third GPU call checks whether the simulation tolerances have been violated and updates the values of both the state variables and the time steps h. These two last consecutive calls provide a synchronisation barrier before evaluating whether there is a tolerance violation in any of the GPU threads.
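A schematic of this CPU-side loop is sketched below. Kernel names, signatures and the time bookkeeping are illustrative assumptions, not the original code; the relevant point is that kernels issued to the default CUDA stream execute in order, which realises the synchronisation barrier between the second and third calls.

```cuda
#include <cuda_runtime.h>

// Illustrative kernels standing in for the three GPU calls (bodies omitted).
__global__ void build_matrices(double* A, double* B, const double* x,
                               double t) { /* fill A_k and B of (2) */ }
__global__ void compute_increments(const double* A, const double* B,
                                   const double* x, double* dx,
                                   int* ctrl) { /* dx for three steps,
                                                   tolerance checks */ }
__global__ void update_state(double* x, const double* dx, const int* ctrl,
                             double* h) { /* apply control decisions */ }

// Host-side loop: one iteration per accepted time point.
void simulate(int M, int T, double t_final, double* d_A, double* d_B,
              double* d_x, double* d_dx, int* d_ctrl, double* d_h) {
    double t_k = 0.0, h = 0.0;
    while (t_k < t_final) {
        build_matrices<<<M, T>>>(d_A, d_B, d_x, t_k);              // 1st call
        compute_increments<<<M, T>>>(d_A, d_B, d_x, d_dx, d_ctrl); // 2nd call
        update_state<<<M, T>>>(d_x, d_dx, d_ctrl, d_h);            // 3rd call
        // fetch the step actually taken to advance the host-side clock
        cudaMemcpy(&h, d_h, sizeof(double), cudaMemcpyDeviceToHost);
        t_k += h;
    }
}
```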

Fig. 5
figure 5

Variable step parallelised integration method

In order to compare the incremental values of the variables at each time step (\(\Delta x_i\)) with the allowed tolerance, a step-control routine is run inside the second parallel GPU execution call. Three values of \(\Delta x_i\) are computed: \(\Delta x_{i,1}\) for the current step size, \(\Delta x_{i,0}\) for a lower step size, and \(\Delta x_{i,2}\) for a larger one. The three values of \(\Delta x_i\) are then compared with the predefined tolerance values to determine which increment must be added to the state variables at that time step. The results of the comparisons are annotated in a control register. In the third GPU call, if the value of \(\Delta x_{i,1}\) is within the limits of the specified tolerance, the three step values are increased by a ratio r and the state variable \(x_i\) is updated by an amount equal to \(\Delta x_{i,1}\) in every GPU parallel process. However, if the value of \(\Delta x_{i,0}\) is larger than the specified tolerance, a lower step h is required. In that case, the value of \(\Delta x_{i,2}\) is assessed to determine the rate of decrease of the integration step: r for a slow decrease or \(r^2\) for a faster one. If the value of \(\Delta x_{i,2}\) is lower than the tolerance, it can still be used to update the state variables. Otherwise, the values of \(\Delta x_{i}\) computed in the current iteration step cannot be used. This provides a fast mechanism to adapt the rate of decrease of h to the rate of change of the system variables and to reduce the number of parallel executions.
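The decision tree just described can be condensed into the following device-side sketch for one state variable. Variable names, the control codes and the handling of cases not stated above are our own assumptions, and in the actual method the comparisons and the update belong to two separate kernel calls; the sketch only mirrors the logic.

```cuda
// Control-register codes (illustrative).
enum StepAction { GROW = 0, SHRINK_SLOW = 1, SHRINK_FAST = 2 };

// Step-control decision for state variable i. dx0, dx1, dx2 are the
// increments for the lower, current and larger candidate step sizes.
__device__ void step_control(int i, double dx0, double dx1, double dx2,
                             double tol, double* x, int* ctrl) {
    if (fabs(dx1) <= tol) {
        x[i] += dx1;               // current-step increment accepted,
        ctrl[i] = GROW;            // candidate steps later scaled by r
    } else if (fabs(dx0) > tol) {  // even the lower step violates tol
        if (fabs(dx2) <= tol) {
            x[i] += dx2;           // dx2 is still usable
            ctrl[i] = SHRINK_SLOW; // step divided by r
        } else {
            ctrl[i] = SHRINK_FAST; // step divided by r^2; increments of
        }                          // this iteration are discarded
    }
}
```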

3 Examples

3.1 Interconnect with Inductance

In this first example, an interconnect modelled as a series of finite RLC segments is shown in Fig. 6. Both the voltages across the capacitors and the currents through the inductors are state variables; thus, the total number of state variables is twice that needed for an RC interconnect model. The matrix formulation of the RLC transmission line is given by (10):

$$\begin{aligned} \frac{d}{dt} \begin{bmatrix} i_1 \\ v_1 \\ i_2 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} = \begin{bmatrix} \frac{-R_1}{L_1} &{} \frac{-1}{L_1} &{} 0 &{} 0 &{} \cdots &{} 0 \\ \frac{1}{C_1} &{} \frac{-1}{C_1G_1} &{} \frac{-1}{C_2} &{} 0 &{} \cdots &{} 0 \\ 0 &{} \frac{1}{L_2} &{} \frac{-R_2}{L_2} &{} \frac{-1}{L_2} &{} \cdots &{} 0 \\ \vdots &{} &{} &{} &{} &{}\vdots \\ 0 &{} 0 &{} 0 &{} 0 &{} \cdots &{} \frac{-1}{C_nG_n} \\ \end{bmatrix} \begin{bmatrix} i_1 \\ v_1 \\ i_2 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} + \begin{bmatrix} \frac{1}{L_1} \\ \vdots \\ 0 \\ \end{bmatrix} v_i \end{aligned}$$
(10)

The equation has been obtained through nodal analysis and manual transformation, although the method detailed in [11] may be useful for more complex circuits. Simulations of RLC interconnects of different lengths have been performed in order to compare the proposed method with CUSPICE [14], a parallelised version of SPICE for GPUs. The following component values per discrete section have been used: C = 1 fF, L = 100 pH, \(R = 10\,\Omega \), \(G = 400\,\Omega ^{-1}\), with a 1 V step as excitation. The GPU used has been a general-purpose NVIDIA GeForce GTX 1080 GPU with 3584 cores, a 1531 MHz clock and 11 GB of RAM.
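As a reference for how (10) is assembled, the following sketch fills the \(2n \times 2n\) state matrix for a uniform line. Dense row-major storage and the function name are our illustrative choices; an actual implementation would exploit the banded structure.

```cpp
// Fills the 2n x 2n state matrix of (10) for a uniform RLC line.
// State ordering: i1, v1, i2, v2, ..., in, vn.
void build_rlc_matrix(double* A, int n, double R, double L,
                      double C, double G) {
    const int N = 2 * n;
    for (int k = 0; k < N * N; ++k) A[k] = 0.0;
    for (int m = 0; m < n; ++m) {
        const int ri = 2 * m;      // row of the segment current
        const int rv = 2 * m + 1;  // row of the segment voltage
        A[ri * N + ri] = -R / L;                       // -R/L on own current
        A[ri * N + rv] = -1.0 / L;                     // -1/L on own voltage
        if (m > 0) A[ri * N + ri - 1] = 1.0 / L;       // +1/L on v of previous node
        A[rv * N + ri] = 1.0 / C;                      // +1/C on own current
        A[rv * N + rv] = -1.0 / (C * G);               // as written in (10)
        if (m < n - 1) A[rv * N + ri + 2] = -1.0 / C;  // -1/C on next current
    }
    // The 1 V step input enters through B: B[0] = 1/L (row of i1).
}
```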

Fig. 6
figure 6

Analog model of a coupled interconnect segment

The step size in the explicit integration was \(10^{-13}\) s, and CUSPICE simulations were performed using two different step-size limits: \(10^{-10}\) s and \(10^{-11}\) s. Figure 7 shows the transient step response at the 100th node of the RLC interconnect line when a 1 V signal is applied to its input. The red plot shows the result obtained using the proposed method, while the blue one displays the results obtained from CUSPICE. As shown in the figure, spurious numerical ringing is noticeable in the CUSPICE results, while it is negligible, after the initial transient period, in the method proposed in this work, which converges very quickly to the expected 1 V final value. Table 2 shows the CPU times for both methods, the proposed method with a step size of \(10^{-13}\) s and CUSPICE with a step size of \(10^{-11}\) s, given that the latter offers less ringing. CUSPICE, despite using a step size two orders of magnitude larger than that of our method, is significantly slower, with the proposed method reaching a speedup of almost an order of magnitude for 10,000 RLC segments.

Fig. 7
figure 7

Simulation results for the 100th node of the RLC interconnect line

Table 2 GPU simulation time for proposed method and CUSPICE

3.2 MOS Vision Chip

The second example focuses on a MOS-C network aimed at performing Gaussian filtering on a CMOS image sensor [5]. Such a sensor is composed of an array of \(m \times n\) pixels, as shown in Fig. 8a, where each pixel typically contains a photodiode and a reduced number of transistors. A simplified model of an \(m \times n\) CMOS image sensor is shown in Fig. 8b.

Fig. 8
figure 8

Analog implementation of a Gaussian filtering MOS-C network

In the figure, each capacitor models the storage of the light intensity value captured by a single pixel in a real circuit, while the MOS transistors replace the resistors of a conventional RC network. The MOS devices work as voltage-controlled resistors to perform an analogue, time-controlled Gaussian filtering on the image captured by the image sensor. While the image is being captured by the sensor circuitry, the gate voltage \(V_G\) is null, keeping the devices in the cutoff region. Once the capture is done, a given \(V_G\) is applied to the MOS gates to make them operate in the triode region. The variation of the voltage along time is a Gaussian function with \(\sigma = (2t_{ON}/RC)^{0.5}\) [10], where \(t_{ON}\) is the time during which the MOS devices are driving current. The current flowing into a pixel (i,j) is given by the contribution of the currents of the four neighbouring devices following:

$$\begin{aligned} \begin{aligned} C\frac{dV_{i,j}}{dt}=\frac{V_{i,j+1}-V_{i,j}}{R_{i,j+1}}+\frac{V_{i,j-1}-V_{i,j}}{R_{i,j-1}} +\frac{V_{i+1,j}-V_{i,j}}{R_{i+1,j}}+\frac{V_{i-1,j}-V_{i,j}}{R_{i-1,j}} \end{aligned} \end{aligned}$$
(11)

where the node voltages \(V_{i,j}\) are the state variables in this example and \(R_{i,j+1}\equiv R_{i,j+1}(v)\) is the voltage-dependent MOS channel resistance between nodes (i,j) and (i,j+1), given by:

$$\begin{aligned} R_{i,j+1}= \frac{L}{KW\Big (V_G-V_{th}-0.5(V_{i,j}+V_{i,j+1} )\Big )} \end{aligned}$$
(12)

A simplified version of this problem is solved in [4] using a model in which the initial values of \(V_{i,j}\) and \(V_{i,j+1}\) are used in (12), turning \(R_{i,j+1}\) into a constant and thereby allowing a fixed integration step. In this work, a more complex problem is addressed, where variable values of \(R_{i,j+1}\) are used. This leads to a Jacobian matrix with variable coefficients, as well as to a variable integration step needed to adapt the simulation speed to the voltage values at each time point. These characteristics require additional computations for the step-size estimates. Thus, replacing (12) in (11), and defining \(V_{GT}=V_G-V_{th}\) and \(K_R=L/KW\), equation (11) can be rewritten as:

$$\begin{aligned} \begin{aligned} K_R C \frac{dV_{i,j}}{dt} =&-\Big [4V_{GT} V_{i,j}-V_{GT} (V_{i,j+1} +V_{i,j-1} +V_{i+1,j} +V_{i-1,j})\\ \quad&-2V_{i,j}^2 +0.5(V_{i,j+1}^2 + V_{i,j-1}^2 +V_{i+1,j}^2 +V_{i-1,j}^2) \Big ] \end{aligned} \end{aligned}$$
(13)

where the derivative of the voltage \(V_{i,j}\) with respect to time is a function of linear and quadratic terms of the node's own voltage and of the adjacent node voltages. The quadratic terms are then replaced by their linearised form around the points \(V_{u,v}(0)\) using a first-order Taylor approximation, for \(u=i-1,i,i+1\) and \(v=j-1,j,j+1\), to obtain:

$$\begin{aligned} \begin{aligned} K_R C \frac{dV_{i,j}}{dt}=&-4\big (V_{GT}-V_{i,j} (0)\big ) V_{i,j}\\ \quad&+\big (V_{GT}-V_{i,j+1} (0)\big ) V_{i,j+1}+\big (V_{GT}-V_{i,j-1} (0)\big ) V_{i,j-1}\\ \quad&+\big (V_{GT}-V_{i+1,j} (0)\big ) V_{i+1,j}+\big (V_{GT}-V_{i-1,j} (0)\big ) V_{i-1,j}\\ \quad&-2V_{i,j} (0)^2+0.5\big (V_{i,j+1} (0)^2+V_{i,j-1} (0)^2+V_{i+1,j} (0)^2+V_{i-1,j} (0)^2\big ) \end{aligned} \end{aligned}$$
(14)

The matrix formulation of (14) applied to the whole \(m \times n\) image sensor is:

$$\begin{aligned} \frac{d}{dt} \begin{bmatrix} V_{1,1} \\ V_{1,2} \\ \vdots \\ V_{m,n} \\ \end{bmatrix} = \begin{bmatrix} A_{1,1} &{} A_{1,2} &{} 0 &{} \cdots &{} 0\\ A_{2,1} &{} A_{2,2} &{} A_{2,3} &{} \cdots &{} 0\\ 0 &{} A_{3,2} &{} A_{3,3} &{} \cdots &{} 0\\ \vdots &{} &{} &{} &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} A_{m,m}\\ \end{bmatrix} \cdot \begin{bmatrix} V_{1,1} \\ V_{1,2} \\ \vdots \\ V_{m,n} \\ \end{bmatrix} +B \end{aligned}$$
(15)

where \(A_{i,j}\) are \(n \times n\) submatrices and B is a column vector with \(m \cdot n\) entries containing the independent terms. The submatrices \(A_{r,r}\) \((r=2,\ldots ,m-1)\) on the diagonal of A are defined as:

$$\begin{aligned} A_{r,r}= \begin{bmatrix} -3\big (V_{GT}-V_{r,1}(0)\big ) &{} 0 &{} \cdots &{} 0\\ V_{GT}-V_{r,1}(0) &{} -4\big (V_{GT}-V_{r,2}(0)\big ) &{} \cdots &{} 0\\ 0 &{} V_{GT}-V_{r,2}(0) &{} \cdots &{} 0\\ \vdots &{} &{} &{} \vdots \\ 0 &{} 0 &{} \cdots &{} V_{GT}-V_{r,n-1}(0)\\ 0 &{} 0 &{} \cdots &{} -3\big (V_{GT}-V_{r,n}(0)\big )\\ \end{bmatrix} \end{aligned}$$
(16)

Submatrices \(A_{r,r}\) \((r=1,m)\) refer to pixels placed in the first and last rows of the image sensor. In that case the matrices are similar to the previous one, but with the constants \(-3\) and \(-4\) in the main diagonal replaced by \(-2\) and \(-3\), respectively. The submatrices \(A_{r,r\pm 1}\) are diagonal matrices defined as:

$$\begin{aligned} A_{r,r\pm 1}= \begin{bmatrix} V_{GT}-V_{r\pm 1,1}(0) &{} 0 &{} \cdots &{} 0 \\ 0 &{} V_{GT}-V_{r\pm 1,2}(0) &{} \cdots &{} 0 \\ \vdots &{} &{} &{} \vdots \\ 0 &{} 0 &{} \cdots &{} V_{GT}-V_{r\pm 1,n}(0)\\ \end{bmatrix} \end{aligned}$$
(17)

Finally, the column vector B, which includes the current linearisation values \(V_{u,v}(0)\), is obtained as:

$$\begin{aligned} B=\begin{bmatrix} V_{1,1}(0)^2+0.5\big (V_{1,2}(0)^2+V_{2,1}(0)^2\big ) \\ 1.5V_{1,2}(0)^2+0.5\big (V_{1,1}(0)^2+V_{1,3}(0)^2+V_{2,2}(0)^2\big ) \\ 1.5V_{1,3}(0)^2+0.5\big (V_{1,2}(0)^2+V_{1,4}(0)^2+V_{2,3}(0)^2\big ) \\ \vdots \\ V_{m,n}(0)^2+0.5\big (V_{m-1,n}(0)^2+V_{m,n-1}(0)^2\big ) \\ \end{bmatrix} \end{aligned}$$
(18)
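In the parallel implementation, the matrices (16) to (18) need not be assembled explicitly: since the derivative of each pixel voltage in (14) involves only the pixel itself and its four neighbours, one GPU thread per pixel can evaluate the right-hand side directly. A sketch of such a kernel follows (names are ours; edge pixels simply skip the missing neighbour terms, which reproduces the \(-3\) and \(-2\) diagonal factors of (16) and the corner entries of (18)):

```cuda
// One thread per pixel: evaluates the linearised right-hand side (14)
// without assembling (15). V0 holds the linearisation points V(0);
// KRC = K_R * C. Follows the sign convention of (14).
__global__ void mosc_rhs(int m, int n, double VGT, double KRC,
                         const double* V, const double* V0, double* dV) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    if (i >= m || j >= n) return;
    int p = i * n + j;
    int nb[4] = { (j + 1 < n) ? p + 1 : -1, (j > 0) ? p - 1 : -1,
                  (i + 1 < m) ? p + n : -1, (i > 0) ? p - n : -1 };
    double acc = 0.0;
    int deg = 0;                   // number of existing neighbours
    for (int k = 0; k < 4; ++k) {
        if (nb[k] < 0) continue;   // edge pixel: neighbour missing
        ++deg;
        acc += (VGT - V0[nb[k]]) * V[nb[k]] + 0.5 * V0[nb[k]] * V0[nb[k]];
    }
    // -deg*(VGT - V(0))*V is the own-node linear term and
    // -0.5*deg*V(0)^2 its constant term (-2*V(0)^2 when deg = 4).
    dV[p] = (-deg * (VGT - V0[p]) * V[p]
             - 0.5 * deg * V0[p] * V0[p] + acc) / KRC;
}
```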

The simulation technique proposed in this work has been applied to simulate the system described in equations (15) to (18) using the same GPU as in the previous RLC example. The image sizes used in this example are those of small CMOS smart image sensors, whose dimensions are restricted in order to include additional electronics close to the pixels. The simulations have been done modelling the capacitance of the CMOS sensor pixels as \(C=10\) pF and using the following device parameters: \(K=50\) \(\mu A/V^2\), \(W/L=1\), \(V_{th}=0.5\) V and \(V_G=3\) V. The processor times required for each transient simulation are detailed in Table 3, where they are compared with those required by Spectre and CUSPICE. It can be noted how the throughput of the three methods scales differently as the complexity of the problem increases. For the explicit method, when the image size grows from 8 \(\times \) 8 to an image approximately 300 times larger, the processing time increases by a factor of 2.27. However, for the implicit methods this time increases 128 times using the Spectre simulator and 104 times using CUSPICE. Thus, although CUSPICE and Spectre are faster for small images, for images larger than 64 \(\times \) 64 pixels the proposed explicit method offers shorter simulation times, and the speedup grows with the image size.

As an example of the image filtering simulation described above, Fig. 9 shows the result when a MOS conduction time of 25 ns is applied to an image extracted from the CIFAR-10 image dataset. The intensity value of each pixel is coded between 0 and 1 V. Compared with the simulation using CUSPICE, the explicit one presents an absolute root mean square error (RMSE) of \(1.96\cdot 10^{-3}\) units.

Table 3 GPU simulation time for proposed method and CUSPICE
Fig. 9
figure 9

a Original and b filtered images obtained through the simulation of the MOS-C image Gaussian filter applied to a 32 \(\times \) 32 pixels image for t=25 ns

4 Conclusion

This paper has presented a numerical method based on a variable-step explicit integration technique that speeds up transient simulations of complex analogue circuits on GPUs. The method has shown promising results when solving large VLSI interconnect models, whose inclusion in digital VLSI simulations is increasingly important as clock frequencies reach 10 GHz and beyond. Given that interconnect may constitute a large part of an analogue or mixed-signal VLSI system, this method can be a useful approach in the development of modern VLSI design tools. The method has also been used to simulate a MOS-C network aimed at performing Gaussian filtering on small and medium-size images, a common operation in computer vision applications. Given the strict requirements in terms of speed and power dissipation of embedded vision systems, mixed-signal smart vision chips combining processing in both the analogue and digital domains are expected to increase their market share. Thus, the proposed method can contribute to shortening the time to market of such systems. The results of these examples illustrate the potential of parallelised explicit integration methods to solve the vast numbers of equations representing passive analogue circuits.