1 Introduction

Recently, generative AI has pushed modern artificial intelligence (AI) to a new level, exemplified by the ChatGPT chatbot released by OpenAI, which combines reinforcement learning, natural language processing, and machine learning to improve its ability to understand and comprehensively respond to users’ needs (Salvagno et al. 2023; Khowaja et al. 2023). Generative AI can even assist academic research: the combination of large-scale diverse datasets and pre-trained transformers has emerged as a promising approach for single-cell multi-omics analysis (Cui et al. 2023). In the medical field, a trained medical AI model can flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, charts, or medical texts. The model in turn produces expressive output, such as text descriptions, verbal recommendations, or image annotations demonstrating advanced medical reasoning ability (Moor et al. 2023). At the same time, biomimetic organs based on modern AI technology, such as prostheses and biomimetic eyes, significantly improve the mobility and quality of life of patients with lower-limb amputations and blindness (Tran et al. 2022; Kim et al. 2022).

Modern AI systems have also made significant progress and breakthroughs in many intelligent fields, such as computer vision based on few-shot 3D point clouds (Ye et al. 2023), the Parallel Vision Actualization System (PVAS) used in autonomous vehicles (Wang et al. 2022), and Neural Architecture Search (NAS), which has received widespread attention for its ability to design deep neural networks automatically (Ye et al. 2022). However, the success of today’s AI systems cannot be achieved without the support of appropriate hardware, which in turn determines the availability and effectiveness of AI algorithms (Ivanov et al. 2022). Mapping complex algorithms to various hardware platforms is challenging, especially across embedded systems, mobile smartphones, and wearable devices. To achieve this, hardware-efficient algorithms and customised AI accelerators are needed. Some studies focus on designing hardware-friendly algorithms, such as quantization-aware training (QAT) of networks (Jacob et al. 2018) and pruning; other studies focus more on the design of hardware AI accelerators (Seo et al. 2022).

Previous results (Wermter et al. 2014; Zhang et al. 2015) show that 90\(\%\) of the computation during CNN inference comes from convolution, so this work focuses on accelerating convolutional operations. A convolution can be described as a multiply-and-accumulate (MAC) operation. To accelerate computation, convolutions are commonly converted into matrix products, as in the Caffe-style convolution algorithm (Chellapilla et al. 2006), which reshapes each input window into a one-dimensional vector of the same size as the convolutional kernel and then performs a vector MAC operation with the flattened kernel. The high-speed matrix-product capability of GPUs and FPGAs (Guo et al. 2017; Owens et al. 2008) dramatically accelerates the training and inference of neural networks.

However, current GPU- and FPGA-based implementations of CNNs rely on the sequential von Neumann architecture, in which computation and storage are separated. As the amount of computation grows, the demand for bandwidth increases, and so does the power consumed by data communication. The traditional von Neumann architecture is therefore insufficient for fast training and recognition tasks, such as real-time training and recognition of image and speech data. This challenge, known as the “von Neumann bottleneck,” reduces the overall efficiency and performance of the computing system (Von Neumann 1981).

At the same time, the emergence of new devices such as memristors and thyristors, with their in-memory computing (IMC) capability, high speed, low power consumption, and ease of integration, has shown great potential for implementing large-scale neural networks in hardware (Chen et al. 2021; Ebong and Mazumder 2011; Vourkas and Sirakoulis 2012). A one-transistor-one-memristor (1T1M) conductance regulation method is proposed in Merced-Grafals et al. (2016) and Afshari et al. (2022), and the literature (Zhang et al. 2018; Krestinskaya et al. 2018; Kwon et al. 2020; Owens et al. 2008) offers hardware implementations of backpropagation based on 1T1M arrays to train neural networks in-situ or ex-situ. Meanwhile, to overcome sneak-path currents in memristor crossbars (Linn et al. 2010; Ni et al. 2021), Dong et al. (2012) adopt a two-step adjustment method in the write operation; a better approach is to integrate CMOS switches with gate characteristics on the memristor (Li et al. 2018, 2022). A memristor-based CNN with a full hardware implementation was reported in Yao et al. (2020); the training speed of the network reaches a considerable 81.92 GOPS with an energy efficiency of 11,014 GOPS/W.

Although these reported systems realize CNNs with good performance, they do not change the limitations of the traditional convolution computing mode (Khan et al. 2020; Wan et al. 2022). For example, the convolution input size is limited by the kernel size, and the number of convolution operations increases significantly as the input matrix grows, because most convolution architectures must reshape the input matrix into input sequences of the same size as the kernel. The convolution result of each layer requires extra off-chip memory, and 1T1M-based memristor arrays cannot effectively suppress sneak paths on simultaneously selected rows or columns because of the series characteristics of the gate switches. Thus, the convolution calculation method for the memristor crossbar needs further optimization to maximize its parallel computing capability.

Motivated by this, we propose an optimized convolution algorithm for the memristor crossbar. The number of convolution operations in this algorithm is no longer affected by the image size, and it takes full advantage of the parallel computing of the memristor crossbar. The rest of this paper is organized as follows. Section 2 introduces typical convolution algorithms, including the GPU-oriented Caffe algorithm and a memristor-array-oriented algorithm, and then presents our full-size convolution algorithm (FSCA). Section 3 maps the FSCA to the memristor array and builds a circuit simulation system based on a physical memristor model for verification. Section 4 demonstrates the scalability and flexibility of the system at both the circuit and system levels, and Sect. 5 concludes the paper.

2 Full-size convolution algorithm

2.1 Basic convolution algorithm

Convolution accounts for a large proportion of the computation in deep convolutional neural networks, and accelerators based on GPU/FPGA and nonvolatile memory still face many challenges. Before introducing the designed full-size convolution algorithm, it is necessary to review the current mainstream deep-learning convolution algorithms on GPU/FPGA and nonvolatile memory.

Figure 1 shows the GPU calculation process based on the Caffe architecture (Chellapilla et al. 2006). First, the input data is reshaped into vectors of the size of the convolutional kernel and then multiplied and accumulated with the expanded convolutional kernel.

Fig. 1 Caffe architecture convolutional computation

Assume that the input matrix is an image of dimension \(m \times m\), with kernel size \(d \times d\) and sliding step size s; one MAC operation over the kernel-sized region is performed for each slide, so a single operation computes one \(d \times d\) kernel convolution. The computation is strongly influenced by the input size \(m \times m\), since each input can only be a window matching the kernel size \(d \times d\). This means the number of operations increases significantly as the input image grows, and the time and energy required to unfold the input data into a usable form cannot be ignored.
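This unfold-then-MAC procedure can be sketched in a few lines of NumPy (an illustrative sketch of the idea, not Caffe's actual implementation; the function name is ours):

```python
import numpy as np

def im2col_conv(A, G, s=1):
    """Caffe-style convolution: each sliding window is unrolled into a
    row vector and multiplied with the flattened kernel, so one vector
    MAC operation is needed per sliding-window position."""
    m, d = A.shape[0], G.shape[0]
    n = (m - d) // s + 1                 # output size per dimension
    cols = np.empty((n * n, d * d))
    for r in range(n):
        for c in range(n):
            cols[r * n + c] = A[r * s:r * s + d, c * s:c * s + d].ravel()
    # n*n vector MAC operations, one per window
    return (cols @ G.ravel()).reshape(n, n)
```

For an \(m \times m\) input, this scheme performs \(((m-d)/s+1)^2\) operations, which is why the cost grows quickly with the input size.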

Another convolution algorithm for memristor arrays, named convolution kernel first operated (CKFO), appears in the literature (Wen et al. 2020) and is shown in Fig. 2. For an \(m \times m\) input matrix and a \(d \times d\) kernel, the size of a CKFO sliding window is \((m-d+1) \times (m-d+1)\), and the total number of MAC operations is \(F=((d-1) / s+1) \times ((d-1) / s+1)\); a single operation computes one directional kernel convolution. The advantage of this algorithm is that the enlarged sliding window reduces the number of operations, and when a kernel element is 0, the corresponding matrix can be omitted.

Fig. 2 Diagram of the convolution kernel first operated (CKFO) algorithm

2.2 New convolution algorithm: full-size convolution algorithm

As the Caffe and CKFO diagrams in Figs. 1 and 2 show, convolution extracts matrices of the sliding-window size from the input matrix and multiplies them with the elements of the convolutional kernel; the total number of convolution operations equals the number of sliding-window moves across the input matrix. Inspired by the parallel computation of the memristor, we aim to increase the number of sliding windows extracted from the input matrix in a single operation and then rely on the parallelism of the memristor array to perform the computation.

Assume that the input matrix A is \(m \times m\)-dimensional image data, the size of the convolution kernel G is \(d \times d\), and the sliding step is s, satisfying \(m \ge d + s\); then the full-size convolution algorithm is given by Eq. (1):

$${\text{B}}(k,l) = \sum\limits_{i = 1}^{\left\lceil d/s \right\rceil } \sum\limits_{j = 1}^{\left\lceil d/s \right\rceil } \sum\limits_{k = 1 + (i - 1)s}^{m - d + 1} \sum\limits_{l = 1 + (j - 1)s}^{m - d + 1} {\text{A}}(k:k + d - 1,\; l:l + d - 1) *G.$$
(1)

where k and l increase in steps of \(d + s - 1\), \(\left\lceil \cdot \right\rceil\) is the ceiling function, and \(*\) denotes the convolution operation.

To better show the proposed FSCA algorithm flow, the input matrix of Figs. 1 and 2 is extended to \(5\times 5\), as shown in Fig. 3. First, at the initial position of vertical shift 0 (Vshift-0) and horizontal shift 0 (Hshift-0), the convolution kernel is tiled over the entire input matrix, and the results for all covered red dotted areas are computed simultaneously (a \(2\times 2\) kernel yields four parallel convolution areas on a \(5\times 5\) input matrix); then a horizontal shift (Hshift-1) is executed. After the horizontal shift, a vertical shift (Vshift-1) is performed, followed by the same convolution calculation and horizontal shift as in Vshift-0. The Caffe convolution algorithm takes 16 operations to complete this calculation, while our algorithm needs only four, and the larger the input matrix, the more significant the reduction in computation.

Fig. 3 FSCA algorithm convolution computation flow
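Equation (1) and the flow of Fig. 3 can be checked with a small NumPy sketch (stride \(s = 1\) for simplicity; in hardware, each pass of the inner double loop would be one parallel memristor-array read; function and variable names are ours):

```python
import numpy as np
from math import ceil

def fsca_conv(A, G, s=1):
    """Full-size convolution (Eq. 1), stride-1 sketch: for each of the
    ceil(d/s)^2 shift positions, all kernel tiles spaced d+s-1 apart
    are evaluated together, i.e. in one parallel array operation."""
    m, d = A.shape[0], G.shape[0]
    B = np.zeros((m - d + 1, m - d + 1))
    n_ops = 0
    for i in range(ceil(d / s)):              # vertical shifts (Vshift)
        for j in range(ceil(d / s)):          # horizontal shifts (Hshift)
            n_ops += 1                        # one parallel operation
            for k in range(i * s, m - d + 1, d + s - 1):
                for l in range(j * s, m - d + 1, d + s - 1):
                    B[k, l] = np.sum(A[k:k + d, l:l + d] * G)
    return B, n_ops
```

On the \(5\times 5\) input with a \(2\times 2\) kernel, `n_ops` is 4, against the 16 sliding-window operations of the Caffe scheme, while the output matches the window-by-window result exactly.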

In Table 1, we compare the designed FSCA with the Caffe and CKFO algorithms in terms of single-operation performance. For an \(m \times m\) input matrix with a \(d \times d\) convolution kernel, the FSCA algorithm has an obvious advantage in reducing the number of operations, and a single operation can compute \(\left\lceil (m - d + 1)/(d + s - 1) \right\rceil\) directional kernel convolutions.

Table 1 Performance of FSCA algorithm compared with other algorithms
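The operation counts behind Table 1 can be reproduced numerically (formulas transcribed from the text; integer strides assumed; function names are ours):

```python
from math import ceil

def caffe_ops(m, d, s=1):
    """Sliding-window scheme: one MAC operation per window position."""
    return ((m - d) // s + 1) ** 2

def ckfo_ops(d, s=1):
    """CKFO: F = ((d-1)/s + 1)^2 operations (Wen et al. 2020)."""
    return ((d - 1) // s + 1) ** 2

def fsca_ops(d, s=1):
    """FSCA: ceil(d/s)^2 operations, independent of the input size m."""
    return ceil(d / s) ** 2

# Growth with input size: only the Caffe count depends on m
for m, d in [(5, 2), (14, 3), (224, 3)]:
    print(m, d, caffe_ops(m, d), ckfo_ops(d), fsca_ops(d))
```

For the \(5\times 5\)/\(2\times 2\) example this gives 16 versus 4 operations, and for the \(14\times 14\)/\(3\times 3\) case of Sect. 4 it gives 144 versus 9.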

3 FSCA circuit system based on memristor crossbar

3.1 Memristor array

The memristor crossbar has been verified as an effective hardware structure to accelerate convolution. At present, crossbar design is developing towards high integration (Qin et al. 2020); however, as the array size increases, the sneak-path issue cannot be ignored (Linn et al. 2010; Ni et al. 2021). Leakage current is generated at positions adjacent to the selected cell, and the closer a cell is to the selected one, the larger the leakage current; if it exceeds a certain threshold, the output result is affected, as shown in Fig. 4a. One solution is to add a transistor to each memristor to form a 1T1M architecture that enables read/write only on a particular row. The transistor acts as a selector switch, as shown in Fig. 4b, and leakage current is generated only in the selected rows, reducing its impact on the output result.

Fig. 4 a Leakage current flow in a memristor array. b Leakage current in a 1T1M array

To further reduce the leakage current on selected rows or columns, a crossbar structure based on one transistor and three memristors (1T3M) is designed, as shown in Fig. 6. Each cross point consists of one transistor and three memristors, as shown in Fig. 5a. The red memristor M1 is used for calculation and storage, while the two blue memristors M2, together with the NMOS device, form a functional device resembling a bidirectional pulse-triggered switch; M1 and M2 have different device parameters. To achieve high cell density, we model the cell area of MOS-accessed cells according to DRAM design rules [24], and the area of a MOS-accessed cell is calculated as Eq. (2):

$$Area_{{cell,MOS{\text{-}}accessed}} = 3(W/L + 1)(F^{2} )$$
(2)

where F is the transistor feature size and W/L is the width-to-length ratio of the NMOS. As shown in Fig. 5b, we set the size of the unit cell to be the same as that of a transistor (10 \(F^2\)) by integrating the two memristors monolithically on top of the transistor, so the integration does not increase the area in the two-dimensional plane. In the vertical direction, the two additional switching memristors (M2) are stacked above the memristor (M1) used for computation, so there is no area increase compared to the 1T1M structure (Fig. 6).

Fig. 5 a 1T3M unit structure. b 1T3M schematic diagram

Fig. 6 1T3M-based memristor crossbar; current flows only along the solid red line through the yellow selected cell. (Color figure online)

3.2 Bidirectional pulse trigger switch

We use two anti-series memristors, M2, to form a voltage-divider circuit that controls the NMOS switch. Using two memristors maintains the current switch state until the next opposite-polarity control pulse arrives and changes the switch state.

For example, assume that the NMOS in Fig. 5a is currently in the ON state, i.e., \(\text{M2}_{Left}=R_{OFF}\) and \(\text{M2}_{Right}=R_{ON}\) (\({R_{OFF}}\) and \({R_{ON}}\) denote the memristor’s high and low resistance states, respectively), and a constant 2 V gate voltage is applied to the Vs terminal. If there is an input signal at Vin, current flows through the memristor, as shown in Fig. 7. When a − 10 V, 50 ns switch-off pulse is applied to the Vs side, the NMOS enters the OFF state: even if there is an input signal at the Vin side, no current flows through the memristor. After a 10 V, 50 ns switch-on pulse is applied to the Vs side again, the NMOS turns on, and current flows through the memristor when an input pulse arrives at the Vin side. The state changes of the gate voltage, memristors, and NMOS are shown in Table 2.

Fig. 7 Bidirectional pulse-triggering characteristics of 1T3M. a Switch-on and switch-off trigger signals applied to the Vs side. b The input signal applied to the Vin terminal. c Current flowing through the memristor

Table 2 NMOS ON and OFF states change with two memristors

Here we discuss the parameters of the two selected memristors. In the calculation process, M1 is connected in series with the transistor, so we choose a high-resistance memristor to reduce the influence of the transistor on the calculation results. A high \(R_{OFF}/R_{ON}\) ratio also helps minimize the impact of leakage current and gives the array output a wider observable voltage range. We use the Voltage-controlled ThrEshold Adaptive Memristor (VTEAM) model (Kvatinsky et al. 2015) realized in SPICE (Radakovits et al. 2020; TaheriNejad et al. 2019); the differential equation for the state variable of the memristor is:

$$\begin{aligned} \frac{dw(t)}{dt} = {\left\{ \begin{array}{ll} k_{off}\cdot \left( \frac{v(t)}{v_{off}}-1\right) ^{\alpha _{off}}\cdot f_{off}(w),&{}0<v_{off}<v \\ 0,&{}v_{on}<v<v_{off}\\ k_{on}\cdot \left( \frac{v(t)}{v_{on}}-1\right) ^{\alpha _{on}}\cdot f_{on}(w),&{}v<v_{on}<0\\ \end{array}\right. } \end{aligned}$$
(3)

where \(k_{off}\), \(k_{on}\), \(\alpha _{off}\), and \(\alpha _{on}\) are constants, \(f_{off}(w)\) and \(f_{on}(w)\) are window functions that constrain the state variable to the bounds \(w\in \left[ w_{on},w_{off}\right]\), and \(v_{on}\) and \(v_{off}\) are threshold voltages. Memristor M1 uses the memristor model proposed in TaheriNejad and Radakovits (2019), with parameters shown in Table 3; this model is capable of sub-nanosecond response. Memristor M2 is chosen to provide fast response and, through the series voltage divider, to keep the NMOS in the ON or OFF state. Analyzing the circuit characteristics of the 1T3M structure in Fig. 5a, the voltage-division formula gives:

$$\begin{aligned} {\left\{ \begin{array}{ll} V_{gate} &{}= \dfrac{\text{M2}_{Left}}{\text{M2}_{Left} + \text{M2}_{Right}}\,{\text{Vs}} \\ v_{th} &{}\gg \text{Vs}\cdot \dfrac{R_{ON}}{R_{OFF}+R_{ON}},\quad {\text{Vs}}=2 \\ v_{th} &{}< \text{Vs}\cdot \dfrac{R_{OFF}}{R_{OFF}+R_{ON}},\quad {\text{Vs}}=2\\ \end{array}\right. } \end{aligned}$$
(4)

where \(v_{th}\) denotes the NMOS threshold voltage. When \(V_{gate}\) is much smaller than \(v_{th}\), the switch is in the OFF state; when \(V_{gate}\) is greater than \(v_{th}\), the switch is in the ON state. This requires an appropriate \(R_{OFF}/R_{ON}\) switching ratio, usually greater than 10, to satisfy the conditions.

Table 3 Parameters of two kinds of memristor

The parameter settings of M2 match the characteristic data of a physical memristor model published in Chanthbouala et al. (2012), shown in Table 3. In practice, any memristor device that satisfies the requirements outlined in Eqs. (4) and (5) can meet the specified conditions. For instance, for the 1T3M pulse-triggered switch depicted in Fig. 5a, assuming a trigger voltage of \(\pm 5\,\text{V}\), M2 threshold voltages of \(v_{off} =2\,\text{V}\) and \(v_{on} =-2\,\text{V}\), and resistances \(R_{OFF}=10\,\text{k}\Omega\) and \(R_{ON}=1\,\text{k}\Omega\), the conditions for the pulse-controlled switch are satisfied.
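The divider conditions of Eq. (4) can be sanity-checked numerically with the example resistances above (the NMOS threshold of 0.5 V below is an assumed illustrative value, not a device parameter from the paper):

```python
def gate_voltage(m2_left, m2_right, vs):
    """Voltage divider of Eq. (4): the two series M2 memristors set V_gate."""
    return vs * m2_left / (m2_left + m2_right)

R_ON, R_OFF = 1e3, 10e3     # 1 kOhm / 10 kOhm, from the example above
VS, V_TH = 2.0, 0.5         # 2 V read bias; assumed NMOS threshold voltage

v_on_state = gate_voltage(R_OFF, R_ON, VS)    # M2_Left = R_OFF -> switch ON
v_off_state = gate_voltage(R_ON, R_OFF, VS)   # M2_Left = R_ON  -> switch OFF
assert v_on_state > V_TH > v_off_state        # about 1.82 V vs about 0.18 V
```

With a switching ratio of 10, the two gate voltages differ by a factor of 10, comfortably straddling any reasonable NMOS threshold, which is the sense of the ratio-greater-than-10 condition above.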

To elaborate, when the switch needs to transition from the initial ON state to the OFF state, referring to Eq. (5), as long as \(\mathrm {V_{S} }< v_{on} < 0\), the memristor \(\text{M2} _{Left}\) rapidly shifts from \(R_{OFF}\) to \(R_{ON}\), while \(\text{M2} _{Right}\) is initially in \(R_{ON}\). Subsequently, based on the equation \(\mathrm{{V}}_{gate}= {{\text{Vs}}} \cdot \text{M2}_{Left}/(\text{M2}_{Left}+ \text{M2} _{Right} )\), the condition \({\mathrm{{V}}_{gate}} = |\text{Vs}/2 |> v_{off}\) is met with Vs = − 5 V. Consequently, \(\text{M2} _{Right}\) transitions from \(R_{ON}\) to \(R_{OFF}\), leaving the switch in the OFF state: when the conducting voltage of 2 V is reapplied to Vs, \({V_{gate}}\mathrm{{ }} < \mathrm{{ }}{v_{th}}\) and the NMOS remains in the cutoff state. A similar analysis applies when the switch transitions from OFF to ON.

$$\begin{aligned} {\left\{ \begin{array}{ll} \text{when: Switch (OFF}\longrightarrow \text{ON)} \\ \dfrac{\text{M2}_{Right}}{\text{M2}_{Left} + \text{M2}_{Right}}\,|{\text{Vs}}|>v_{off},\quad {\text{Vs}}< v_{on}<0\\ \text{when: Switch (ON}\longrightarrow \text{OFF)} \\ \dfrac{\text{M2}_{Left}}{\text{M2}_{Left} + \text{M2}_{Right}}\,\text{Vs}>v_{on},\quad 0<-v_{on}<{\text{Vs}}\\ \end{array}\right. } \end{aligned}$$
(5)


For the leakage-current analysis of the 1T3M array, assume that all NMOS devices are initialized to the OFF state and that we want to select one cell after row selection (the yellow area in Fig. 6). First, a switch-on voltage is applied to the row address selector of the selected cell, with its column address selector grounded and the other rows and columns floating; the NMOS of the selected cell then switches to the ON state while the NMOS of all other cells remains OFF. During reading or writing, the 2 V gate voltage is applied to the row line of the selected cell so that current flows only through the selected cell; in Fig. 6, the solid red line indicates the current path. Other cells carry no current, which further overcomes the leakage-current problem in large-scale memristor arrays. We also considered the effect of the two added memristors on control complexity: because these two memristors are in series, only one 50 ns setup signal at the Vs terminal is needed to set them to the desired state (ON or OFF), and no additional control signal is needed during calculation. As with the 1T1M structure, only a constant voltage applied to the Vs terminal is needed to keep the NMOS on, with the other terminal grounded.

3.3 Algorithm mapping to memristor array

The hardware scale of current memristor-based chip integration remains limited, so we design an easily scalable circuit to increase the integration scale. We build the test circuit in LTspice and package the 1T3M cells, represented by rectangular blocks, leaving only four pins for connection. The convolution kernel size \(d \times d\) determines the number of columns of the basic array cell, and the number of rows is determined by the dimension of the data to be processed: an input matrix of n rows and m columns is tiled by basic array cells of n rows and d columns, and the number of basic array cells needed is m. Using the \(5 \times 5\) input and \(2 \times 2\) kernel of Fig. 3 as an example, the basic array cell dimension is \(5 \times 2\), and five basic array cells are required to cover the \(5 \times 5\) input matrix. The structure of the initialized array is shown in Fig. 8.
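The tiling rule can be stated as a one-line helper (a hypothetical name, mirroring the rule in the text):

```python
def tile_crossbar(n, m, d):
    """An n x m input matrix is covered by m basic array cells,
    each of n rows and d columns (d = kernel size)."""
    return (n, d), m   # (basic cell shape, number of cells)

cell, count = tile_crossbar(5, 5, 2)   # Fig. 3 example: 5x5 input, 2x2 kernel
assert cell == (5, 2) and count == 5
```

The same rule reproduces the Sect. 4 configuration, where a \(14 \times 14\) input with a \(3 \times 3\) kernel needs fourteen \(14 \times 3\) cells.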

The circuit implementation flow is divided into three main steps. The first step is the circuit initialization configuration, whose purpose is to allow all memristors to be set by closing all cascade switches vup and the line switch vset. The setting method is to select the first column of every array cell with the column selector and then apply − 10 V (50 ns) at the row-selector end of the non-computing cells, which switches the NMOS of all non-computing cells in the first column of each array cell to the OFF state; the same operation is then performed on the remaining columns of the array cells. Because the number of columns of an array cell equals the kernel size d (here a \(2 \times 2\) kernel), only two setting passes are needed. The switches in Fig. 8 are implemented with NMOS; gray fill means the switch is closed, and no fill means it is open.

Fig. 8 a Two rows of memristor arrays separately store W+ and \(\vert \mathrm W-\vert\). b Fully parallel convolution region; the solid red squares I are the first convolution calculation area, and the dashed blue squares II are the second convolution area after moving to the right. (Color figure online)

The second step is to store the input matrix and kernel matrix by converting them to conductances as follows. For example, the mapping of the input matrix A to conductances can be performed by a linear transformation (Li et al. 2018):

$$\begin{aligned} G_{\mathrm A}=\alpha \cdot \mathrm A\;+\;\beta \cdot I \end{aligned}$$
(6)

where I is the matrix of ones, and the coefficients of the transformation are determined by:

$$\begin{aligned} \begin{array}{l} \alpha = \dfrac{{G_{\max }} - {G_{\min }}}{\max (\mathrm{{A}}) - \min (\mathrm{{A}})}\\ \beta = {G_{\min }} - \alpha \cdot \min (\mathrm{{A}}) \end{array} \end{aligned}$$
(7)

The convolution calculation results can be recovered from the crossbar measurements by:

$$\begin{aligned} \begin{array}{l} y = \mu {\alpha ^{ - 1}}\sum {{v_{in}}} {G_{\mathrm{{A}}}} - \mu {\alpha ^{ - 1}}\beta \sum {{v_{in}}} I \end{array} \end{aligned}$$
(8)

where \({v_{in}}\) is the voltage form of the input kernel matrix, \(\mu\) is the scale factor matching the input voltage range, and I is the vector of ones. The second term of the equation contains the sum of all elements of the input voltage and can be easily post-processed by hardware or software.
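Equations (6)-(8) amount to a linear map and its inverse, which can be verified in a few lines (a NumPy sketch; the conductance bounds are illustrative values, and function names are ours):

```python
import numpy as np

def map_to_conductance(A, g_min, g_max):
    """Eqs. (6)-(7): linear map of matrix values onto [g_min, g_max]."""
    alpha = (g_max - g_min) / (A.max() - A.min())
    beta = g_min - alpha * A.min()
    return alpha * A + beta, alpha, beta

def recover(v_in, G_A, alpha, beta, mu=1.0):
    """Eq. (8): undo the mapping from the crossbar current readings."""
    return mu / alpha * np.sum(v_in * G_A) - mu / alpha * beta * np.sum(v_in)

A = np.array([[0., 50.], [100., 200.]])   # one kernel-sized input window
w = np.array([[1., 0.], [-1., 2.]])       # kernel applied as input voltages
G_A, alpha, beta = map_to_conductance(A, 1e-6, 1e-4)
# Recovery reproduces the ideal MAC result sum(w * A) exactly
assert np.isclose(recover(w, G_A, alpha, beta), np.sum(w * A))
```

Substituting \(G_A = \alpha A + \beta I\) into Eq. (8) with \(\mu = 1\) cancels the \(\beta\) terms and leaves exactly \(\sum v_{in} A\), which is why the second term can be post-processed so cheaply.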

The third step is the calculation itself. Before calculating, we select the kernel convolution region for the operation. Figure 8 shows the regions chosen by the first two operations: the four solid red squares I form the area of the first convolution calculation, and the four dashed blue squares II form the area of the second convolution calculation after moving to the right.

Now we perform the calculation for the selected area I. A reading voltage with an amplitude of 0.2 V, a pulse width of 50 ns, and a 50% duty cycle is applied to the W+ input side, and each column output is connected to op-amp circuits as shown in Fig. 9. The amplifier is an LTC6268-10 with a gain bandwidth of 4 GHz and a delay of 3.5 ns; the output after the two op-amps is limited to less than the threshold voltage of memristor M1, giving an output voltage range of [0, 0.6 V].

Fig. 9 A selected kernel convolution region

The kernel elements W+ then enter the corresponding kernel convolution region as voltages. Next, the NMOS switch vup1 connecting the two cell arrays is closed, the v-lin4 selector switch conducts the current in both selected rows, and the Output selector switch connects the calculated results to the output circuit. If an inverting amplifier circuit is connected to the output, as in Fig. 10a, the output can be expressed as:

$$\begin{aligned} \mathrm{{V1}} = - {R_{\mathrm{{f}}1}}(v1\cdot G1 + v2\cdot G3 + v3\cdot G2 + v4\cdot G4) \end{aligned}$$
(9)

We use an integrator circuit to hold the current output; note that if the current reading is W+, the single-pole double-throw (SPDT) switch CL1 is thrown to terminal a. The output can then be expressed as:

$$\begin{aligned} v_{out}=-\left( R_{\mathrm f2}\cdot C\right) ^{-1}\int \mathrm V1dt \end{aligned}$$
(10)

We set the integration time to be the same as the duration of the input signal and make the integration result linear with the output signal by adjusting the RC parameters.
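For a constant V1 held over the pulse duration T, Eq. (10) reduces to \(v_{out} = -V1 \cdot T/(R_{f2}C)\), so choosing \(R_{f2}C = T\) gives unity gain (the component values below are illustrative, not those of the actual circuit):

```python
T = 50e-9                       # integration time = input pulse width (50 ns)
R_f2, C = 10e3, 5e-12           # example choice giving R_f2 * C = 50 ns = T
V1 = -0.3                       # op-amp output held constant during the pulse
v_out = -V1 * T / (R_f2 * C)    # Eq. (10) with a constant integrand
assert abs(v_out - 0.3) < 1e-12
```

Any RC product proportional to T keeps the integrator output linear in V1, which is the adjustment the text describes.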

After completing the W+ read and calculation, the same read operation is executed on the \(\vert \mathrm W-\vert\) side; the differences are that the read voltage is delayed by 100 ns, the SPDT switch CL1 is thrown to terminal b, and the output V1 is inverted, so the result is superimposed on the previous W+ result by the RC integration circuit, as shown in Fig. 10b. The delay introduced by the output circuit is about 2.6 ns.

Fig. 10 Output circuit: a output circuit with two op-amps, b the integrator accumulates positive and negative convolution-kernel-element calculation results

With the superimposed output of kernels W+ and \(\vert \mathrm W-\vert\), one kernel operation is complete, and the convolution of the four solid red square regions I in Fig. 9 is finished. Next, following the algorithm in Eq. (1), the remaining convolution regions are obtained by moving and calculating; due to the parallel nature of the algorithm, the number of such move operations is minimal: in this case, only four operations.

4 Simulation results

4.1 Simulation framework

To simulate the designed FSCA algorithm and the memristor-crossbar-based convolution circuit, a hierarchical simulation framework is used. The circuit-level simulation is performed in the open EDA software LTspice, which provides easy access to ADI’s commercial device libraries, such as the LTC6268 op-amp and NMOS devices compatible with the standard TSMC 0.18-μm CMOS process, and allows us to build our own library of standard memristor devices. We build SPICE models of voltage-threshold memristors that match the characteristic parameters of the physical memristor model. The circuit-level simulation also generates the behavioral models, power consumption, and delay parameters needed for system-level simulation and analysis. We use MATLAB numerical simulation to build the behavioral model of the memristor and verify the flexibility and scalability of the memristor-crossbar-based FSCA.

4.2 Circuit-level simulation

The circuit system verification proceeds as follows. Convolution is commonly used for feature extraction, so a \(14 \times 14\) grayscale image is used as the input matrix for full-size convolution to extract edges. The data are two images randomly selected from the Modified National Institute of Standards and Technology (MNIST) database; each original \(28 \times 28\) image is compressed to \(14 \times 14\), and edge extraction is performed with a \(3 \times 3\) Sobel convolution kernel.

Before storing the pixels, the programming voltage and conductance of the memristor must be fitted; see the method in Merced-Grafals et al. (2016). Because the original MNIST data are 8-bit grayscale images, the memristor used to store pixels also needs 8 bits of stable storage states, so the memristor is programmed with voltages of different amplitudes at the same 50 ns pulse width. To simplify the process, the positive and negative values in the Sobel operator are replaced by voltage sources below the threshold voltage of the memristor. The basic expansion unit used here is \(14 \times 3\); 14 expansion units are required to achieve full-size convolution of the input matrix, and we build a \(14 \times 54\) array, as shown in Fig. 11.

Fig. 11 \(14 \times 54\) array made up of fourteen array units

The control signal is simulated using a timing pulse signal source, which can be implemented in an actual circuit with an FPGA or microcontroller. A timing diagram of some of the control signals is shown in Fig. 12: the left part of the red dashed line is the storage stage and the right part is the calculation stage. In the storage stage, the uniform pulse width is 50 ns and the interval between two adjacent pulses is 5 ns. The input pixel matrix in Fig. 11 is stored in the cross array column by column from left to right, so storing a \(14 \times 14\) matrix takes \(14 \times 55\) ns. The control signals V(clin1) and V(clin4) are high during the storage stage, enabling data to be stored in the first and fourth columns, while V(vset1) and V(vset4) allow the set operation to be performed on the memristors. V(vup1) and V(vup4) enable the extended array cells to be linked to each other. V(vin13), V(vin14), and V(vin15) are the set voltages corresponding to the pixel values stored in the cross array.

Fig. 12
figure 12

Input signal and control signal timing diagram: a The control signals V(clin1), V(clin4), V(vset1) and V(vset4) are high during the storage stage. b V(vin13), V(vin14), and V(vin15) are the set voltage values corresponding to the pixel values stored to the crossbar array. c V(vup1) and V(vup4) enable the extended array cells to be linked to each other. d The corresponding output results of the selected calculation area. (Color figure online)

In the calculation phase, to the right of the red dashed line, the Sobel operator is first converted to voltages as the input, as shown in the partially enlarged view in the right half of Fig. 12b. V(clin) is kept high and applied to the NMOS in the four-pin device to hold the state of whether the device participates in the calculation, and the V(line) switch controls the current flow between different columns. V(vup1) and V(vup4) are opened or closed as the kernel moves right, interconnecting the expansion cells in the computational order. Figure 12d shows the computational results of each output cell at different moments of the calculation stage. We import the calculation results into MATLAB for display.

The simulation results are shown in Fig. 13; we also performed Sobel edge extraction in MATLAB and compared it with the circuit simulation results at the same gradient threshold. Using the FSCA algorithm, the number of convolution operations for a \(14 \times 14\) input is \(\left\lceil {d/s} \right\rceil \times \left\lceil {d/s} \right\rceil = 9\). Because the Sobel operator has feature extraction templates in two directions, the actual computation time is \(9 \times 2 \times 100\,\text{ns}\); adding the storage time of \(14 \times 55\,\text{ns}\), the total computation time for a \(14 \times 14\) input matrix is about 2.6 μs. We use the power consumption calculation function of the LTspice software to estimate the average power consumption of each functional device. The average power consumption of a single NMOS switch is 700 pW, that of the two M2 memristors in a 1T3M structure is 79.23 nW, and that of an M1 memristor is 244.45 nW.
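The timing arithmetic above can be checked directly; the stride value below is an assumption chosen only so that \(\lceil d/s \rceil = 3\), as given in the text.

```python
import math

d, s = 14, 5                     # input size; stride assumed so ceil(d/s) = 3
n_conv = math.ceil(d / s) ** 2   # 9 parallel convolution positions
t_calc_ns = n_conv * 2 * 100     # two Sobel templates, 100 ns per operation
t_store_ns = 14 * 55             # 14 columns, 50 ns pulse + 5 ns interval
t_total_us = (t_calc_ns + t_store_ns) / 1000.0
print(t_total_us)                # 2.57, quoted in the text as ~2.6 us
```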

Fig. 13
figure 13

Handwritten digit edge extraction. a Original image. b Circuit simulation result. c Software simulation result

4.3 System-level simulation

The memristor behavioral model is used to simulate FSCA-algorithm-based image processing. First, we verify the scalability and computational capability of the system. Assuming the system is scaled to a \(768 \times 256\) array, it can process a \(256 \times 256\) image. We take the Lena grayscale image as an example and extract its horizontal and vertical directional gradients. The Lena image is stored in a memristor array using the memristor behavioral model from the previous section.

The Sobel operator is converted to below-threshold voltages and fed into the crossbar array. It is worth noting that the positive and negative parts of the Sobel operator are calculated separately: both are converted to absolute-value voltages for input, and the computed results of the negative part are inverted and added to the positive results. The weight and pixel mapping scheme is adopted from the literature (Li et al. 2018). We superimpose the horizontal and vertical features and show the results after different numbers of operations, as in Fig. 14. The figure shows that only 9 parallel convolution operations are needed for edge extraction of the Lena image by the FSCA algorithm, which greatly improves the efficiency of image feature extraction. From the average power consumption of the memristors and transistors during circuit calculation in the previous section, we estimate that the power consumption of a \(768 \times 256\) memristor crossbar processing one picture is 21.69 mW. Assuming the time required for one convolution operation is the same as in the memristor crossbar circuit of the previous section, our convolutional computation system can reach a peak performance of 1.23 tera operations per second (TOPS).
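The positive/negative decomposition described above can be sketched on the host side: the kernel is split into two non-negative templates, and the result of the negative pass is subtracted from (i.e., inverted and added to) the positive pass. The plain reference convolution below is for illustration only, not a model of the crossbar itself.

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain 'valid' 2-D correlation, as a host-side reference."""
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# The crossbar accepts only absolute-value voltages, so the Sobel kernel
# is split into its positive and negative parts; the negative pass is
# inverted and added back to the positive pass.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
k_pos = np.maximum(sobel_x, 0.0)
k_neg = np.maximum(-sobel_x, 0.0)

def split_conv(img):
    return conv2d_valid(img, k_pos) - conv2d_valid(img, k_neg)
```

The two passes reproduce the signed convolution exactly, which is why the decomposition costs only one extra pass rather than any accuracy.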

Fig. 14
figure 14

Lena image Sobel horizontal and vertical gradient extraction after 3, 6, and 9 operations, respectively

In machine vision, Gabor features can simulate the human visual response well and are widely used in image processing. Some methods use Gabor filters to extract features that are then fed into a CNN, while others simply use Gabor filters instead of convolutional kernels (Yuan et al. 2022). Data preprocessing plays a crucial role in handling noise, outliers, and duplicate values, ultimately improving the speed and predictive performance of the model. Another work from our team developed a memristor-based neuromorphic vision system that eliminates the need for complex data conversion (Zhou et al. 2023); it enables direct optoelectronic sensing and image preprocessing on memristor arrays. In this context, we have designed a novel FSCA that is particularly well-suited for integration into vision processing devices based on image sensors. We have already validated the significant advantages of our algorithm in terms of hardware resource utilization through circuit-level simulations; here, we verify its potential for feature extraction and neural network applications at the system level.

The Gabor filter is defined as the product of a Gaussian function and a sinusoidal function, and its expression is (Grigorescu et al. 2003):

$$g\left( {x,y;\lambda ,\theta ,\psi ,\sigma ,\gamma } \right) = \exp \left( { - \frac{{x^{\prime2} + \gamma ^{2} y^{\prime2} }}{{2\sigma ^{2} }}} \right)\exp \left( {i\left( {2\pi \frac{{x^{\prime}}}{\lambda } + \psi } \right)} \right)$$
(11)

where \(x' = x\cos \theta + y\sin \theta\) and \(y' = - x\sin \theta + y\cos \theta\); \(x\) and \(y\) denote the horizontal and vertical coordinates of a pixel in the image, respectively, \(\theta\) is the orientation parameter, and \(\lambda\) is the wavelength, measured in pixels. In addition, \(\psi\) is the phase shift, \(\sigma\) is the standard deviation of the Gaussian envelope, and \(\gamma\) denotes the spatial aspect ratio, which determines the ellipticity of the receptive field.

Although Gabor filtering with various wavelengths and orientations can extract different features, our main purpose here is to verify the FSCA algorithm, so Gabor filters with limited wavelengths and fixed orientations are used to extract features. We refer to the method of converting the Gabor filter into a convolution calculation. Gabor filtering is performed at wavelengths of 0.9, 1.2, and 1.5 with the 0° and 90° orientation parameters, and the other parameters are set to typical values (e.g., \(\gamma = 1\)); these are used to extract the horizontal and vertical features of the image at different wavelengths. The memristors used to store the input pixel matrix are again simulated as 8-bit stable-state memristors.
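A minimal sketch of generating these templates from Eq. (11) follows (real part only, since the filtered output is real-valued). The sampling grid size and the default `sigma`/`psi` values are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def gabor_kernel(lam, theta, psi=0.0, sigma=1.0, gamma=1.0, size=3):
    """Real part of the Gabor filter of Eq. (11), sampled on a size x size
    grid; grid size and default sigma/psi are illustrative assumptions."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)        # x' rotation
    yp = -x * np.sin(theta) + y * np.cos(theta)       # y' rotation
    envelope = np.exp(-(xp ** 2 + gamma ** 2 * yp ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xp / lam + psi)

# Horizontal and vertical templates at the three wavelengths in the text.
kernels = [gabor_kernel(lam, theta)
           for lam in (0.9, 1.2, 1.5)
           for theta in (0.0, np.pi / 2)]
```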

Next, we validate the application of the designed algorithm in deep neural networks for enhanced MNIST dataset recognition. First, we build a multilayer neural network with structure 784-100-10, and the MNIST dataset is divided into three parts: 50,000 training images, 10,000 validation images, and 10,000 test images. The training algorithm is mini-batch Stochastic Gradient Descent (mini-batch SGD) with a batch size of 200; without other special processing, the recognition accuracy on the 10,000-image test set is 97.52%. Next, we add a Gabor filter based on the FSCA algorithm before the first layer of the network to extract features from the input image and enhance the network performance. Figure 15 illustrates the concept of connecting the output of the Gabor filter to a fully connected neural network using memristor arrays.
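A minimal NumPy sketch of the 784-100-10 network and its mini-batch SGD update is given below; the weight initialization and learning rate are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# 784-100-10 fully connected network with a ReLU hidden layer.
W1 = rng.normal(0.0, 0.01, (784, 100)); b1 = np.zeros(100)
W2 = rng.normal(0.0, 0.01, (100, 10));  b2 = np.zeros(10)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU hidden layer
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))     # stable softmax
    return h, p / p.sum(axis=1, keepdims=True)

def sgd_step(x, y_onehot, lr=0.1):
    """One mini-batch SGD step (the text uses a batch size of 200)."""
    global W1, b1, W2, b2
    h, p = forward(x)
    dz = (p - y_onehot) / len(x)                     # softmax cross-entropy grad
    dW2, db2 = h.T @ dz, dz.sum(axis=0)
    dh = (dz @ W2.T) * (h > 0)                       # back through ReLU
    dW1, db1 = x.T @ dh, dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```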

Fig. 15
figure 15

Concept diagram of memristive multi-layer neural network using Gabor filter preprocessing

During the preprocessing phase, the input image is sequentially read column-wise into the feature extraction array. The Gabor filter kernel is then applied to the input, and the filtered results are converted using an analog-to-digital converter (ADC) before being inputted into the memristor-based multi-layer fully connected network. The Rectified Linear Unit (ReLU) activation function is used, and the number of neurons in the hidden layer is set to 100.

The output of the hidden layer is processed by an ADC and subsequently input into the next layer. Notably, the size of the convolution kernel varies with the filtering wavelength. It is important to mention that in the system simulation phase, we did not model or simulate the memristor-based multi-layer neural network; our modeling and simulation efforts focused solely on the feature extraction array built from memristor arrays. For the subsequent multi-layer neural network, we used a pre-existing MATLAB implementation. Because of the convolution kernel, the number of input parameters of the next layer is reduced: the network filtered with the 0.9 Gabor wavelength becomes 676-100-10, and the networks with wavelengths 1.2 and 1.5 become 576-100-10.

Compared with the network without Gabor filtering, the network parameters are reduced by 13.6%, 26.2%, and 26.2% for wavelengths of 0.9, 1.2, and 1.5, respectively. Over 20 experiments, the highest average recognition rate on the test set was 98.25%, obtained at the wavelength of 1.2, as shown in Table 4.
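The quoted reductions can be reproduced by counting weights and biases against a 784-100-10 baseline; the feature-map sizes 26×26 and 24×24 follow from the 676 and 576 input sizes given above.

```python
def n_params(n_in, n_hidden=100, n_out=10):
    # weights + biases of an n_in-n_hidden-n_out fully connected network
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

base = n_params(28 * 28)                       # 784-100-10 baseline
for lam, n_in in ((0.9, 26 * 26), (1.2, 24 * 24), (1.5, 24 * 24)):
    reduction = 1.0 - n_params(n_in) / base
    print(f"lambda={lam}: {reduction:.1%} fewer parameters")
# -> 13.6%, 26.2%, 26.2%, matching the figures in the text
```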

Table 4 Effect of wavelength \(\lambda\) and ADC bit width on network performance

However, in practical scenarios, the output of the Gabor filtering process is still an analog output, as depicted in Fig. 10b. Consequently, to facilitate its use in subsequent applications, the output must be converted and transmitted through ADCs. The quantization error of ADCs plays a crucial role in the performance of neural networks. In the circuit design section, our focus was on implementing the designed FSCA algorithm; in the system simulation part, we conduct experiments to examine the influence of ADC quantization error on system applications. The experimental approach adds a Gabor filter layer based on the memristor model before the multi-layer neural network in the system simulation. Additionally, we introduce linear ADCs at the output unit, which can simulate the quantization accuracy of ADCs to some extent and positively affect subsequent network performance (Peng et al. 2020). In this study, we only consider quantization in the feature extraction layer and directly connect its output to the subsequent multi-layer fully connected network.
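The linear ADC at the output can be modeled as uniform quantization over the signal's full-scale range; the range handling below (defaulting to the signal's min/max) is an assumption for illustration.

```python
import numpy as np

def linear_adc(x, bits, lo=None, hi=None):
    """Uniformly quantize an analog signal to 2**bits levels."""
    x = np.asarray(x, dtype=np.float64)
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo) * levels)   # digital codes
    return lo + codes / levels * (hi - lo)            # reconstructed values
```

At 8 bits the reconstruction error stays within half an LSB, while at 2 bits the output collapses to at most four levels, mirroring the accuracy drop reported below.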

We conducted tests to assess the impact of 2-bit, 4-bit, 6-bit, and 8-bit ADCs under different wavelength conditions. The experimental results are presented in Table 4. Our findings indicate that the quantization error of ADCs in the feature extraction layer has minimal impact on the performance of the subsequent network; however, when the quantization accuracy decreases to 2 bits, network performance drops significantly. We attribute this to the Gabor filter layer primarily extracting image features rather than performing classification directly. Furthermore, the MNIST recognition task is relatively simple and does not involve complex information such as color and texture, which explains why a network trained on a binarized MNIST dataset still achieves high recognition accuracy (Kayumov et al. 2022).

Because of the non-ideal characteristics of the memristor device itself, the factors that degrade the performance of a memristor network are the cycle-to-cycle and device-to-device variations when writing to memory devices. Cycle-to-cycle variation arises because the change in conductance is not the same each time the same set voltage is applied to the same memristor; it can be simulated by adding Gaussian noise to the conductance change of the memristor and can be expressed as

$$\Delta G \sim \mathrm{N}\!\left( \Delta G,\,\sigma _{C2C}^{2} \right),\quad \sigma _{C2C} = \gamma \,\Delta G$$
(12)

We test the classification accuracy at different wavelengths, and the results in Fig. 16a show that our network can effectively withstand cycle-to-cycle variation of the memristors within 20%; beyond 30%, the classification accuracy drops obviously.
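Eq. (12) can be simulated by drawing each programmed conductance change from a Gaussian whose standard deviation scales with the nominal change; the nominal update magnitude below is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_update(delta_g, variation):
    """Cycle-to-cycle variation per Eq. (12): the realized conductance
    change is drawn from N(delta_g, (variation * delta_g)^2)."""
    delta_g = np.asarray(delta_g, dtype=np.float64)
    return rng.normal(loc=delta_g, scale=variation * np.abs(delta_g))

# e.g. 20% variation on a nominal 1 uS update, sampled many times
samples = noisy_update(np.full(100000, 1e-6), variation=0.20)
```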

Fig. 16
figure 16

Accuracy of identifying MNIST datasets under cycle-to-cycle and device-to-device variation. a Cycle-to-cycle variation. b Device-to-device variation

Device-to-device variation can be simulated by adding Gaussian noise to the coefficients in the differential equation of the memristor, \(k_{off} \sim \mathrm{N}(k_{off},\sigma _{dp}^2)\) and \(k_{on} \sim \mathrm{N}(k_{on},\sigma _{dn}^2)\). As shown in Fig. 16b, we find that the recognition rate of the network is unaffected by device variation up to 20% and decreases significantly above 30%.

Table 5 Parameters of two kinds of memristor

5 Conclusion

This work presents a full-size convolutional computational circuit based on a 1T3M memristor crossbar that effectively suppresses the sneak-path issue in the array structure, while SPICE circuit simulations verify the significant advantages of the design in terms of computation time and power consumption. Table 5 summarizes the comparison with other state-of-the-art techniques. Our work achieves the highest computational throughput, completing 7225 kernel convolution computations in a single operation without increasing the area of the computational unit. Additionally, our design not only exhibits excellent computational performance but also has the lowest power consumption for single-convolution operations. We also verify that Gabor filtering implemented with our design scheme improves the recognition rate on the MNIST dataset to 98.25%, reduces the network parameters by 26.2%, and remains robust against various non-idealities of the memristive synaptic devices up to 30%.