Introduction

Retinal ganglion cells (RGCs) play a crucial role in transmitting visual information from the retina to the central nervous system [1]. Located near the inner surface of the retina, these cells receive visual inputs from photoreceptors and propagate the signals to the brain via intermediate neurons such as bipolar and amacrine cells [2]. The resulting spike trains, representing the neural activity patterns generated by visual stimuli, are of great interest in the field of retinal spike decoding [3]. The ability to accurately reconstruct visual scenes from retinal spike trains has significant implications for understanding visual perception and developing medical interventions for visual impairments [4].

Despite recent advances in retinal imaging techniques [5], conventional spike decoding techniques remain inadequate for accurately reconstructing visual scenes from retinal data. While they can detect and analyze spike activity in the retina [6], decoding the complex relationship between retinal spikes and visual stimuli remains a challenging task. Moreover, the challenge is compounded by the specificity and information selectivity of RGCs, whose complex coding rules are computed selectively for specific stimulus features only [7]. These highly nonlinear processing rules shape the retinal output and demand more sophisticated techniques to decode it.

Artificial neural networks (ANNs) [8], such as the convolutional neural network (CNN), the recurrent neural network (RNN), and the spiking neural network (SNN), are machine learning models inspired by the structure and function of the biological brain [9] and offer a potential solution to this problem. These models are well suited to decoding retinal spike trains because of their ability to learn the complex relationship between the input signal and the target output. Additionally, vector quantization (VQ) [10] is an established technique for compressing data: it represents sets of scalars as vectors, which are then quantized in the vector space, compressing the data without significant loss of information. Based on this idea, this paper presents a novel neural network framework called the vector quantization fully connected convolutional decoding network (VQ-FCDnet), designed for decoding retinal spike trains. The model achieves outstanding efficacy and superior quality in reconstructing natural visual scenes from recorded RGC spike trains.

Our proposed framework comprises three major steps. First, during the encoding process, the visual scene is represented as spike trains using either retinal cells or simulation software. Second, the spike trains are converted into feature maps using a fully connected network [11], followed by convolutional operations that further extract and compress the features. Finally, during the decoding process (i.e., spikes to image), a nearest neighbour search is performed to find, for each vector in the feature map, the closest embedding vector in a shared embedding codebook, thereby deriving a new feature map. This feature map is then decoded using convolutional neural networks [12] and transposed convolutional neural networks [13] to reconstruct the visual scene.

The main contributions of this paper are as follows:

  • We propose a new neural network model for decoding retinal spike trains. The network has a simple structure and directly reconstructs visual scenes from retinal spike trains. It achieves higher scores than other network architectures on evaluation indicators such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and mean square error (MSE).

  • The proposed network structure based on vector quantization aggregates similar spike-train features and separates dissimilar ones when reconstructing visual scenes, providing a better scheme for recreating visual scenes from spike trains.

  • In the proposed network, the retinal spike trains are not decoded directly; instead, the most similar features are retrieved from the trained embedding codebook and used for decoding, which increases the stability and noise immunity of the network.

The paper is organized as follows. “Related work” reviews the related work. “Proposed method” describes the detailed decoding process of VQ-FCDnet and the loss function used for training the network. “Experimental results and comparative analysis” presents the evaluation criteria, experimental configurations, datasets, and results, as well as a comparison with existing methods for reconstructing visual scenes from RGC spike trains. “Future scope” suggests future work and “Conclusion” concludes the paper.

Related work

In general, an ideal visual neuron decoder [14] should be able to reconstruct stimuli from neural responses and thus clearly restore the visual scene. However, reconstructing visual scenes from visual neuron spike trains is a complex and difficult task. Neurons generate noise when processing visual information [15], so every spike train is contaminated by noise, which makes accurate reconstruction of visual information more difficult [16]. Moreover, during the transmission of spike trains, some spikes may be lost owing to the complex connections between neurons. The lost spikes may carry critical information, which further interferes with the reconstruction of visual information.

There are traditional and neural network-based methods for reconstructing visual scenes with retinal neuron decoders. Studies on traditional methods are as follows. Pillow et al. [17] proposed a model consisting of a leaky integrate-and-fire spike generator driven by a linearly filtered stimulus, post-spike currents, and a Gaussian noise current. This model can be used to derive an explicit maximum likelihood decoding rule [18] for neural spike trains and captures the stimulus selectivity of primate RGC light responses. Ariadna et al. [19] combined a multi-electrode array (MEA) with software capable of characterizing and grouping spikes based on principal component analysis (PCA) [20]. They also used different clustering algorithms to localize the responses to moving stripes crossing the visual field in eight directions. The receptive field of each cell was then used to reconstruct complex visual stimuli. However, the reconstructed scenes are very blurred, and only greyscale images can be reconstructed.

Studies on visual scene reconstruction from retinal spike signals using neural networks are as follows. Kim et al. [21] combined a low-pass linear decoder and a high-pass nonlinear decoder to obtain preliminary reconstruction results, which were then fed into a neural network to reconstruct visual scenes. This combined approach could only reconstruct simple visual scenes such as bicycle tyres, cylinders, and simple black-and-white textures. Zhang et al. [22] designed an SID model comprising a spike-to-image converter and an image-to-image autoencoder, implementing an end-to-end decoder that reconstructs visual scenes directly from spike signals. By combining a fully connected network (FCN) with a capsule network (CapsNet) [23], Li et al. [24] designed a retinal spike train decoder trained with a loss function based on SSIM and L1. Its reconstructed images are better than those generated by previous methods but still suffer from blurring. Although visual scene reconstruction has been studied for many years, the decoding performance of existing methods still needs further improvement, especially for complex visual scenes. Thus, neural decoding of visual scenes remains a challenge that requires further development and innovation.

VQ is a lossy data compression method based on block coding rules. Its basic idea is to compress data without losing much information by transforming a number of scalar values into a vector and then quantizing it as a whole in vector space. The vector quantized variational autoencoder (VQVAE) [25], built on the concept of VQ, is an advanced framework for image generation similar to the traditional autoencoder and the variational autoencoder (VAE) [26], but with an added quantization step that maps continuous values to discrete codes. This quantization step improves the ability of the network to compress the data, while the variational component helps ensure that the generated output remains of high quality. By quantizing features at two scales, VQVAE-2 [27] models the local and global structure of an image, with bottom-level features used to extract local information and top-level features used to extract global information. This enables the generation of larger and clearer images. Inspired by VQVAE, we also use VQ to quantize the features extracted from RGC spike trains.

Proposed method

Figure 1 illustrates the comprehensive process of encoding visual scene stimuli and reconstructing the visual scene. Initially, the input image is encoded into RGC spike trains by an encoder, such as retinal cells or retinal simulation software (refer to “Retinal spike dataset” for more information). Subsequently, the RGC spike trains undergo a flatten operation and are fed into our novel VQ-FCDnet, which produces the decoded image. In this section, we present an in-depth analysis of the architecture of the proposed VQ-FCDnet, elucidating the function of each module, the loss function, and the training method.

Fig. 1: Process description: the image is encoded into spike trains by retinal cells, and the spike trains are sent to VQ-FCDnet, which decodes them to obtain the decoded image

Network architecture

The architecture of VQ-FCDnet is mainly composed of three blocks (as shown in Fig. 1): feature extraction and compression network (FECN); vector quantization layer; and reconstruction network (REN).

FECN module

The FECN module, as shown in Fig. 2, performs extraction and compression of feature maps from the spike signals. It consists of a fully connected feature extraction block (FEB) and a convolutional feature extraction and compression block (CECB). The FEB performs shallow extraction of the spike signals through four fully connected layers, and its structure is shown in Fig. 3. The sizes of the one-dimensional features generated by these four fully connected layers are 8192, 4096, 8192, and 16,384, respectively. We employ the LeakyReLU activation function in the first three layers to enhance the nonlinearity of the feature extraction process. This function allows the network to capture more complex relationships between the input spike signals and the extracted features, thereby improving the representational power of the network.

Fig. 2: FECN module

Fig. 3: Structure of FEB

If the size of the input spike signal after the flatten operation is 10,000, the output of layer i of the FEB is given by

$$\begin{aligned} O_{FEB}^{[i]} = \psi _{(\alpha )}^{[i]}\left( W^{[i]} O_{FEB}^{[i-1]} + b^{[i]}\right) , \end{aligned}$$
(1)

where \(O_{FEB}^{[i-1]}\) represents the output of layer \(i-1\) of the FEB, and \(\psi _{(\alpha )}^{[i]}\) is the LeakyReLU activation function of the i-th layer, where \(\alpha \) is the negative slope of LeakyReLU. \(W^{[i]}\) and \(b^{[i]}\) are, respectively, the weight matrix and bias vector of the i-th fully connected layer.

The output of the FEB, \(O_{{FEB}(16384)}\), is the output of the fourth layer. Since the fourth layer does not use an activation function, \(\psi _{(0.2)}^{[4]}\) reduces to the identity mapping. This is mathematically represented as

$$\begin{aligned} O_{{FEB}(16384)} = O_{FEB}^{[4]} = \psi _{(0.2)}^{[4]}\left( W^{[4]} O_{FEB}^{[3]} + b^{[4]}\right) . \end{aligned}$$
(2)
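For concreteness, the following is a minimal PyTorch sketch of the FEB under the dimensions stated above (a flattened 10,000-dimensional spike input and layer widths 8192, 4096, 8192, and 16,384, with LeakyReLU of negative slope 0.2 after the first three layers only); the class and variable names are illustrative and not taken from the original implementation.

```python
import torch.nn as nn

class FEB(nn.Module):
    """Fully connected feature extraction block: four linear layers with
    LeakyReLU (negative slope 0.2) after the first three layers only (Eqs. (1)-(2))."""
    def __init__(self, in_dim=10000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 8192), nn.LeakyReLU(0.2),
            nn.Linear(8192, 4096), nn.LeakyReLU(0.2),
            nn.Linear(4096, 8192), nn.LeakyReLU(0.2),
            nn.Linear(8192, 16384),   # fourth layer: no activation
        )

    def forward(self, spikes):        # spikes: (batch, 10000) flattened spike trains
        return self.net(spikes)       # O_FEB(16384): (batch, 16384)
```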

The convolutional feature extraction and compression block uses two residual blocks (ResBlock) [28] and a channel attention module (CAM) to finely extract a feature map of size (256, 8, 8) from the feature \(O_{{FEB}(16384)}\). A convolutional layer then compresses this 256-channel feature map into a 128-channel feature map. The structure of this block is shown in Fig. 4.

Fig. 4: Structure of CECB

A residual block is a skip-connected convolutional structure that preserves the pre-convolution features alongside the newly extracted ones. It consists of two convolutional modules and a skip connection, with kernel sizes of \(3\times 3\) and \(1\times 1\), respectively. The ResBlock function is mathematically expressed as

$$\begin{aligned} Res_{(c,h)}(x) = \phi (\phi (x) \odot K_{(3\times 3, h, 1, 1)}) \odot K_{(1\times 1, c, 1, 0)} + x , \end{aligned}$$
(3)

where \(Res_{(c,h)}(x)\) represents the output of the ResBlock, c is the number of channels in the input and output feature maps, h is the number of channels of the hidden convolutional layer, and x is the input feature map. \(\phi \) is the ReLU activation function [29] and \(\odot \) denotes convolution. \(K_{(3\times 3, h, 1, 1)}\) and \(K_{(1\times 1, c, 1, 0)}\) are convolution kernels with sizes \(3\times 3\) and \(1\times 1\), depths h and c, strides 1 and 1, and paddings 1 and 0, respectively.
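A minimal PyTorch sketch of the residual block described by Eq. (3) is given below, assuming the channel settings used later in Eq. (5) (c = 256 input/output channels, h = 64 hidden channels); the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block of Eq. (3): ReLU -> 3x3 conv (c -> h) -> ReLU -> 1x1 conv (h -> c) -> skip."""
    def __init__(self, channels=256, hidden=64):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, hidden, kernel_size=3, stride=1, padding=1)
        self.conv1 = nn.Conv2d(hidden, channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        y = self.conv3(F.relu(x))     # phi(x) convolved with K_(3x3, h, 1, 1)
        y = self.conv1(F.relu(y))     # phi(.) convolved with K_(1x1, c, 1, 0)
        return y + x                  # skip connection
```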

CAM [30] is a module that enhances the feature representation capabilities of each channel by learning the correlation between channels, thereby improving the network performance. First, it compresses the feature map into two vectors through global average pooling and global maximum pooling, respectively. A two-layer fully connected network is then used to perform a nonlinear transformation on the two vectors to obtain two weight vectors. The two weight vectors are added, and a sigmoid activation function is used to obtain the weight vectors between different channels of the feature map. Finally, the weighted feature map is obtained by multiplying the feature map with the weight vectors. The operation of CAM is mathematically expressed as

$$\begin{aligned} CAM_{(a,r)}(x)&= \theta ((W_2\phi (W_1 \alpha (x)+ b_1) + b_2)\nonumber \\&\quad + (W_2\phi (W_1 \beta (x)+ b_1) + b_2)) *x , \end{aligned}$$
(4)

where \(CAM_{(a,r)}(x)\) represents the output of CAM, a is the number of channels in the input and output feature maps and the output dimension of the second fully connected layer, and r is the output dimension of the first fully connected layer. x is the input feature map, \(\theta \) is the sigmoid activation function, \(\phi \) is the ReLU activation function, \(*\) denotes channel-wise multiplication, \(\alpha \) is global average pooling, and \(\beta \) is global max pooling. \(W_1\), \(W_2\) are the weight matrices, and \(b_1\), \(b_2\) are the bias vectors of the two fully connected layers. Therefore, if \(\gamma \) denotes the reshape operation, the process of CECB is given by

$$\begin{aligned} Z_{e}(x)&= O_{CECB(128,8,8)} \nonumber \\&= CAM_{(256,16)} (Res_{(256,64)}\nonumber \\&\quad (Res_{(256,64)} (\gamma (O_{{FEB}(16384)})))) \odot K_{(1\times 1, 128, 1, 0)} . \end{aligned}$$
(5)

Through these two blocks, the input spike train is converted into a feature map \(Z_{e}(x)\).
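The CAM of Eq. (4) and the overall CECB of Eq. (5) can be sketched as follows, reusing the ResBlock class from the previous sketch; the shared two-layer MLP over the pooled descriptors and the reshape \(\gamma \) to (256, 8, 8) follow the description above, while the layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    """Channel attention module (Eq. (4)): a shared two-layer MLP applied to the
    globally average- and max-pooled descriptors, summed, passed through a sigmoid,
    and used to rescale the channels of the input feature map."""
    def __init__(self, channels=256, reduced=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, reduced)   # W1, b1
        self.fc2 = nn.Linear(reduced, channels)   # W2, b2

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = F.adaptive_avg_pool2d(x, 1).flatten(1)        # alpha(x): (B, C)
        mx = F.adaptive_max_pool2d(x, 1).flatten(1)         # beta(x):  (B, C)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(avg)))
                          + self.fc2(F.relu(self.fc1(mx))))
        return x * w.view(x.size(0), -1, 1, 1)              # channel-wise reweighting


class CECB(nn.Module):
    """Convolutional feature extraction and compression block (Eq. (5))."""
    def __init__(self):
        super().__init__()
        self.res1 = ResBlock(256, 64)
        self.res2 = ResBlock(256, 64)
        self.cam = CAM(256, 16)
        self.compress = nn.Conv2d(256, 128, kernel_size=1, stride=1, padding=0)

    def forward(self, feb_out):                 # feb_out: (B, 16384) from the FEB
        x = feb_out.view(-1, 256, 8, 8)         # gamma: reshape to (256, 8, 8)
        return self.compress(self.cam(self.res2(self.res1(x))))   # Z_e(x): (B, 128, 8, 8)
```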

Vector quantization

Vector quantization is the discrete quantization of the input feature map into a feature map composed of vectors from a codebook. First, we define a latent embedding codebook of dimension [1024 \(\times \) 128], where \(K=1024\) is the number of embeddings and \(D=128\) is the dimensionality of each latent embedding vector. Thus, the codebook is \(E = [e_1, e_2, \ldots , e_K] \in R^{K \times D}\).

Next, we take the feature map \(Z_{e}(x)\) (an \(8 \times 8\) grid of 128-dimensional vectors) extracted by the FECN. For each of the \(8 \times 8 = 64\) 128-dimensional vectors, we use the nearest neighbour search method to find the nearest embedding vector \(e_i\) in the embedding codebook and represent it by its index, resulting in the index table Z (of size \(8 \times 8\), as shown in Fig. 1).

Finally, we look up the \(8 \times 8 = 64\) embedding vectors \(e_i\) corresponding to the index table Z in the codebook, replace the 64 vectors of the feature map \(Z_{e}(x)\) according to their positions in the index table Z, and obtain the feature map \(Z_{q}(x)\) used for the subsequent reconstruction. The vector quantization algorithm is presented as the following pseudo code:

Algorithm 1: Vector quantization
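Since the algorithm figure itself is not reproduced here, the following is a minimal PyTorch sketch of the nearest-neighbour quantization step described above, with K = 1024 codebook entries of dimension D = 128; the distance computation and the class name are illustrative.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization against a (K = 1024, D = 128) codebook."""
    def __init__(self, num_embeddings=1024, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_embeddings, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_embeddings, 1.0 / num_embeddings)

    def forward(self, z_e):                                  # z_e: (B, 128, 8, 8)
        B, D, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)        # (B*64, 128)
        # squared Euclidean distance from every feature vector to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)                         # nearest codebook entry per vector
        z_q = self.codebook(indices).view(B, H, W, D).permute(0, 3, 1, 2)
        return z_q, indices.view(B, H, W)                    # Z_q(x) and the 8x8 index table Z
```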

REN module

The REN module reconstructs a feature map into a visual scene. The module (shown in Fig. 5) consists of two main blocks: a convolutional restore block (CRB) and a transposed convolution reconstruction block (TRB).

Fig. 5: REN architecture

The CRB restores the input feature map \(Z_{q}(x)\) to a feature map \(O_{CRB}\) through a convolutional layer and two residual blocks, enhancing the representational capacity of the feature map. The steps involved can be mathematically expressed as

$$\begin{aligned} O_{CRB(256,8,8)} = Res_{(256,64)}(Res_{(256,64)}(Z_{q}(x) \odot K_{(3\times 3, 256, 1, 1)})) . \end{aligned}$$
(6)

The TRB is the last component of our proposed VQ-FCDnet, which employs two transposed convolutions to reconstruct the visual scene. It plays a vital role in the overall architecture, as it is responsible for mapping the feature map to its corresponding target values. By doing so, it is able to provide a high-quality visual representation of the scene, thus enhancing the overall performance of the network. The attention to detail in this module ensures that the spatial location of each feature map is precisely aligned, resulting in a more accurate and realistic reconstruction of the visual scene. The role of TRB is expressed as

$$\begin{aligned} O_{image(1,32,32)}&=O_{TRB(1,32,32)}= \phi (O_{CRB(256,8,8)}\nonumber \\&\quad \oslash T_{(4\times 4, 128, 2, 1)}) \oslash T_{(4\times 4, 1, 2, 1)} , \end{aligned}$$
(7)

where \(\oslash \) denotes transposed convolution, and \(T_{(4\times 4, 128, 2, 1)}\) and \(T_{(4\times 4, 1, 2, 1)}\), respectively, represent transposed convolution kernels with sizes of \(4\times 4\) and \(4\times 4\), depths of 128 and 1, strides of 2 and 2, and paddings of 1 and 1.
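A minimal PyTorch sketch of the REN module corresponding to Eqs. (6) and (7) is shown below, reusing the ResBlock class defined earlier; the input channel count of 128 for \(Z_q(x)\) follows Eq. (5), and the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class REN(nn.Module):
    """Reconstruction network: CRB (Eq. (6)) followed by TRB (Eq. (7)),
    mapping the (128, 8, 8) quantized feature map to a (1, 32, 32) image."""
    def __init__(self):
        super().__init__()
        self.crb = nn.Sequential(       # CRB: 3x3 conv back to 256 channels + two ResBlocks
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            ResBlock(256, 64),
            ResBlock(256, 64),
        )
        # TRB: two 4x4 transposed convolutions, stride 2, padding 1 (8 -> 16 -> 32)
        self.up1 = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(128, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, z_q):                             # z_q: (B, 128, 8, 8)
        x = self.crb(z_q)                               # O_CRB: (B, 256, 8, 8)
        return self.up2(F.relu(self.up1(x)))            # O_image: (B, 1, 32, 32)
```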

Loss function

Similar to VQVAE, the total loss of VQ-FCDnet consists of a reconstruction loss, a codebook loss, and a commitment loss. The reconstruction loss is used to optimize the FECN and REN modules. Owing to the non-differentiability of the argmin operation, the gradient of the reconstruction error cannot be transmitted back to the FECN. We use a straight-through estimator to solve this problem: the gradient with respect to \(Z_q(x)\) is copied directly to \(Z_e(x)\), and the reconstruction loss is given by

$$\begin{aligned} L_{re} = \log {p(x|Z_q(x))} . \end{aligned}$$
(8)

Although the gradient of the reconstruction error is transmitted to the encoder by the straight-through estimator [31], the embedding vectors \(e_k\) do not receive any gradient from the reconstruction error, which means that the embedding codebook cannot be learned through this path.
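In code, the straight-through estimator is commonly realized with a detach trick, as in the VQVAE literature; a minimal sketch (assuming \(Z_e(x)\) and \(Z_q(x)\) as produced by the sketches above) is:

```python
import torch

def straight_through(z_e: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
    """Forward pass returns z_q; in the backward pass the gradient flows to z_e
    unchanged, while the codebook vectors receive no gradient from this path."""
    return z_e + (z_q - z_e).detach()
```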

The codebook loss is determined using a simple dictionary learning method [32], which computes the L2 error [33] between the output \(Z_e(x)\) of the FECN and the corresponding quantized embedding vector \(e_k\). However, to stabilize the training, improve the performance and generalization ability of the model, and prevent problems such as gradient explosion or vanishing during training, we instead use an exponential moving average (EMA) [34] to update the codebook independently.

As the training is based on mini-batch, the updated mathematical expression for embedding the codebook is presented as

$$\begin{aligned} e_i^{(t)} = \frac{m_i^{(t)}}{N_i^{(t)}} = \frac{\lambda m_i^{(t-1)}+(1-\lambda ) \sum _{j=1}^{n_i^{(t)}}z_{i,j}^{(t)}}{\lambda N_i^{(t-1)}+(1-\lambda )n_i^{(t)}} , \end{aligned}$$
(9)

where \(e_i\) is a vector in the codebook, (t) and \((t-1)\), respectively, denote the current and previous time steps, \(z_{i,1}, z_{i,2}, \ldots , z_{i,n_i}\) are the \(n_i\) FECN output vectors assigned to the embedding vector \(e_i\) by nearest neighbour search, \(m_i\) is their sum, \(N_i\) is the corresponding running count, and \(\lambda \) is the decay value, which is set to 0.99 in this experiment.
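A mini-batch EMA codebook update following Eq. (9) can be sketched as follows; the small constant eps is a practical safeguard against division by zero for unused codes and is not part of Eq. (9), and the function name and tensor layout are illustrative.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, ema_m, ema_n, flat_z_e, indices, decay=0.99, eps=1e-5):
    """EMA update of Eq. (9).
    codebook: (K, D) embedding matrix, ema_m: (K, D) running sums m_i,
    ema_n: (K,) running counts N_i, flat_z_e: (B*64, D) FECN outputs,
    indices: (B*64,) nearest-neighbour assignments from the quantizer."""
    K = codebook.size(0)
    one_hot = torch.zeros(flat_z_e.size(0), K, device=flat_z_e.device)
    one_hot.scatter_(1, indices.unsqueeze(1), 1.0)
    n_i = one_hot.sum(0)                       # n_i: vectors assigned to each code this batch
    m_i = one_hot.t() @ flat_z_e               # sum of the assigned vectors z_{i,j}
    ema_n.mul_(decay).add_(n_i, alpha=1 - decay)
    ema_m.mul_(decay).add_(m_i, alpha=1 - decay)
    codebook.copy_(ema_m / (ema_n.unsqueeze(1) + eps))   # e_i = m_i / N_i
```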

The commitment loss [35] is mainly used to constrain the consistency between the FECN output and the embedding codebook, so as to avoid significant changes in the FECN output. It directly computes the L2 error between the FECN output \(Z_e(x)\) and the corresponding quantized embedding vector \(e_k\), i.e.,

$$\begin{aligned} L_{com} = ||Z_e(x)-sg[e]||_2^2, \end{aligned}$$
(10)

where sg refers to a stop-gradient operation that blocks gradients from flowing through e.

Therefore, the overall training objective is

$$\begin{aligned} {L = \log {p(x|Z_q(x))} + \beta ||Z_e(x)-sg[e]||_2^2} . \end{aligned}$$
(11)

The reconstruction loss is applied to FECN and REN, and the commitment loss is used to constrain FECN. Here, \(\beta \) is the weighting coefficient, which is set as 0.25. EMA updates the codebook independently regardless of the type of optimizer and update rules.
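For illustration, the objective of Eq. (11) can be written as follows, realizing the log-likelihood term as an MSE reconstruction loss (a common choice corresponding to a Gaussian output assumption, not spelled out in the text); the codebook itself is updated by EMA and therefore contributes no term here.

```python
import torch
import torch.nn.functional as F

def vq_fcdnet_loss(x, x_rec, z_e, z_q, beta=0.25):
    """Total training objective (Eq. (11)): reconstruction loss on FECN/REN plus
    the commitment loss (Eq. (10)) constraining the FECN output; sg[e] is
    implemented with detach()."""
    rec_loss = F.mse_loss(x_rec, x)              # reconstruction term (Eq. (8))
    commit_loss = F.mse_loss(z_e, z_q.detach())  # ||Z_e(x) - sg[e]||^2
    return rec_loss + beta * commit_loss
```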

Training model

The model is trained with the Adam optimizer using L as the loss function. The learning rate is 0.0001, and training is terminated once the validation loss (L) has not decreased for 50 consecutive iterations, i.e., the model is considered to have been trained to its optimal level.
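The training procedure can be sketched as below, matching the stated settings (Adam, learning rate 1e-4, early stopping after 50 consecutive iterations without a drop in validation loss); model.encode, model.quantize, and model.decode are illustrative placeholders for the FECN, the vector quantizer, and the REN, and vq_fcdnet_loss is the sketch above.

```python
import torch

def train(model, train_loader, val_loader, device="cuda", lr=1e-4, patience=50):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, wait = float("inf"), 0
    while wait < patience:                                   # early stopping on validation loss
        model.train()
        for spikes, images in train_loader:
            spikes, images = spikes.to(device), images.to(device)
            z_e = model.encode(spikes)                       # FECN
            z_q, _ = model.quantize(z_e)                     # vector quantization + straight-through
            x_rec = model.decode(z_q)                        # REN
            loss = vq_fcdnet_loss(images, x_rec, z_e, z_q)
            opt.zero_grad(); loss.backward(); opt.step()
            # the EMA codebook update of Eq. (9) would be applied here, per batch
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for spikes, images in val_loader:
                spikes, images = spikes.to(device), images.to(device)
                z_e = model.encode(spikes)
                z_q, _ = model.quantize(z_e)
                val_loss += vq_fcdnet_loss(images, model.decode(z_q), z_e, z_q).item()
        val_loss /= len(val_loader)
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
```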

Experimental results and comparative analysis

This section presents the dataset, retinal spike generation software, and the experiments conducted for performance analysis.

Retinal spike dataset

The dataset consists of salamander retinal ganglion cell responses to natural images. It was built from the collection of Liu et al. [36] and contains multi-electrode array recordings of retinal ganglion cell spike activity measured in isolated salamander retinas. The stimuli were sequences of 300 natural images plus 1 black screen (\(-\)100% contrast), 1 grey screen (0% contrast), and 1 white screen (\(+\)100% contrast). The dataset is a \(156\,\times \,303\,\times \,13\,\times \,300\) four-dimensional binary matrix of 0s and 1s corresponding to “spike” (“1”) or “no spike” (“0”). The dimensions of the matrix correspond to 156 cells, 303 images, 13 trials, and 300 time bins. In this paper, we select the spike trains of 20 different images from one of the 13 trials as the test set, and the remaining sequences as the training set.

Retinal simulation software: PRANAS

PRANAS [37] is a powerful retinal simulation software package that provides preset retinal profiles for configuring the corresponding retinal parameters. Once a visual scene is input into the software, the corresponding spike trains are generated by the simulated retina. In this paper, we use the default settings. The stimulus duration for each image is 100 ms, and the responses of 100 neurons are recorded with a 1 ms resolution.

Multiple datasets were used for the experiments, each with 30,000 images and the corresponding spike trains of 100 neuronal cells over 100 ms, i.e., a tensor of shape (30000, 100, 100). For each dataset, 28,000 images and their spike trains were used as the training set and 2000 for testing. The datasets include the simple MNIST, the slightly more complex Fashion-MNIST, and the complex Cifar-10, COCO, and Celeba-HQ.

Fig. 6: Features of the MNIST and Fashion MNIST datasets before and after vector quantization

Configuration for experiments

We implemented the VQ-FCDnet model in PyTorch 1.8.1 with Python 3.7. Experiments were performed on a workstation configured with an Intel Xeon Gold 5118 CPU and 128 GB of RAM. The batch size was set to 256, and the model was trained on an NVIDIA Quadro P5000 GPU.

In this study, three metrics were used to evaluate the quality of the reconstructed images: SSIM [38], MSE [39], and PSNR [40]. SSIM is a full-reference metric that measures the similarity between two images relative to the initial uncompressed or undistorted image (in this study, the original visual scene stimulus). It is a perceptual model that treats image degradation as a perceived change in structural information, while also incorporating important perceptual phenomena such as luminance masking and contrast masking. The SSIM value ranges between 0 and 1, with a higher value indicating greater similarity between the two images. MSE measures the mean squared deviation between the reconstructed scene and the original visual scene; a smaller MSE value indicates a smaller deviation between the two images. Finally, PSNR, which is defined via MSE, is commonly used to quantify the reconstruction quality of distorted and lossily compressed images and videos; a higher PSNR value indicates higher reconstruction quality.
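For reproducibility, the three metrics can be computed with standard library routines, for example using scikit-image (the exact implementation used in the paper is not specified); the sketch below assumes grayscale images scaled to [0, 1].

```python
import numpy as np
from skimage.metrics import (structural_similarity,
                             peak_signal_noise_ratio,
                             mean_squared_error)

def evaluate(reconstructed: np.ndarray, original: np.ndarray) -> dict:
    """SSIM, MSE, and PSNR between a reconstructed image and the original stimulus,
    both given as 2-D float arrays in [0, 1]."""
    return {
        "SSIM": structural_similarity(original, reconstructed, data_range=1.0),
        "MSE": mean_squared_error(original, reconstructed),
        "PSNR": peak_signal_noise_ratio(original, reconstructed, data_range=1.0),
    }
```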

Experiments

Ablation study of VQ

Since VQ-FCDnet is based on vector quantization, it is essential to verify whether vector quantization plays an optimizing role in this experiment. In this paper, t-SNE [41] is used to visualize the feature vectors \(Z_e(x)\) before vector quantization and \(Z_q(x)\) after vector quantization on the MNIST and Fashion MNIST datasets. The visualizations of the feature maps are shown in Fig. 6, which demonstrates the effect of vector quantization on the feature points of the images. Prior to vector quantization, the feature points of similar images, such as Image 1 and Image 4 in MNIST and Image 2 and Image 3 in Fashion-MNIST, overlap little, while the feature points of dissimilar images, namely Image 2 and Image 3, overlap more. Following vector quantization, the feature points of the originally similar images become more consistent, while the feature points of dissimilar images become less so. Additionally, discernible differences were observed between different categories.

To sum up, before vector quantization the features of different types of images are not clearly separated, whereas after vector quantization they show obvious boundaries. After quantization, the features of similar images overlap and lie close to each other, while the overlap between the features of very different images is smaller. It can be observed that vector quantization separates different features and aggregates similar features.

Experiment on noise immunity of vector quantization

In the process of acquiring pulse signals, the presence of noise poses a significant challenge to data analysis. However, by employing vector quantization, we are able to overcome this limitation and improve the anti-noise ability of our neural network. Specifically, the creation of a codebook during training enables our network to capture the essential characteristics of the spike trains. By utilizing this codebook, our network decodes features that match the characteristics of the spike trains, thereby avoiding the direct decoding of the spike trains. Through this indirect approach, our network is able to filter out noise in the signal, ultimately leading to improved performance.

To evaluate the noise immunity of our proposed model, we conducted experiments in which varying levels of random noise were added to the pulse signal. The noisy pulse signal was then fed into two networks: one with vector quantization and one without. The results are presented in Fig. 7. The figure illustrates that the proposed model is robust against random noise. As the noise ratio increases, the model with vector quantization retains most of the signal characteristics, whereas the model without vector quantization fails to capture the original pulse signal information in the presence of noise. These results demonstrate that vector quantization plays a crucial role in enhancing the noise immunity of our proposed model.
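As one plausible realization of "adding random noise to the pulse signal" (the paper does not specify the exact noise model), the following sketch flips a fraction noise_rate of the binary spike entries at random before feeding them to the network.

```python
import torch

def add_spike_noise(spikes: torch.Tensor, noise_rate: float) -> torch.Tensor:
    """Randomly flip a fraction `noise_rate` of the entries of a 0/1-valued
    (float or integer) spike tensor. This noise model is assumed for illustration only."""
    mask = (torch.rand_like(spikes.float()) < noise_rate).to(spikes.dtype)
    return (spikes + mask) % 2      # XOR with the mask: flips entries where mask == 1
```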

Fig. 7: Differences in the presence or absence of vector quantization under different noise ratios

Figure 8 presents the experimental results of the proposed method and the method without vector quantization under various levels of noise (noise rate) on popular datasets such as MNIST, Fashion-MNIST, Celeba-HQ, Cifar-10 and Coco. The comparison curves have been drawn using reliable evaluation indicators PSNR, SSIM and MSE to assess the effectiveness of the proposed method. Our findings reveal that while the two methods exhibit similar results in low-noise scenarios, the proposed method with vector quantization outperforms the non-vector quantization methods as the noise rate increases. Specifically, we observed that the results of the non-vector quantization method exhibited a more significant reduction in reconstruction quality across all datasets.

Fig. 8: Effectiveness of vector quantization methods with various evaluation indicators under different noise ratios

The outcomes of this experiment demonstrate the superior noise immunity characteristics of the proposed model and highlight the significance of vector quantization in mitigating the impact of noise. These findings are of great significance for improving the robustness of retinal visual scene reconstruction in practical applications, especially in environments where noise interference is common.

Performance comparison with other methods

We compared the performance of the proposed network with five other methods for retinal reconstruction of visual scenes. Method I and Method II are by Zhang et al. [22] and Li et al. [24], respectively, which are considered state-of-the-art methods in the field. In addition, since the inputs are spike trains, we designed a fully connected spiking neural network based on IF neurons to evaluate the difference between the SNN [42] and the proposed network, named Method III. Furthermore, we designed a method similar to the proposed VQ-FCDnet by combining the fully connected neural network with VQVAE [25], named Method IV, to evaluate the performance of the proposed VQ-FCDnet compared to FCN+VQVAE. Finally, to evaluate the effectiveness of VQ-FCDnet, we also removed the CECB from the model, named Method V, and performed ablation experiments. We evaluated the performance of the above five methods and the proposed method on five datasets. The reconstruction visual scene effects are illustrated in Fig. 9.

Fig. 9: Comparison of the images generated by the five methods and the proposed method with the source image, showing the SSIM, MSE, and PSNR scores of each reconstructed image relative to the source image

Figure 9 shows that all methods perform well in reconstructing the simple MNIST dataset. However, on more complex datasets (e.g., Fashion-MNIST and Celeba-HQ), Method I reconstructs poorly, Method III and Method IV produce blurry images, and Method II and Method V fail to capture the details present in the results of the proposed method. For the complex datasets Cifar-10 and COCO, all methods produce blurred reconstructions compared to the original images. However, in terms of contour details, the proposed method outperforms the other five methods. In addition, we evaluated the reconstruction results of these methods using the metrics PSNR, SSIM, and MSE. The evaluation results are shown in Table 1, where the best results are highlighted in red. It is observed that the performance of the proposed method is considerably better than that of the other methods.

Table 1 Performance comparison with other methods
Fig. 10: Comparison of reconstructed images with original images from other datasets

On five different datasets, the proposed method improves PSNR and SSIM by an average of 2.016 and 0.1226, respectively, and reduces MSE by an average of 0.0109 compared to Method I proposed by Zhang et al. [22]. Compared to Method II proposed by Li et al. [24], the proposed method improves PSNR and SSIM by an average of 1.176 and 0.05, respectively, and reduces MSE by an average of 0.0055. Thus, the proposed method outperforms the latest research methods in terms of visual reconstruction performance. Compared with Method III (i.e., SNN), the proposed method has an average improvement of 1.0596 and 0.089 in PSNR and SSIM, respectively, and an average reduction of 0.011 in MSE, which proves that the proposed method outperforms SNN in visual reconstruction. Compared with Method IV (i.e., FCN+VQVAE), which has a similar network structure to the proposed method, there is an average increase of 0.785 and 0.049 on PSNR and SSIM, respectively, and an average reduction of 0.003 on MSE. The metrics prove that the structural improvement of the proposed method improves the visual reconstruction. Compared with Method V of the ablation experiment, the proposed method has an average increase of 0.205 and 0.005 on PSNR and SSIM, respectively, and an average decrease of 0.0006 on MSE, demonstrating that CECB improves the visual reconstruction.

Other experiments

To verify the effectiveness of our proposed method on a broader range of image datasets, we conducted experiments on natural image datasets as well as colour image datasets such as Celeba-HQ and Cifar-10. The results are presented in Fig. 10 and the corresponding evaluation metrics in Table 2.

Table 2 Evaluation of other datasets

Based on the restored images and evaluation metrics, it is evident that the proposed method is capable of reconstructing visual scenes with strong contrast even in the presence of a small amount of data, while also effectively reconstructing complex visual scenes in the colour image dataset. The experimental results highlight the efficacy of applying our method to a wide range of applications, such as image and video compression, image restoration and scene reconstruction. Furthermore, the performance of the proposed method on diverse datasets suggests its robustness and generalizability, paving the way for its potential integration into various real-world scenarios.

Future scope

Although our method has achieved good results at its current stage of development, the process of reconstructing visual scene stimuli from RGC spike trains is still a challenge. In future work, we will further improve the network model in the following aspects to make it closer to the function of the human eye:

  • Improve the quality of the reconstructed visual scene and ensure that the reconstructed image is similar to the natural scene in terms of clarity and colour recovery.

  • Improve the reconstruction ability of the model for small datasets.

  • Extend the model to dynamic video to realize real-time reconstruction and restore the visual scene more realistically.

Finally, we hope that our work can inspire other researchers and jointly promote the development of retinal visual scene reconstruction.

Conclusion

In this paper, we have proposed a deep network, VQ-FCDnet, for reconstructing visual scenes from retinal spike signals based on vector quantization. We first built an FECN module composed of multi-layer fully connected and convolutional neural networks to extract and compress the feature information of the spike signals. The nearest neighbour search method is then used to map the feature information onto vectors of a latent codebook. These vectors are recombined into new feature maps and sent to the REN module, composed of convolutional and transposed convolutional neural networks, to reconstruct the visual scene. Structural analysis experiments show that vector quantization has a significant effect on aggregating similar features and dispersing different features of retinal spike signals. By comparing networks with and without vector quantization under different levels of noise, it has been verified that vector quantization substantially improves the noise immunity of the network, providing a new method for reconstructing retinal visual scenes. The proposed method was evaluated on multiple datasets, and the reconstruction results were assessed with several evaluation metrics. Experimental results show that the proposed method is superior to other methods, with higher clarity, richer details, and more accurate spatial structure relationships.