Introduction

Word embedding models such as Word2Vec [1] and GloVe [2] can encode both the semantic and syntactic information of words into dense vectors. They have recently attracted a lot of attention due to their good performance in various natural language processing tasks, such as language modeling [3], parsing [4], sentence classification [5], and machine translation [6].

However, these dense representations are mostly derived from the statistical properties of large corpora and lack interpretability in each dimension of the word vectors. Several works have tried to transform dense word embeddings into sparse ones to improve interpretability. Murphy et al. introduced a matrix factorization algorithm named non-negative sparse embeddings (NNSE), applied to the co-occurrence matrix, to obtain sparse, effective, and interpretable embeddings [7]. Faruqui et al. defined an L1-regularized objective function and proposed a post-processing optimization algorithm that converts original dense embeddings into sparse or binary embeddings, which they call sparse or binary overcomplete word vectors [8]. Sun et al. introduced an algorithm that produces sparse embeddings during Word2Vec training by adding an L1 regularizer to the cost function and using the regularized dual averaging optimization algorithm [9]. For binary word embeddings, there are also rounding algorithms that convert dense vectors into discrete integer values to reduce memory. Ling et al. proposed post-processing rounding, stochastic rounding, and auxiliary update vector algorithms for word embeddings with limited memory, named truncated word embeddings [10]. The interpretability issue is mentioned in these works but not demonstrated clearly. In this paper, we aim to improve it via a brain-inspired approach, explaining each dimension of word embeddings based on neuron coding models.

In biological brains, information in areas such as the inferior temporal visual cortex, hippocampus, orbitofrontal cortex, and insula is encoded with sparse distributed representations [11]. Much experimental evidence indicates that biological neural systems use the timing of spikes to encode information [12,13,14]. The spike trains of cell activities during information transmission inspire us to combine traditional word embeddings and neuron coding models into binary embeddings. In this paper, we perform post-processing operations on original dense word embeddings to obtain binary ones, drawing inspiration from biological neuron coding models; the proposed binary embeddings occupy less space and offer better interpretability than previous models.

Related Works

Neuron Coding

Neuron coding is concerned with describing the relationship between a stimulus and the neuronal responses [15]. A great many efforts have been dedicated to developing techniques that enable the recording of the brain’s electrical activity at different spatial scales, such as single-cell spike train recording, local field potentials (LFP), and electroencephalography (EEG) [16]. Neuron coding models mainly concern how neurons encode, transmit, and decode information; their main focus is to understand how neurons respond to a wide variety of stimuli and to construct models that attempt to predict responses to other stimuli.

Neurons propagate signals by generating electrical pulses called action potentials: voltage spikes that can travel down nerve fibers. For example, sensory neurons change their activities by firing sequences of action potentials in various temporal patterns in the presence of external sensory stimuli, such as light, sound, taste, smell, and touch [16]. It is known that information about the stimulus is encoded in action potentials and transmitted through connected neurons in our brains.

There are various hypotheses on neuron coding based on recent neurophysiological findings on the biological nervous system, mainly including spike rate coding and spike time coding. In spike rate coding, only the firing rate within an interval is taken as the measure of the information carried. Rate coding was first motivated by Adrian et al.'s 1926 observation of frog cutaneous receptors, which showed that physiological neurons tend to fire more often for stronger stimuli [17]. Spike rate coding has been the main paradigm in artificial neural networks, such as sigmoidal neurons. Meanwhile, Poisson-like rate coding is widely used by physiologists to describe how neurons transmit information. Recently, some neurophysiological results have shown that, in some biological neural systems, efficient processing of information is more likely based on the precise timing of action potentials rather than on the firing rate [18,19,20]. Timing coding hypotheses [21] mostly concentrate on the timing of individual spikes; typical ones are time to first spike [22, 23], rank order coding [20, 24], latency coding [25], and phase coding [26].

In our study, we use Poisson-like coding for spike rate coding and various spiking neuron models for time coding. We try to apply these biological neuron coding hypotheses to build binary word embedding models.

Spiking Neural Network Models

Spiking neural networks (SNNs), which are heavily inspired by recent advances in neuroscience, are often referred to as the third generation of neural network models [27]. Different from traditional neural networks, SNNs use the timing of individual spikes as the means of communication and neural computation [21].

Spiking neuron models are the basis of SNNs; they describe the properties of certain cells in the nervous system that generate spikes across their cell membrane. The most well-known neuron model is the Hodgkin-Huxley model (H-H model). In 1952, Hodgkin and Huxley performed experiments on the giant axon of the squid with the voltage clamp technique, which punctures the cell membrane and allows a specific membrane voltage or current to be imposed [28]. The model was derived from these recordings and fitting results, and it describes well the changes of the ion channels and the neuron's behavior after stimulation.

In the H-H model [29], the semipermeable cell membrane separates the interior of the cell from the extracellular liquid and acts as a capacitor. Because of active ion transport through the cell membrane, the ion concentration inside the cell differs from that in the extracellular liquid. The Nernst potential generated by this difference in ion concentration is represented by a battery.

The model takes three types of channel into consideration: a sodium channel, a potassium channel, and an unspecific leakage channel with resistance R. From the definition of capacitance, C = Q/v, where Q is the charge and v is the voltage across the capacitor, it follows that:

$$ C\cdot \frac{dv}{dt} = -\sum\limits_{k} I_{k}(t)+I(t) $$
(1)

The leakage channel is described by a voltage-independent conductance gL = 1/R. The sodium and potassium channels, when open, transmit currents with maximum conductances gNa and gK, respectively. However, the channels are not always open; the probability that a channel is open is described by the additional gating variables m, n, and h. The combined action of m and h controls the Na+ channels, while the K+ channels are controlled by n.

$$ \sum\limits_{k} I_{k} = g_{Na}m^{3} h (v-E_{Na} )+g_{K}n^{4} (v-E_{K} ) + g_{L} (v-E_{L} ) $$
(2)

The parameters ENa, EK, and EL are empirical parameters and the gating variables m, n, and h are defined by differential equations [28].
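For illustration, a minimal forward-Euler sketch of Eqs. 1 and 2 in Python might look as follows; the rate functions for m, n, and h and the conductance and reversal-potential constants are the standard empirical values from Hodgkin and Huxley's fits (with the resting potential shifted to 0 mV) and are included here only as assumptions for this sketch, not as part of our model.

```python
import numpy as np

# Standard Hodgkin-Huxley constants (empirical fits; shown for illustration only).
C, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3          # uF/cm^2, mS/cm^2
E_Na, E_K, E_L = 115.0, -12.0, 10.6                 # mV, resting potential shifted to 0

def alpha_beta(v):
    """Voltage-dependent opening/closing rates for the gating variables m, n, h."""
    a_m = 0.1 * (25 - v) / (np.exp((25 - v) / 10) - 1)
    b_m = 4.0 * np.exp(-v / 18)
    a_n = 0.01 * (10 - v) / (np.exp((10 - v) / 10) - 1)
    b_n = 0.125 * np.exp(-v / 80)
    a_h = 0.07 * np.exp(-v / 20)
    b_h = 1.0 / (np.exp((30 - v) / 10) + 1)
    return a_m, b_m, a_n, b_n, a_h, b_h

def hh_step(v, m, n, h, I, dt=0.01):
    """One forward-Euler step of Eqs. 1 and 2."""
    a_m, b_m, a_n, b_n, a_h, b_h = alpha_beta(v)
    I_ion = g_Na * m**3 * h * (v - E_Na) + g_K * n**4 * (v - E_K) + g_L * (v - E_L)
    v += dt * (-I_ion + I) / C                      # Eq. 1 with the sum of Eq. 2
    m += dt * (a_m * (1 - m) - b_m * m)
    n += dt * (a_n * (1 - n) - b_n * n)
    h += dt * (a_h * (1 - h) - b_h * h)
    return v, m, n, h
```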

In addition to the H-H model, other types of spiking neuron models have been proposed, such as integrate-and-fire models and their variants, Izhikevich's neuron model, and the spike response model (SRM). Recently, SNN-based models have been applied in various AI applications, such as character recognition [30, 31], object recognition [32], image segmentation [33], speech recognition [34], robotics [35], knowledge representation [36], and symbolic reasoning [37]. In this paper, we use the leaky integrate-and-fire model and Izhikevich's neuron model to convert word embeddings into more explainable binary embeddings.

Word Embedding Models Based on Inspirations from Biological Neuron Coding

The Framework

We build unsupervised models for post-processing binary word embeddings based on two types of brain-inspired models: the homogeneous Poisson process and spiking neural networks. Starting from pre-trained word embeddings, such as Word2Vec and GloVe, these models convert the original dense embeddings into binary form. Different from traditional works on binary word representations, our models are inspired by neuroscience, which makes them biologically plausible and more interpretable.

To mimic information transmission in biological brains, we take temporal information into consideration. As Fig. 1 shows, our models combine the original dense word embeddings and neural coding algorithms to obtain the spiking times of neurons during a given period of time. We denote the original dense word embedding matrix as W with elements wid, where i = 1,2,⋯ ,|N| and d = 1,2,⋯ ,|D|; |N| is the total number of words and |D| is the dimension of each word vector. For each word, we build a neural model based on the value of each dimension, and during a given time T we record the membrane potential of each neuron every Δt, via the neural coding algorithms described in “Homogeneous Poisson Process-Based Binary Word Embeddings” and “Spiking Neural Networks Based Binary Word Embeddings.” Then the spiking time matrix S(i), which contains the spiking times of all neurons for the i-th word, is flattened into a vector f(i) by concatenating its rows head to tail. The dimension of f(i) is |D|× (T/Δt). Finally, to make our model more robust, we introduce a tolerance factor tol: we allow a window of tol × Δt to generate one binary bit, and obtain the binary word embeddings in the following way:

$$ \begin{array}{@{}rcl@{}} \mathbf{b}^{(i)} &=& [\mathcal{T}(\mathbf{f}^{(i)}_{1:tol}),\ \mathcal{T}(\mathbf{f}^{(i)}_{tol + 1: 2\,tol}),\ \cdots ,\ \mathcal{T}(\mathbf{f}^{(i)}_{(k-1)tol+1:k\,tol}),\\ &&\cdots ,\ \mathcal{T}(\mathbf{f}^{(i)}_{|D|\times (T/ {\varDelta} t) - tol+1:|D|\times (T/ {\varDelta} t)})] \end{array} $$
(3)

The \(\mathcal {T}(vector)\) operation returns 1 if the vector contains any 1s and 0 otherwise.
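A minimal sketch of this binarization step (Eq. 3), under the assumption that the spiking-time matrix S(i) is available as a binary NumPy array, might look like this; the function name binarize is ours:

```python
import numpy as np

def binarize(spike_matrix, tol):
    """Flatten S^(i) row by row and map each window of `tol` bins to one bit (Eq. 3).

    spike_matrix: binary array of shape (|D|, T/dt), 1 where a spike occurred.
    Returns a binary vector of length |D| * (T/dt) / tol.
    """
    f = spike_matrix.reshape(-1)                       # concatenate rows head to tail
    windows = f.reshape(-1, tol)                       # assumes len(f) is divisible by tol
    return (windows.max(axis=1) > 0).astype(np.uint8)  # T(.): 1 if any spike in the window

# Example: 300 dimensions, T = 10 ms, dt = 0.1 ms, tol = 10 -> a 3000-bit code
spikes = (np.random.rand(300, 100) < 0.05).astype(np.uint8)
b = binarize(spikes, tol=10)
print(b.shape)  # (3000,)
```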

Fig. 1 The framework of neuron coding-based binary embeddings

Homogeneous Poisson Process-Based Binary Word Embeddings

Poisson-like rate coding is a major approach to simulating spiking responses to stimuli. Biological recordings from the medial temporal [38, 39] and primary visual cortex [40] of macaque monkeys have shown good evidence for Poisson process-based coding.

The homogeneous Poisson process assumes that the current spike has no dependence at all on preceding spikes and that the instantaneous firing rate r is constant over time. Consider an interval (0, T) in which we place a single spike at random. If we pick a subinterval (t, t + Δt) of length Δt, the probability that the spike occurs in the subinterval equals Δt/T. When we place k spikes in (0, T), according to the binomial formula, the probability that n of them fall in (t, t + Δt) is:

$$ P\{n \ spikes \ during \ {\varDelta} t \} = \frac{k!}{(k-n)!\,n!}({\varDelta} t /T)^{n}(1-{\varDelta} t /T)^{k-n} $$
(4)

Keeping the firing rate r = k/T constant, we increase k and T together. As k → ∞, the probability becomes:

$$ P\{n \ spikes \ during \ {\varDelta} t \} = \frac{(r{\varDelta} t)^{n}}{n!}e^{-r{\varDelta} t} $$
(5)

This is the probability mass function of the Poisson distribution.
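As a quick numerical illustration of this limit (our own check, not part of the original derivation), the binomial probability of Eq. 4 approaches the Poisson probability of Eq. 5 as k grows with r = k/T held fixed:

```python
from math import comb, exp, factorial

r, dt, n = 40.0, 0.01, 1            # rate 40 Hz, window 10 ms, probability of exactly 1 spike
poisson = (r * dt) ** n / factorial(n) * exp(-r * dt)    # Eq. 5
for k in (10, 100, 1000, 10000):
    T = k / r                        # keep r = k/T constant while k grows
    p = dt / T
    binomial = comb(k, n) * p ** n * (1 - p) ** (k - n)  # Eq. 4
    print(k, round(binomial, 6), round(poisson, 6))      # binomial -> Poisson as k grows
```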

In our homogeneous Poisson process-based binary word embedding model, we treat each dimension as an independent homogeneous Poisson process and take the normalized value of the dimension, \(w_{id}^{normalized}\), as the constant firing rate. Following the spike generator, for each Δt in the interval (0, T), we compare \(w_{id}^{normalized}\cdot {\varDelta } t\) with a uniform random variable xrandom and obtain the spiking time matrix in this way:

$$ \left\{\begin{array}{ll} w_{id}^{normalized}\cdot {\varDelta} t > x_{random}: & \text{fire a spike} \\ w_{id}^{normalized}\cdot {\varDelta} t \leq x_{random}: & \text{nothing} \end{array}\right. $$
(6)
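A minimal sketch of such a spike generator, assuming the embedding values are normalized to [0, 1] and used directly as firing rates, could be:

```python
import numpy as np

def poisson_spike_matrix(w_norm, T=10.0, dt=0.1, rng=None):
    """Generate the spiking-time matrix for one word (Eq. 6).

    w_norm: normalized embedding vector of length |D|; each entry is treated as the
            constant firing rate of an independent homogeneous Poisson process.
    Returns a binary matrix of shape (|D|, T/dt).
    """
    rng = rng or np.random.default_rng()
    w_norm = np.asarray(w_norm)
    n_bins = int(T / dt)
    x_random = rng.random((len(w_norm), n_bins))        # uniform random numbers in [0, 1)
    return (w_norm[:, None] * dt > x_random).astype(np.uint8)

# Example with an illustrative 300-dimensional normalized embedding
w = np.random.rand(300)
spikes = poisson_spike_matrix(w)
print(spikes.shape)  # (300, 100)
```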

Spiking Neural Networks Based Binary Word Embeddings

The LIF-Based Binary Word Embedding Model

The leaky integrate-and-fire (LIF) neuron model, a simplified version of the H-H model, is one of the simplest spiking neuron models [41]. The LIF model is widely used because it is biologically realistic and computationally simple to analyze and simulate [31, 42, 43].

In the LIF model, as Eq. 7 shows, v is the membrane potential, τm is the membrane time constant, and R is the membrane resistance. For the LIF-based word embedding model, we replace the input current I with the product of the d-th dimension value of the i-th word and a current boost factor Iboost.

$$ \tau_{m} \frac{dv}{dt} = -v(t)+R\cdot I_{boost} \cdot w_{id}, \ \ if \ v(t) > v_{th}, \ v(t)\leftarrow v_{r} $$
(7)

In our LIF-based binary word embedding model, we regard the value Iboost·wid as the input current for the neurons, and we obtain the spiking time matrix from the record of the membrane potential v. In addition, we also try adding white noise to the current to improve robustness.
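The following is a minimal Euler-integration sketch of Eq. 7 with the embedding-driven current; the membrane resistance R, the reset potential vr, and the noise level are illustrative assumptions rather than values taken from the text:

```python
import numpy as np

def lif_spike_matrix(w, I_boost=100.0, tau_m=10.0, R=1.0, v_th=15.0, v_r=0.0,
                     T=10.0, dt=0.1, noise_std=0.0, rng=None):
    """Simulate one LIF neuron per embedding dimension (Eq. 7) and record its spikes.

    The input current for dimension d is I_boost * w[d]; optional white noise can be
    added to the current. R, v_r, and noise_std are illustrative assumptions.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(w)
    n_bins = int(T / dt)
    v = np.zeros(len(w))
    spikes = np.zeros((len(w), n_bins), dtype=np.uint8)
    for t in range(n_bins):
        I = I_boost * w + noise_std * rng.standard_normal(len(w))
        v += dt / tau_m * (-v + R * I)        # Euler step of Eq. 7
        fired = v > v_th
        spikes[fired, t] = 1
        v[fired] = v_r                        # reset after a spike
    return spikes
```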

The Izhikevich Neuron-Based Binary Word Embedding Model

The Izhikevich neuron model is not only capable of producing the rich firing patterns exhibited by real biological neurons but is also computationally simple [44]. The model uses bifurcation methodologies [45] to reduce the more biophysically accurate H-H neuron model to the following simple form:

$$ \frac{dv}{dt} = 0.04v(t)^{2} + 5v(t)+140-u(t)+I, \qquad \frac{du}{dt} = a(bv(t)-u(t)) $$
(8)

If v(t) ≥ vth, then v(t) ← c and u(t) ← u(t) + d.

In the Izhikevich neuron model, the meanings of v, vth, and vr are the same as in the LIF model, while u represents the membrane recovery variable and a, b, c, and d are four important hyper-parameters. The parameter a describes the time scale of u, b describes the sensitivity of u to the subthreshold fluctuations of v, c describes the after-spike reset value of v, caused by fast high-threshold K+ conductances, and d describes the after-spike reset of u, caused by slow high-threshold Na+ and K+ conductances.

As Izhikevich [44] shows, different choices of these four parameters can simulate different types of neurons in the mammalian brain, such as excitatory cortical cells, inhibitory cortical cells, and thalamocortical cells. In this paper, we mainly focus on excitatory and inhibitory cortical neurons. According to intracellular recordings, cortical cells can be divided into different types: for example, regular spiking (RS), intrinsically bursting (IB), and chattering (CH) for excitatory neurons, and fast spiking (FS) and low-threshold spiking (LTS) for inhibitory neurons.

In our Izhikevich neuron-based binary word embedding model, we use a combination of excitatory and inhibitory neurons at a ratio of 4:1, motivated by the ratio in the mammalian cortex [44]. As mentioned before, for each word we set up |D| neurons and take the product of the original embedding value wid and a factor Iboost as the input current. We assign each neuron to an excitatory or inhibitory sub-model, and for the different dimensions of each word we obtain the spike times according to the corresponding sub-model.
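A minimal sketch of this procedure, using the RS and FS parameter sets reported by Izhikevich [44] and an assumed Euler integration step, might look as follows:

```python
import numpy as np

# Parameter sets from Izhikevich [44]: regular spiking (RS, excitatory)
# and fast spiking (FS, inhibitory).
RS = dict(a=0.02, b=0.2, c=-65.0, d=8.0)
FS = dict(a=0.10, b=0.2, c=-65.0, d=2.0)

def izhikevich_spike_matrix(w, I_boost=100.0, v_th=30.0, T=10.0, dt=0.1,
                            exc_ratio=0.8, rng=None):
    """Simulate one Izhikevich neuron per dimension (Eq. 8) with a 4:1 RS/FS mixture."""
    rng = rng or np.random.default_rng()
    w = np.asarray(w)
    D, n_bins = len(w), int(T / dt)
    # Assign each dimension to an excitatory (RS) or inhibitory (FS) sub-model.
    is_exc = rng.random(D) < exc_ratio
    a = np.where(is_exc, RS["a"], FS["a"]); b = np.where(is_exc, RS["b"], FS["b"])
    c = np.where(is_exc, RS["c"], FS["c"]); d = np.where(is_exc, RS["d"], FS["d"])
    v = np.full(D, -65.0); u = b * v
    spikes = np.zeros((D, n_bins), dtype=np.uint8)
    I = I_boost * w
    for t in range(n_bins):
        v += dt * (0.04 * v**2 + 5 * v + 140 - u + I)    # Eq. 8
        u += dt * a * (b * v - u)
        fired = v >= v_th
        spikes[fired, t] = 1
        v[fired] = c[fired]; u[fired] += d[fired]        # after-spike reset
    return spikes
```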

Experiment Validations

Validation Tasks and Datasets

We evaluate our binary embeddings on word similarity and text classification tasks. The word similarity task is widely used to measure the degree to which word embeddings capture the similarity between two words, while the text classification task is a traditional NLP application. In our experiments, all binary word embedding models are based on two widely accepted sets of original word embeddings, namely Word2Vec [1] and GloVe [2].

For the word similarity task, we find similar words via Hamming distance, which is faster than the cosine distance traditionally used for dense embeddings, and we evaluate the embeddings on three public datasets: (1) WordSim-353, the most widely used dataset for word similarity tests, consisting of 353 pairs of words [46]; (2) SimLex-999, which consists of 999 pairs of words and provides a way of measuring how well the word embeddings capture similarity, rather than relatedness or association [47]; (3) Rare Words, which consists of 2,034 word pairs proposed by Luong et al. [48] and focuses on rare words to complement existing datasets. Each pair of words comes with a human-assigned similarity score, and we report Spearman's rank correlation coefficient between the rankings induced by the word embeddings and the human-labeled rankings.
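As an illustration of this evaluation protocol (the helper names and the use of SciPy are our assumptions), the Hamming-based similarity and the Spearman correlation could be computed as:

```python
import numpy as np
from scipy.stats import spearmanr

def hamming_similarity(b1, b2):
    """Similarity between two binary codes: fraction of matching bits."""
    return 1.0 - np.count_nonzero(b1 != b2) / len(b1)

def evaluate_word_similarity(binary_emb, pairs, human_scores):
    """Spearman correlation between model similarities and human-assigned scores.

    binary_emb: dict mapping a word to its binary code (np.uint8 vector).
    pairs: list of (word1, word2) tuples; human_scores: matching list of gold scores.
    """
    model_scores = [hamming_similarity(binary_emb[w1], binary_emb[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```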

For the text classification task, we apply a bitwise OR operation over the binary embeddings to generate the representation of a text and use a k-nearest neighbors (kNN) classifier to measure accuracy. We validate our algorithms on two public text datasets: (1) Search Snippets, a short-text dataset collected by Phan et al. [50], selected from the results of web search transactions using predefined phrases from 8 different domains; (2) Sentiment Analysis, proposed by Socher et al. [49], a treebank of sentences from movie reviews annotated with sentiment labels. The sentences in the treebank are split into train (8544), dev (1101), and test (2210) sets. We merge the train and dev parts for the kNN classifier and ignore neutral sentences, analyzing performance on only the positive and negative classes.
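A sketch of this OR-based text representation and kNN evaluation, assuming scikit-learn's KNeighborsClassifier with a Hamming metric as one possible implementation, is:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def text_to_binary(binary_emb, tokens):
    """Represent a text as the bitwise OR of the binary codes of its words."""
    codes = [binary_emb[t] for t in tokens if t in binary_emb]
    if not codes:
        return np.zeros_like(next(iter(binary_emb.values())))
    return np.bitwise_or.reduce(codes)

def knn_accuracy(binary_emb, train_texts, train_labels, test_texts, test_labels, k=5):
    """Classify texts with kNN over Hamming distance and return test accuracy."""
    X_train = np.array([text_to_binary(binary_emb, t) for t in train_texts])
    X_test = np.array([text_to_binary(binary_emb, t) for t in test_texts])
    clf = KNeighborsClassifier(n_neighbors=k, metric="hamming")
    clf.fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```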

Experiment Details and Results

In our experiments, we use the pre-trained GloVe and Word2Vec embeddings, both of which have 300 dimensions. We compare against the original dense embeddings and two binary baselines: “Overcomplete-B”, derived from Faruqui's work [8], and “Rude Binarization”, which converts the original embeddings into binary ones via a simple sign function.

For all the biological neuron coding-inspired models, we set the interval T = 10 ms and the subinterval Δt = 0.1 ms. We find the best hyper-parameters through grid search on the word similarity tasks and apply them to both tasks. For the Poisson-based, LIF-based, and LIF-with-noise-based models, tol is 5, while for the other models tol is 10. For the LIF and LIF-with-noise models, τm = 10 and vth = 15, while for the Izhikevich model, vth = 30 and the other parameters follow [44] for the different sub-models. The Iboost factors are 100 and 200 for GloVe and Word2Vec, respectively. In addition, for the Poisson coding and LIF-with-noise models, we run each experiment 10 times with different random seeds, and Table 1 reports the averages and standard deviations.
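For reference, these settings can be collected in a single configuration (the dictionary layout is ours; the values are those reported above):

```python
# Hyper-parameters reported above; the dictionary layout is our own summary.
CONFIG = {
    "T_ms": 10.0, "dt_ms": 0.1,
    "tol": {"poisson": 5, "lif": 5, "lif_noise": 5, "izhikevich": 10},
    "lif": {"tau_m": 10.0, "v_th": 15.0},
    "izhikevich": {"v_th": 30.0},          # other parameters follow [44] per sub-model
    "I_boost": {"glove": 100.0, "word2vec": 200.0},
    "n_runs_random_models": 10,            # Poisson and LIF-with-noise, different seeds
}
```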

Table 1 Results of word embeddings on the word similarity tasks

Result Analysis

From the data shown in Tables 1 and 2 and Figs. 2 and 3, we can infer the following: (1) We explore how to generate binary embeddings via biological neuron coding-inspired models. The results show that the SNN-based models perform well, while the Poisson coding-based model reflects rate coding's weakness when transforming dense information into binary bits: it cannot carry enough information to represent stimuli or patterns. (2) For the word similarity task, binary word embeddings transformed from dense word embeddings, especially the rude binarization, LIF-based, and Izhikevich-based models, achieve results similar to the original ones. (3) The LIF-based binary embedding model performs well on the word similarity tasks but rather poorly on the text classification task. This may be due to the over-simplified mechanism of the LIF model, which makes it robust for representing words while losing much semantic information; the LIF model with noise improves the text classification performance, but it is unstable and can pull down the word similarity results. (4) The Izhikevich neuron-based binary embedding model achieves excellent results on both tasks; in particular, the combination of the RS and FS neuron sub-models is the best. This model combines excitatory and inhibitory neurons to mimic the neurons in the biological brain, which makes a difference when converting the original dense embeddings into binary ones. (5) From the perspective of space occupation, a database of 3 million words (such as the public pre-trained Word2Vec vectors) with 300 dimensions takes 3.6 GB in floating point but only 1.125 GB as 3000-bit codes (tol = 10) for the Izh_RS+FS model, a reduction of approximately 68.75%. For neuron coding-based binary embedding models, the compression ratio is mainly determined by the simulation time and the tolerance factor tol.
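The space figures in point (5) can be reproduced with a short calculation (our own check of the arithmetic):

```python
n_words, dims = 3_000_000, 300
T_ms, dt_ms, tol = 10.0, 0.1, 10

dense_gb = n_words * dims * 4 / 1e9                   # 32-bit floats -> 3.6 GB
bits_per_word = dims * int(T_ms / dt_ms) / tol        # 300 * 100 / 10 = 3000 bits
binary_gb = n_words * bits_per_word / 8 / 1e9         # -> 1.125 GB
print(dense_gb, binary_gb, 1 - binary_gb / dense_gb)  # 3.6  1.125  0.6875
```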

Table 2 Summarized results of two tasks
Fig. 2 Results visualization for word similarity tasks: a spiking matrix of the word “people” in the Poisson-based model with different random seeds; b, c spiking matrix and the 3rd neuron's membrane potential for the LIF model and the Izh RS+FS model, respectively

Fig. 3 Results of the text classification task

Conclusion

In this paper, we propose three kinds of biological neuron coding-inspired models to generate binary word embeddings, which show better performance and interpretability than existing works on word similarity evaluation and text classification tasks. To the best of our knowledge, this is the first attempt to convert dense embeddings into binary ones via spike timing, and we have demonstrated its feasibility on several natural language processing applications.

Future Work

Due to the limited performance of supervised SNNs, in this paper we perform post-processing operations on given word embeddings. However, we look forward to building SNN-based language models that obtain brain-inspired word embeddings directly from raw corpora. To this end, we are trying to adjust the cost function of supervised SNNs and add biological mechanisms such as STDP to the model. Furthermore, in contrast to excitatory neocortical neurons, which fall into stereotypical morphological and electrophysiological classes, inhibitory neocortical interneurons form widely diverse classes with various firing patterns that cannot all be classified as FS or LTS [45]. In this paper, we focus on FS and LTS inhibitory neurons because their parameters in Izhikevich's neuron model are easy to obtain. In the future, we will pay more attention to more detailed types of inhibitory neuron models.