Biologically Plausible Learning of Text Representation with Spiking Neural Networks

This study proposes a novel biologically plausible mechanism for generating low-dimensional spike-based text representation. First, we demonstrate how to transform documents into series of spikes (spike trains), which are subsequently used as input in the training process of a spiking neural network (SNN). The network is composed of biologically plausible elements and trained according to the unsupervised Hebbian learning rule, Spike-Timing-Dependent Plasticity (STDP). After training, the SNN can be used to generate low-dimensional spike-based text representation suitable for text/document classification. Empirical results demonstrate that the generated text representation may be effectively used in text classification, leading to an accuracy of $80.19\%$ on the bydate version of the 20 newsgroups data set, which is a leading result amongst approaches that rely on low-dimensional text representations.


Introduction
Spiking neural networks (SNNs) are an example of biologically plausible artificial neural networks (ANNs). SNNs, like their biological counterparts, process sequences of discrete events occurring in time, known as spikes. Traditionally, spiking neurons, due to their biological validity, have been studied mostly by theoretical neuroscientists, and have become a standard tool for modeling brain processes on a micro scale. However, recent years have shown that spiking computation can also successfully address common machine learning challenges [35]. Another interesting aspect of SNNs is the adaptation of such algorithms to neuromorphic hardware, which is a brain-inspired alternative to the traditional von Neumann machine. Thanks to mimicking processes observed in brain synaptic connections, neuromorphic hardware is a highly fault-tolerant and energy-efficient substitute for classical computation [27].
Recently we have witnessed significant growth in the volume of research into SNNs. Researchers have successfully adapted SNNs for the processing of images [35], audio signals [9,39,40], and time series [18,31]. However, to the best of the authors' knowledge, there is only one work related to text processing with SNNs [37]. This state of affairs is caused by the fact that text, due to its structure and high dimensionality, presents a significant challenge for the SNN approach. The motivation of this study is to broaden the current knowledge of the application of SNNs to text processing. More specifically, we have developed and evaluated a novel biologically inspired method for generating a spike-based text representation that may be used in text/document classification tasks [25].

Objectives and summary of approach
This paper proposes a Spike Encoder for Text (SET) which generates a spike-based text representation suitable for classification tasks. Text data is highly dimensional (the most common text representation is in the form of a vector with many features) which, due to the curse of dimensionality [20,2,17,26], usually leads to overfitted classification models with poor generalisation [38,32,30,13,4].
Processing highly dimensional data is also computationally expensive. Therefore, researchers have sought text representations which may overcome this drawback [3]. One possible approach is based on the transformation of a high-dimensional feature space into a low-dimensional representation [5,36,6].
In the above context we propose the following two-phase approach to SNN-based text classification. Firstly, the text is transformed into spike trains. Secondly, the spike train representation is used as the input in the SNN training process, which is performed according to a biologically plausible unsupervised learning rule and generates the spike-based text representation. This representation has significantly lower dimensionality than the spike train representation and can be used effectively in subsequent SNN text classification. The proposed solution has been empirically evaluated on the publicly available bydate version [21] of the real-world data set known as 20 newsgroups, which contains 18 846 text documents from twenty different newsgroups of Usenet, a worldwide distributed discussion system.
Both the input and output of the SNN rely on spike representations, though of very different forms. For the sake of clarity, throughout the paper the former representation (SNN input) will be referred to as spike trains, and the latter one (SNN output) as spike-based, or spiking encoding, or low-dimensional.

Contribution
The main contribution of this work can be summarized as follows:
- To propose an original approach to document processing using SNNs and its subsequent classification based on the generated spike-based text representation;
- To experimentally evaluate the influence of various parameters on the quality of the generated representation, which leads to a better understanding of the strengths and limitations of SNN-based text classification approaches;
- To propose an SNN architecture which may potentially contribute to the development of other SNN-based approaches. We believe that the solution presented may serve as a building block for larger SNN architectures, in particular deep spiking neural networks (DSNNs) [35].

Related work
As mentioned above, we are aware of only one paper related to text processing in the context of SNNs [37] which, nevertheless, differs significantly from our approach. The authors of [37] focus on transforming word embeddings [23,28] into spike trains, whilst our focus is not only on the representation of text in the form of spike trains, but also on training the SNN encoder which generates a low-dimensional text representation. In other words, our goal is to generate a low-dimensional text representation with the use of an SNN, whereas in [37] the transformation of an existing text embedding into spike trains is proposed.
The remainder of the paper is structured as follows. Section 2 presents an overview of the proposed method; Section 3 describes the evaluation process of the method and experimental results; and Section 4 presents the conclusions.

Proposed spiking neural method
The proposed method transforms input text into a spike code and uses it as training input for the SNN to obtain a meaningful spike-based text representation. The method is schematically presented in Fig. 1. In the first phase (I), the text is transformed into a vector representation and afterwards each vector is encoded as spike trains. Once the text is encoded in the form of neural activity, it can be used as input to the core element of our method: a spiking encoder. The encoder is a two-layered SNN with adaptable synapses. During the learning phase (II), the spike trains are propagated through the encoder in a feed-forward manner and synaptic weights are modified simultaneously according to an unsupervised learning rule. After the learning process, the output layer of the spiking encoder provides a spike-based representation of the text presented to the system. In the remainder of this section all elements of the system described above are discussed in more detail.

Text vectorization
During the text-to-spike transformation phase illustrated in Fig. 1, text is preprocessed for further spiking computation. Text input data (data corpus) is organized as a set $D$ of documents $d_i$, $i = 1, \ldots, K$. In the first step a dictionary $T$ containing all unique words $t_j$, $j = 1, \ldots, |T|$, from the corpus data is built. Next, each document $d_i$ is transformed into an $M$-dimensional ($M = |T|$) vector $W_i$, the elements of which, $W_i[j] := w_{ij}$, $j = 1, \ldots, M$, represent the relevance of words $t_j$ to document $d_i$. In effect, the corpus data is represented by a real-valued matrix $W_{K \times M}$, also called the document-term matrix.
The typical weighting functions are term frequency (TF), inverse document frequency (IDF), or their combination TF-IDF [22,12]. In TF the weight $w_{ij}$ is equal to the number of times the $j$-th word appears in $d_i$ divided by the document length $|d_i|$ (the number of all non-unique words in $d_i$). IDF takes into account the whole corpus $D$ and sets $w_{ij}$ as the logarithm of the ratio between $|D|$ and the number of documents containing word $t_j$. Consequently, IDF mitigates the impact of words that occur very frequently in a given corpus and are presumably less informative from the point of view of document classification than the words occurring in a small fraction of the documents. TF-IDF sets $w_{ij}$ as the product of the TF and IDF weights. In this paper we use TF-IDF weighting, which is the most popular approach in the text processing domain.
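The weighting described above can be illustrated with a minimal sketch that follows the textual definitions literally (TF as a relative count, IDF as the logarithm of the ratio between the corpus size and the document frequency). The function name `tfidf_matrix`, the whitespace tokenisation, and the toy corpus are assumptions made for illustration only; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their values will differ.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, W), where W[i][j] is the TF-IDF weight of word j in document i."""
    # Naive whitespace tokenisation (an assumption; the paper does not specify it here).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(tokenized)

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0  # relative frequency in the document
            idf = math.log(n_docs / doc_freq[word])             # log(|D| / document frequency)
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


docs = [
    "the pitcher threw the baseball",
    "the disk controller failed",
    "the baseball game",
]
vocab, W = tfidf_matrix(docs)
```

In this toy corpus the word "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing to any document vector, which is the mitigating effect of IDF mentioned above.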

Vector to spike transformation
In order to transform a vector representation into a spike train representation, the presentation time $t_p$, which establishes for how long each document is presented to the network, and the time gap $\Delta t_p$ between two consecutive presentations must be defined. A time gap period without any input stimuli is necessary to eliminate interference between documents and to allow the dynamic parameters of the system to decay and "be ready" for the next input.
Technically, for a given document $d_i$, represented as an $M$-dimensional vector of weights $w_{ij}$, for each weight $w_{ij}$ a spike is generated in every millisecond of the document presentation with probability proportional to $w_{ij}$. Thanks to this procedure, we ultimately derive a spiking representation of the text.
In our experiments each document is presented for $t_p = 600$ ms with $\Delta t_p = 300$ ms, and the proportionality coefficient $\alpha$ is set to 1.5.
To clarify, let us consider a simple example and assume that for the word baseball the corresponding weight $w_{ij}$ in some document $d_i$ is equal to 0.1. Then, for each millisecond of the presentation time, the probability of emitting a spike $P(\text{spike} \mid \text{baseball})$ equals $\alpha \cdot 0.1 = 0.15$. Hence, on average, $0.15 \cdot 600 = 90$ spikes are expected to be generated during the 600 ms presentation time.
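The encoding step can be sketched as an independent Bernoulli draw per word and per millisecond. The function names, the clamping of the probability to 1, and the use of Python's `random` module are assumptions made for this illustration; the values of $\alpha$, $t_p$ and $\Delta t_p$ are those reported above.

```python
import random

ALPHA = 1.5         # proportionality coefficient
T_PRESENT_MS = 600  # presentation time t_p
T_GAP_MS = 300      # silent gap between consecutive documents


def spike_probability(weight, alpha=ALPHA):
    """Per-millisecond spike probability for one TF-IDF weight (clamped to 1 as a safeguard)."""
    return min(1.0, alpha * weight)


def expected_spike_count(weight, t_present_ms=T_PRESENT_MS):
    """Average number of spikes emitted during one presentation."""
    return spike_probability(weight) * t_present_ms


def encode_document(weights, rng, t_present_ms=T_PRESENT_MS):
    """Return one list of spike times (in ms) per word of the document vector.

    No spikes are generated in the following T_GAP_MS milliseconds, so a full
    document slot lasts T_PRESENT_MS + T_GAP_MS.
    """
    trains = []
    for weight in weights:
        p = spike_probability(weight)
        trains.append([t for t in range(t_present_ms) if rng.random() < p])
    return trains


rng = random.Random(0)
trains = encode_document([0.1, 0.0], rng)
```

For the weight 0.1 from the baseball example, `expected_spike_count(0.1)` evaluates to 90 (up to floating-point rounding), while a word with zero weight never spikes.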

Spiking encoder architecture and dynamics
A spiking encoder is the key element of the proposed method. The encoder, presented in Fig. 2, is a two-layered SNN equipped with an additional inhibitory neuron. The first layer contains $M$ neurons (denoted by blue circles) and each of them represents one word $t_j$ from the dictionary $T$. Neuron dynamics is defined by the spike trains generated based on weights $w_{ij}$ corresponding to documents $d_i$, $i = 1, \ldots, K$. Higher numbers of spikes are emitted by neurons representing words which are statistically more relevant for a particular document, according to the chosen TF-IDF measure. The spike trains for each neuron are presented in Fig. 2 as a row of short vertical lines.
In the brain, spikes are transmitted between neurons via synaptic connections. A neuron which generates a spike is called a presynaptic neuron, whilst a target neuron (spike receiver) is a postsynaptic neuron. In the proposed SNN architecture (cf. Fig. 2) two different types of synaptic connections are utilised: excitatory ones and inhibitory ones. Spikes transmitted through excitatory connections (denoted by green circles in Fig. 2) promote firing of the postsynaptic neuron, while impulses traveling through inhibitory ones (red circles in Fig. 2) hinder postsynaptic neuron activity. Each time an encoder neuron fires, its weights are modified according to the proposed learning rule. The neuron simultaneously sends an inhibition request signal to the inhibitory neuron and activates it. Then the inhibitory neuron suppresses the activity of all encoder output layer neurons using recurrent inhibitory connections (red circles). The proposed architecture satisfies the competitive learning paradigm [19] with a winner-takes-all (WTA) strategy.
In this work we consider a biologically plausible neuron model known as leaky integrate and fire (LIF) [11]. The dynamics of such a neuron is described in terms of changes of its membrane potential (MP). If the neuron is not receiving any spikes, its potential is close to the value of $u_{rest} = -65$ mV, known as the resting membrane potential. When the neuron receives spikes transmitted through excitatory synapses, the MP moves towards the excitatory equilibrium potential, $u_{exc} = 0$ mV. When many signals are simultaneously transmitted through excitatory synapses the MP rises and at some point can reach a threshold value of $u_{th} = -52$ mV, in which case the neuron fires. After firing, the neuron resets its MP to $u_{rest}$ and becomes inactive for $t_{ref} = 3$ ms (the refractory period). In the opposite scenario, when the neuron receives spikes through the inhibitory synapse, its MP moves towards the inhibitory equilibrium potential $u_{inh} = -90$ mV, i.e. further away from its threshold value, which decreases the chance of firing. The dynamics of the membrane potential $u$ in the LIF model is described by the following equation:
$$\tau \frac{du}{dt} = (u_{rest} - u) + g_e (u_{exc} - u) + g_i (u_{inh} - u),$$
where $g_e$ and $g_i$ denote excitatory and inhibitory conductance, respectively, and $\tau = 100$ ms is the membrane time constant. The values of $g_e$ and $g_i$ depend on presynaptic activity. Each time a signal is transmitted through the synapse the conductance is incremented by the value of the weight corresponding to that synapse, and decays with time afterwards according to equation (4):
$$\tau_e \frac{dg_e}{dt} = -g_e, \qquad \tau_i \frac{dg_i}{dt} = -g_i, \tag{4}$$
where $\tau_e = 2$ ms and $\tau_i = 2$ ms are decay time constants. In summary, if there is no presynaptic activity the MP converges to $u_{rest}$. Otherwise, its value changes according to the signals transmitted through the neuron synapses.
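The behaviour described in this paragraph can be reproduced with a small forward-Euler simulation of a single neuron. The sketch assumes the standard conductance-based LIF form, $\tau\, du/dt = (u_{rest} - u) + g_e (u_{exc} - u) + g_i (u_{inh} - u)$, with exponentially decaying conductances; the time step, the synaptic weights of 1.0 and the function name `simulate_lif` are illustrative assumptions and are not taken from the BRIAN 2 implementation used in the experiments.

```python
U_REST, U_EXC, U_INH, U_TH = -65.0, 0.0, -90.0, -52.0  # potentials in mV
TAU, TAU_E, TAU_I, T_REF = 100.0, 2.0, 2.0, 3.0        # time constants in ms


def simulate_lif(exc_times_ms, inh_times_ms, w_exc, w_inh, duration_ms, dt=0.1):
    """Simulate one LIF neuron; return (membrane potential trace, output spike times in ms)."""
    exc_steps = {int(round(t / dt)) for t in exc_times_ms}
    inh_steps = {int(round(t / dt)) for t in inh_times_ms}
    u, g_e, g_i = U_REST, 0.0, 0.0
    refractory_steps_left = 0
    trace, out_spikes = [], []

    for step in range(int(round(duration_ms / dt))):
        # An incoming spike increments the conductance by the synaptic weight.
        if step in exc_steps:
            g_e += w_exc
        if step in inh_steps:
            g_i += w_inh

        if refractory_steps_left > 0:
            refractory_steps_left -= 1  # neuron is inactive, potential stays at rest
        else:
            du = ((U_REST - u) + g_e * (U_EXC - u) + g_i * (U_INH - u)) / TAU
            u += dt * du
            if u >= U_TH:  # threshold crossed: emit a spike, reset, enter refractory period
                out_spikes.append(step * dt)
                u = U_REST
                refractory_steps_left = int(round(T_REF / dt))

        # Conductances decay exponentially between incoming spikes.
        g_e -= dt * g_e / TAU_E
        g_i -= dt * g_i / TAU_I
        trace.append(u)
    return trace, out_spikes


# One excitatory input spike per millisecond for 100 ms.
trace_exc, spikes_exc = simulate_lif(range(0, 100), [], 1.0, 1.0, 100.0)
# The same input rate delivered through an inhibitory synapse instead.
trace_inh, spikes_inh = simulate_lif([], range(0, 100), 1.0, 1.0, 100.0)
```

With sustained excitatory input the potential climbs from rest until it crosses the threshold and the neuron fires, whereas purely inhibitory input pushes the potential below $u_{rest}$ and no output spikes occur.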

Hebbian synaptic plasticity
We utilise a modified version of the Spike-Timing-Dependent Plasticity (STDP) learning process [33]. STDP is a biologically plausible unsupervised learning protocol belonging to the family of Hebbian learning (HL) methods [14]. In short, the STDP process results in an increase of the synaptic weight if the postsynaptic spike is observed soon after the presynaptic one ('pre-before-post'), and in a decrease of the synaptic weight in the opposite scenario ('post-before-pre'). The above learning scheme increases the relevance of those synaptic connections which contribute to the activation of the postsynaptic neuron, and decreases the importance of the ones which do not. We modify STDP in a manner similar to [8,29], i.e. by skipping the weight modification in the post-before-pre scenario and introducing an additional scaling mechanism. The plasticity of the excitatory synapse $s_{ij}$ connecting a presynaptic neuron $i$ from the input layer with a postsynaptic neuron $j$ from the encoder layer is governed by an update rule in which $\eta = 0.01$ is a small learning constant.
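The qualitative behaviour described above (potentiation only in the pre-before-post case, followed by a scaling step) can be illustrated with a generic trace-based sketch. This is not the update rule of this paper: the exponentially decaying presynaptic trace, its 20 ms time constant, the normalisation of the incoming weights to a fixed sum, and the function names are all assumptions chosen for illustration; only the learning constant $\eta = 0.01$ comes from the text.

```python
import math

ETA = 0.01  # learning constant from the text


def update_trace(trace, pre_spiked, dt_ms, tau_trace_ms=20.0):
    """Decay each presynaptic trace and bump it by 1 where the input neuron just spiked.

    The trace is a hypothetical device for measuring how recently each input fired.
    """
    decay = math.exp(-dt_ms / tau_trace_ms)
    return [x * decay + (1.0 if spiked else 0.0) for x, spiked in zip(trace, pre_spiked)]


def on_post_spike(weights, trace, eta=ETA, target_sum=1.0):
    """Update the incoming weights of one encoder neuron at the moment it fires.

    Inputs that fired recently (large trace) are potentiated; nothing is done in the
    post-before-pre case. The weights are then rescaled to a constant sum (an assumed
    form of synaptic scaling), so inputs that did not contribute lose relative strength.
    """
    potentiated = [w + eta * x for w, x in zip(weights, trace)]
    total = sum(potentiated)
    return [w * target_sum / total for w in potentiated]


trace = update_trace([0.0, 0.0, 0.0, 0.0], [True, False, False, False], dt_ms=1.0)
new_weights = on_post_spike([0.25, 0.25, 0.25, 0.25], trace)
```

In this toy run only the first input fired before the postsynaptic spike, so its weight grows while the remaining three shrink slightly through the rescaling step.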

Learning procedure
For a given data corpus (set of documents) $D$ the training procedure is performed as follows. Firstly, we divide $D$ into $s$ subsets $u_i$, $i = 1, \ldots, s$, in the manner described in Section 3.1. Secondly, each subset $u_i$ is transformed into spike trains and used as input for a separate SNN encoder $H_i$, $i = 1, \ldots, s$, composed of $N$ neurons. Please note that each encoder is trained with the use of one subset only. Such a training setup allows the data to be processed in parallel. Another advantage is that it limits the number of excitatory connections per neuron, which reduces computational complexity (the number of differential equations that need to be evaluated for each spike): during training, encoder $H_i$ is exposed only to the respective subset $T_i$ of the training set dictionary $T$, and the number of its excitatory connections is limited to $|T_i| < |T|$. Spike trains are presented to the network four times (in four training epochs). Once the learning process is completed, the connection pruning procedure is applied. Please observe that HL combined with competitive learning should lead to highly specialised neurons which are activated only for some subset of the inputs. The specialisation of a given neuron depends on the set of its connection weights. If the probability of firing is to be high for some particular subset of the inputs, the weights representing words from those inputs must be high. The other weights should be relatively low due to the synaptic scaling mechanism. Based on this assumption, after training, for each output layer neuron we prune $\theta$ per cent of its incoming connections with the lowest weights. $\theta$ is a hyperparameter of the method, empirically evaluated in the experimental section.
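The pruning step can be sketched for a single output neuron as follows. Representing a removed connection by a zero weight, the rounding down of the number of pruned connections, and the function name `prune_lowest` are assumptions made for this illustration.

```python
def prune_lowest(weights, theta_percent):
    """Zero out the theta_percent of incoming connections with the lowest weights.

    `weights` holds the incoming weights of one output-layer neuron and
    `theta_percent` is an integer percentage (e.g. 90).
    """
    n_prune = (len(weights) * theta_percent) // 100  # integer arithmetic avoids rounding surprises
    # Indices of the connections ordered from the weakest to the strongest.
    order = sorted(range(len(weights)), key=lambda j: weights[j])
    pruned = list(weights)
    for j in order[:n_prune]:
        pruned[j] = 0.0  # a zero weight stands in for a removed connection
    return pruned


neuron_weights = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05]
kept_10_percent = prune_lowest(neuron_weights, 90)
kept_50_percent = prune_lowest(neuron_weights, 50)
```

With $\theta = 90$ only the single strongest of the ten example connections survives, and with $\theta = 50$ the five strongest are retained.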

Empirical evaluation and results comparison
This section presents experimental evaluation of the method proposed. In subsection 3.1 the technical issues related to the setup of experiment and implementation of the training and evaluation procedures are discussed. The final two subsections focus respectively on the experimental results and compare them with the literature.

Data set and implementation details
The bydate version of 20 newsgroups is a well-known benchmark set in the text classification domain. The set contains newsgroup posts related to different categories (topics) gathered from Usenet, in which each category corresponds to one newsgroup. Categories are organised into a hierarchical structure with the main categories being computers, recreation and entertainment, science, religion, politics, and forsale. The corpus consists of 18 846 documents nearly equally distributed among twenty categories and explicitly divided into two subsets: the training one (60%) and the test one (40%).
The dynamics of the spiking neurons (including the plasticity mechanism) was implemented using the BRIAN 2 simulator [34]. The Scikit-learn Python library was used for processing the text and creating the TF-IDF matrix.

Training
As mentioned in Section 2.4, the training set was divided into $s = 11$ subsets $u_i$, each of which, except for $u_{11}$, contained 1 500 documents. The division was performed randomly. Firstly, the entire training set was shuffled, and then the documents were consecutively assigned to the subsets according to the resulting order, with a 500-document redundancy (overlap) between the neighbouring subsets, as described in Table 1.
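The division can be sketched as a sliding window over the shuffled document indices. The stride of 1 000 documents (subset size minus overlap), the random seed, and the function name are assumptions for this illustration; 11 314 is the size of the standard bydate training split, for which this scheme yields eleven subsets with a shorter final one.

```python
import random


def overlapping_subsets(n_docs, subset_size=1500, overlap=500, seed=0):
    """Shuffle document indices and cut them into consecutive, partially overlapping subsets."""
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)
    stride = subset_size - overlap  # each new subset starts 1000 documents later

    subsets = []
    start = 0
    while start < n_docs:
        subsets.append(order[start:start + subset_size])
        if start + subset_size >= n_docs:  # the last window reached the end of the data
            break
        start += stride
    return subsets


subsets = overlapping_subsets(11314)
sizes = [len(subset) for subset in subsets]
```

Under these assumptions the first ten subsets contain 1 500 documents each and the eleventh contains the remaining 1 314, with every pair of neighbouring subsets sharing 500 documents.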
The overlap between subsequent subsets resulted from preliminary experiments which suggested that such an approach improves classification accuracy. While we found the concept of partial data overlap to be reasonably efficient, it by no means should be regarded as an optimal choice. The optimal division of data into training subsets remains an open question and a subject of our future research.

Evaluation procedure
The outputs of all SNNs $H_i$, $i = 1, \ldots, s$, i.e. spike-based encodings represented as sums of spikes per document, were joined to form a single matrix (the final low-dimensional text representation), which was evaluated in the context of a classification task. The joined matrix of spike rates was used as an input to the Logistic Regression (LR) [15,17] classifier, with accuracy as the performance measure.
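The joining step amounts to a column-wise concatenation of the per-encoder spike count matrices. The sketch below assumes that every document is presented to every encoder and that each encoder returns one row of spike counts per document; the function name and the toy counts are hypothetical.

```python
def join_encodings(per_encoder_counts):
    """Concatenate the spike counts of all encoders document by document.

    `per_encoder_counts[h][d]` is the list of output spike counts produced by
    encoder h for document d. The result has one row per document and one
    column per output neuron across all encoders.
    """
    n_docs = len(per_encoder_counts[0])
    joined = []
    for d in range(n_docs):
        row = []
        for counts in per_encoder_counts:
            row.extend(counts[d])
        joined.append(row)
    return joined


# Two toy encoders with three output neurons each, evaluated on two documents.
encoder_a = [[4, 0, 1], [0, 7, 2]]
encoder_b = [[3, 3, 0], [1, 0, 9]]
features = join_encodings([encoder_a, encoder_b])
```

The width of the joined matrix is the number of encoders multiplied by the number of neurons per encoder; for instance, eleven encoders with 50 neurons each give a 550-dimensional representation, which is then passed to the classifier.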

Experimental results
In the first experiment we looked more closely at the weights of neurons after training and the relationship between the inhibition mechanism and the quality/efficacy of the resulting text representation. We trained eleven SNN encoders with 50 neurons each according to the procedure presented above. After training, 5 neurons from the first encoder ($H_1$) were randomly sampled and their weights were used for further analysis. Fig. 3 illustrates the highest 200 weights sorted in descending order. The weights of each neuron are presented with a different colour. The plots show that every neuron has a group of dominant connections represented by the weights with the highest values (the first several dozen connections). This means that each neuron will be activated more easily by the inputs that contain words corresponding to these weights. For example, neuron 4 will potentially produce more spikes for documents related to religion because its 10 highest weights correspond to the words 'jesus', 'god', 'paul', 'faith', 'law', 'christians', 'christ', 'sabbath', 'sin', 'jewish'. A different behaviour is expected from neuron 2, whose 10 highest weights correspond to the words 'drive', 'scsi', 'disk', 'hard', 'controller', 'ide', 'drives', 'help', 'mac', 'edu'. This one is more likely to be activated for computer-related documents. On the other hand, not all neurons can be classified so easily. For instance, the 10 highest weights of neuron 5 are linked to the words 'cs', 'serial', 'ac', 'edu', 'key', 'bit', 'university', 'windows', 'caronni', 'uk', hence the specialisation of this neuron is less obvious. We have repeated the above sampling and weight inspection procedure several times and the observations are qualitatively the same. To save space, we do not report them in detail.
Hence, a question arises as to how well documents can be encoded with the use of neurons trained in the manner described above. Intuitively, in practice the quality of encoding may be related to the level of competition amongst neurons in the evaluation phase. If the inhibition value is kept high enough to satisfy the WTA strategy, then only a few neurons will be activated and the others will be immediately suppressed. This scenario will lead to highly sparse representations of the input documents, with just a few or (in extreme cases) only one neuron dominating the rest. Since differences between documents belonging to different classes may be subtle, such a sparse representation may not be the optimal setup. In order to check the influence of the inhibition level on the resulting spike-based representation, we tested the performance of the trained SNNs $H_1$ to $H_{11}$ for various inhibition levels by adjusting the value of the neurons' inhibitory synapses. The results are illustrated in Fig. 4 (top). Clearly, the accuracy strongly depends on the inhibition level. The best outcomes (≈ 78%) are accomplished with inhibition set to 0, and the accuracy decreases rapidly as the inhibition increases. For inhibition values higher than 1.5 the accuracy plot enters a plateau at the level of approximately 68%. The results show that the most effective representation of documents is generated in the absence of inhibition during the evaluation phase, i.e. when all neurons have the same chance of being activated and of contributing to the document representation.
The second series of experiments aimed at exploring the relationship between the efficacy of document representation and the size of the encoders. Furthermore, the sensitivity of the trained encoders to connection pruning, with respect to their efficiency, was verified. The results of both experiments are shown in the bottom plot of Fig. 4. Seven encoders of various sizes (between 110 and 3 300 neurons) were trained, and once the training was completed the connection pruning procedure took place.
In the plot, four colored curves illustrate particular pruning scenarios and their impact on classification accuracy for various encoder sizes. 99%, 90%, 80%, and 50% of the weakest weights were respectively removed in the four discussed cases. Overall, for smaller SNN encoders (between 110 and 1 100 neurons) the accuracy rises rapidly along with the encoder size increase. For larger SNNs, changes in the accuracy are slower and for all four curves stay within the range [77.5%, 80.19%].
In terms of the connection pruning degree, the biggest changes in accuracy (between 63% and 79%) are observed when 99% of connections have been deleted (the red curve). In particular, the results of the encoders with fewer than 1 100 neurons demonstrate that this degree of pruning heavily affects classification accuracy. In larger networks additional neurons compensate for the features removed by the connection pruning mechanism and the results approach those of the other pruning setups.
Interestingly, for the networks smaller than 770 neurons, the differences in accuracy between the 50%, 80%, and 90% pruning setups are negligible, which suggests that relatively high redundancy of connections still exists in the networks pruned in the range of 50% to 80%. Apparently, retaining as few as 10% of the weights does not impact the quality of representation and does not cause deterioration of results. This result correlates well with the outcomes of the weight analysis reported above and confirms that a meaningful subset of connections is sufficient for proper encoding of the input. The best overall classification result (80.19%) was achieved by the SNN encoder with 2 200 neurons and the level of pruning set to 90% (the green curve). It shows that SET can effectively reduce the dimensionality of the text input from the initial ≈ 130 000 (the size of the 20 newsgroups training vocabulary) to a size of 550 to 2 200, and maintain classification accuracy above 77.5%.

Results analysis. Comparison with the literature.
Since, to our knowledge, this paper presents the first attempt to apply an SNN architecture to text classification, in order to make some comparisons we selected results reported for other neural networks trained with similar input (document-term matrix) and yielding a low-dimensional text representation as the output.
The results are presented in Table 2. SET achieved 80.19% accuracy and outperformed the remaining shallow approaches. While this result looks promising, we believe that there is still room for improvement with further tuning of the method (in particular the division of samples into training subsets), as well as extension of the SNN encoder by adding more layers. Another interesting direction would be to learn semantic relevance between different words and documents [10,41].

Table 2. Accuracy [%] comparison for several low-dimensional text representation methods on the bydate version of the 20 newsgroups data set.

Method                                                          Accuracy
SET (this paper)                                                80.19
K-competitive Autoencoder for TExt (KATE) [7]                   76.14
Class Preserving Restricted Boltzmann Machine (CPr-RBM) [16]    75.39
Variational Autoencoder [7]                                     74.30

Conclusions
This work offers a novel approach to text representation relying on Spiking Neural Networks. Using the proposed low-dimensional text representation the LR classifier accomplished 80.19% accuracy on a standard benchmark set (20 newsgroups bydate) which is a leading result among shallow approaches relying on low-dimensional representations. We have also examined the influence of the inhibition mechanism and synaptic connections sparsity on the quality of the representation showing that (i) it is recommended that inhibition be disabled during the SNN evaluation phase, and (ii) pruning out as many as 90% of connections with lowest weights did not affect the representation quality while heavily reducing the SNN computational complexity, i.e. the number of differential equations describing the network.
There are a few lines of potential improvement that we plan to explore in future work. Most notably, we aim to expand the SNN encoder towards a Deep SNN architecture by adding more layers of spiking neurons, which may allow the network to learn more detailed features of the input data.