Emotionally charged text classification with deep learning and sentiment semantic

Text classification is widely used across natural language processing tasks. State-of-the-art text classifiers use the vector space model to extract features. Recent deep models, in particular recurrent neural networks that preserve the positional relationship among words, achieve higher accuracy. To push text classification accuracy even higher, multi-dimensional document representations, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as a sequence of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can be easily attached as a fully connected layer to the word vectors to further improve classification accuracy. On the UCI sentiment labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, which is better than the 80.7% of the ridge regression classifier, the best among the classical techniques we tested. Additional sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique, the Naïve Bayes Bernoulli classifier, achieves an accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds that at 75% accuracy.


Introduction
Text classification is the task of organizing text documents into pre-defined categories [1]. It is an important aspect of data processing to make data usable by humans and is used in spam filtering [2], language identification [3], sentiment analysis [4] and many other areas. The Naïve Bayes classifier is a popular and effective algorithm for text classification [5][6][7]. Other general classifiers can also be adapted for text classification by using the vector space model [8]. Popular general classifiers include the support vector classifier [9] and the stochastic gradient descent classifier [10]. They work on the basis of finding a hyperplane that best separates data points from two classes. Their linear classifier variants work well on text documents and often perform better than Naïve Bayes. However, Naïve Bayes and the vector space model discard the position of words and cannot capture the relationship between words. While n-words or n-grams can be used, they cannot capture long-range relationships [11]. Latent semantic indexing solves the problem through the application of singular value decomposition [12].
Recent survey articles [13,14] suggest that deep neural networks play a vital role in recent language processing tasks. These methods have been successfully applied to many tasks such as sentiment analysis, spam filtering and marketing. This also leads us to use deep learning in our method. Recently, Jiang et al. [15] proposed a focal loss-based method for sentiment analysis. Chatterjee et al. [16] extend a similar task to a big data framework. Lu et al. use a bidirectional LSTM [17]. A semantic-based feature selection model is proposed in [2]. Associative rule-based systems have also achieved state-of-the-art accuracy in sentiment classification [7]. It has been noted that new distance measures can improve classification models [8].
State-of-the-art techniques treat text as one-dimensional, either as word-probability pairs in Naïve Bayes or as a document vector (Fig. 1, existing method). To push text classification accuracy even higher, two-dimensional document representations, such as vector sequences or matrices, should be explored (Fig. 1, proposed method). Many text classification algorithms operate on document vectors. The dimension of the vector needs to be kept low for better generalization. Converting a document to fit a single low-dimensional document vector necessitates discarding much data. Positions, plurality and tenses of words are often omitted.
We seek to overcome these limitations by representing a document as a sequence of vectors instead. The state-of-the-art recurrent neural network, long short-term memory (LSTM) [18], will be used to build the classifier. Using a sequence of vectors provides the flexibility to preserve as much information from the original text as possible, and to incorporate information from external databases to supplement the text. We incorporate semantic information from pre-trained GloVe vectors [19], and sentiment information from SentiWordNet [20], to mirror the way humans comprehend text. A person goes through formal education to learn the meaning of words. When he/she reads a document, these learned meanings enable comprehension.
The overall aim is to reach a classification accuracy beyond what the classic approaches are able to achieve. Classic approaches include Naïve Bayes, support vector, stochastic gradient descent, passive-aggressive, k-nearest neighbour, Rocchio and ridge regression classifiers. We show experimentally that attaching a sentiment vector to the representation of emotionally charged text yields better accuracy than these state-of-the-art text classifiers.

Background
The Naïve Bayes classifiers use the Bayes rule to compute the probability that a document belongs to a class. They assume that the occurrence of each word is independent of other words. They compute the probability $P(C_k \mid w_1, \ldots, w_n) \propto P(C_k) \prod_{i=1}^{n} P(w_i \mid C_k)$, where $w_i$ is the $i$th word in the dictionary. Since multiplication is commutative, word positions do not matter. A document can be seen as a set of words, thus one-dimensional.
Support vector, stochastic gradient descent (SGD) and passive-aggressive classifiers find the best separating hyperplanes. A document thus has to be represented as a point in multi-dimensional space (i.e. a vector) using these steps:
1. The first step is to tokenize the document. A document can be seen in its entirety, or as its sections, paragraphs, sentences, words or characters. Since words are the smallest units that have meaning, most classification works on the word level. For English documents, one can extract words by breaking the document at every occurrence of space, comma, period and other word boundaries. Some tokenizers use more complex algorithms to preserve hyphenated words, such as "ice-cream", abbreviations, such as "don't" and "U.S.", and non-words, such as "1/2", "john@example.com" and "http://example.com". Some tokenizers preserve punctuation as well. English tokenizers include the StandardAnalyzer in the Lucene library [21] and the PTBTokenizer in the Stanford Natural Language Processing library [22].
2. Stemming is then performed to turn the words into their root forms, so as to discard plurality and tense. Stemming helps reduce the dimension of the document vector. Stemming may use heuristics, such as in the Porter Stemmer [23], or morphology, such as in the Morphy class in the Stanford Natural Language Processing library [24].
3. A dictionary is constructed by listing the unique words in all the documents. A document is then represented by a vector of dimension equal to the size of the dictionary. The $n$th element in the vector corresponds to the $n$th word in the dictionary. In the bag-of-words (BoW) model, each vector element stores the term frequency, which is the number of occurrences of the corresponding word in the document. However, it assigns large values to frequent words, such as "a" and "the", that do not carry much information, skewing distance calculations between vectors. The problem can be solved by dividing the term frequency by the document frequency, where the document frequency is the number of documents containing the word. This yields the term frequency-inverse document frequency (tf·idf) statistic.
Using these vector space models (BoW and tf·idf) discards the position of the words in the document.
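The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the experiments: the regular-expression tokenizer is a stand-in for the PTBTokenizer, stemming is omitted, the two example sentences are invented, and all function names are hypothetical. The tf·idf weight follows the definition given in the text (term frequency divided by document frequency).

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it at non-alphanumeric boundaries."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(token_lists):
    """List the unique words across all documents, in sorted order."""
    return sorted({tok for toks in token_lists for tok in toks})

def bow_vector(tokens, dictionary):
    """Bag-of-words vector: element n counts the n-th dictionary word."""
    counts = Counter(tokens)
    return [counts[word] for word in dictionary]

def tfidf_vectors(token_lists, dictionary):
    """Divide each term frequency by the word's document frequency."""
    doc_freq = {w: sum(1 for toks in token_lists if w in toks) for w in dictionary}
    return [
        [tf / doc_freq[w] for tf, w in zip(bow_vector(toks, dictionary), dictionary)]
        for toks in token_lists
    ]

docs = ["The movie was good.", "The food was bad, the service was bad."]
token_lists = [tokenize(d) for d in docs]
dictionary = build_dictionary(token_lists)
bow = [bow_vector(toks, dictionary) for toks in token_lists]
tfidf = tfidf_vectors(token_lists, dictionary)
```

In this toy corpus, "the" and "was" occur in both documents, so their weights are halved relative to the raw counts, while words unique to one document keep their term frequency.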
The k-nearest neighbour classifier also uses the vector space model. It classifies a given point as the most frequently occurring class among its k closest neighbours [25].
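As an illustration of the majority vote described above, the following sketch classifies a point from a handful of labelled two-dimensional vectors. The training data, the use of Euclidean distance and the function names are assumptions made for this example only.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(point, training, k):
    """Return the most frequent label among the k closest training vectors.

    `training` is a list of (vector, label) pairs.
    """
    nearest = sorted(training, key=lambda pair: euclidean(point, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labelled points (invented for illustration).
training = [
    ([0.0, 0.0], "neg"), ([0.1, 0.2], "neg"),
    ([1.0, 1.0], "pos"), ([0.9, 1.1], "pos"), ([1.2, 0.8], "pos"),
]
prediction = knn_classify([0.8, 0.9], training, k=3)
```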
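The Rocchio classifier is the nearest centroid classifier with a special vector representation of documents. For each class, the centroid (mean) of the labelled data points is computed. A data point is then classified as the class with the nearest centroid to the point [26].
The ridge regression classifier uses ridge regression as a classifier. In standard regression, the error function is, up to a constant scaling factor, $J_n(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$, where $\mathbf{x}_i$ is the $i$th training vector and $y_i$ is its target. Ridge regression adds a regularization term that penalizes nonzero weights in proportion to a shrinkage coefficient $\lambda$, giving $J_n(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \lVert \mathbf{w} \rVert^2$. The proposed classification is demonstrated in Fig. 1.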
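The effect of the shrinkage coefficient can be seen in a one-feature sketch. The code below assumes an unscaled sum-of-squares error, a single weight with no intercept, class labels encoded as -1 and +1, and invented data; the function names are hypothetical. Under these assumptions the minimizer has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \text{shrinkage})$.

```python
def ridge_cost(w, xs, ys, shrinkage):
    """Sum of squared errors plus a penalty proportional to w squared."""
    squared_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return squared_error + shrinkage * w ** 2

def ridge_fit_1d(xs, ys, shrinkage):
    """Closed-form minimizer of ridge_cost for a single weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + shrinkage)

def ridge_predict(w, x):
    """Classify by the sign of the regression output."""
    return 1 if w * x >= 0 else -1

xs = [-2.0, -1.0, 1.0, 2.0]   # one feature per sample (toy data)
ys = [-1.0, -1.0, 1.0, 1.0]   # class labels encoded as -1 / +1
w_plain = ridge_fit_1d(xs, ys, shrinkage=0.0)
w_ridge = ridge_fit_1d(xs, ys, shrinkage=2.0)
```

The regularized weight is smaller in magnitude than the unregularized one, yet both give the same sign-based predictions on this data.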

Proposed method
Here, we propose a text classifier that uses a dual modality of information extraction and a long short-term memory recurrent neural network (LSTM) for the classification. First, word embedding features are extracted from a pre-trained model. Next, the emotion of the text is extracted from a sentiment network. Finally, the features are combined to classify the text. An LSTM is a type of artificial neural network with self-connections and nodes made up of gated memory blocks. The proposed method is depicted in Fig. 2.
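One way to picture the combined representation is as a sequence with one vector per token, where each vector is the word embedding followed by the sentiment scores. The sketch below assumes per-token concatenation and zero vectors for tokens missing from either lookup table; both choices, the toy two-dimensional embeddings and the function name are assumptions for illustration rather than details confirmed by the text.

```python
def document_to_sequence(tokens, word_vectors, sentiment_scores, word_dim, n_scores=3):
    """Build one vector per token: word embedding followed by sentiment scores.

    Tokens missing from either table fall back to zeros (an assumption).
    """
    sequence = []
    for token in tokens:
        word_vec = word_vectors.get(token, [0.0] * word_dim)
        scores = sentiment_scores.get(token, [0.0] * n_scores)
        sequence.append(list(word_vec) + list(scores))
    return sequence

# Toy lookup tables (invented values, 2-dimensional embeddings).
toy_word_vectors = {"good": [0.2, 0.7], "movie": [0.5, -0.1]}
toy_sentiment = {"good": [0.6, 0.0, 0.4]}  # positivity, negativity, objectivity

sequence = document_to_sequence(
    ["good", "movie", "xyz"], toy_word_vectors, toy_sentiment, word_dim=2
)
```

With 300-dimensional embeddings and three sentiment scores, each time step under this assumption would carry 303 values.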

LSTM-based classifier with sentiment data
A long short-term memory neural network (LSTM) is an RNN with memory cells and gate units [27]. The recurrent network used in the proposed model is presented in Fig. 3. A node in the hidden layer is in itself a network that consists of an input gate, an output gate, a forget gate, a memory cell and a self-recurrent connection on the memory cell (Fig. 4).
LSTM solves the exploding gradient problem by truncating the gradient computation. It solves the vanishing gradient problem by having the memory cell repeat its state across time, and by turning the input and output gates on and off to allow or prevent modifications to the memory cell. Several variants of LSTM exist; some have additional features such as peephole connections, while others have fewer, for example by removing the output activation function. Klaus Greff et al. found that removing peephole connections and full gate recurrence simplifies computation without affecting performance [28].
Overfitting in LSTM is usually overcome by using dropout. Wojciech Zaremba et al. found that in RNNs, including LSTMs, dropout works best when applied only to non-recurrent connections [29].
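The values in the node are computed as follows:
Block input: $z^t = g(W_z x^t + R_z y^{t-1} + b_z)$
Input gate: $i^t = \sigma(W_i x^t + R_i y^{t-1} + b_i)$
Forget gate: $f^t = \sigma(W_f x^t + R_f y^{t-1} + b_f)$
Cell state: $c^t = z^t \odot i^t + c^{t-1} \odot f^t$
Output gate: $o^t = \sigma(W_o x^t + R_o y^{t-1} + b_o)$
Block output: $y^t = h(c^t) \odot o^t$
where $x^t$ is the input at time $t$, $W$ are rectangular input weight matrices, $R$ are square recurrent weight matrices, and $b$ are bias vectors. Functions $\sigma$, $g$ and $h$ are pointwise nonlinear activation functions, and $\odot$ is pointwise multiplication of two vectors. The sigmoid function $\sigma$ is defined as $\sigma(t) = \frac{1}{1 + e^{-t}}$. The hyperbolic tangent (tanh) function is defined as $\tanh(t) = \frac{1 - e^{-2t}}{1 + e^{-2t}}$. The activation function is demonstrated in Fig. 5.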
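A single time step of such a memory block can be sketched in plain Python. The sketch assumes the common formulation without peephole connections, with tanh used for both g and h; the parameter dictionary layout, the all-zero toy weights and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def affine(W, x, R, y_prev, b):
    """Compute W x + R y_prev + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(r * yi for r, yi in zip(R_row, y_prev))
        + bias
        for W_row, R_row, bias in zip(W, R, b)
    ]

def lstm_step(x, y_prev, c_prev, p):
    """One LSTM time step; returns the block output and the new cell state."""
    def unit(name, activation):
        pre = affine(p["W" + name], x, p["R" + name], y_prev, p["b" + name])
        return [activation(a) for a in pre]

    z = unit("z", math.tanh)  # block input
    i = unit("i", sigmoid)    # input gate
    f = unit("f", sigmoid)    # forget gate
    o = unit("o", sigmoid)    # output gate
    c = [zk * ik + ck * fk for zk, ik, ck, fk in zip(z, i, c_prev, f)]
    y = [math.tanh(ck) * ok for ck, ok in zip(c, o)]
    return y, c

# Toy block: 2 inputs, 1 hidden unit, all weights and biases zero.
params = {}
for name in "zifo":
    params["W" + name] = [[0.0, 0.0]]
    params["R" + name] = [[0.0]]
    params["b" + name] = [0.0]

y, c = lstm_step(x=[1.0, -1.0], y_prev=[0.0], c_prev=[1.0], p=params)
```

With zero weights every gate evaluates to 0.5 and the block input to 0, so the cell simply keeps half of its previous state; this shows how the forget gate scales the memory that is carried forward.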
Sequence and non-sequence data have to be processed by different types of nodes. To process sequence data consisting of GloVe vectors and/or SentiWordNet scores, one hidden layer of 120 LSTM nodes is used (Figs. 6 and 7). The memory block is illustrated in Fig. 4. Gradients above 100 are clipped to prevent exploding gradients. To prevent overfitting, the input layer is first fed into a dropout layer that sets each value to zero with 50% probability and rescales the remaining inputs with $\text{output} = \frac{\text{input}}{1 - 0.5}$. The dropout layer is turned off during cross-validation by setting the dropout probability to zero. To process non-sequence valence-arousal data, one hidden layer of 3 nodes with tanh activation is used. The outputs from the LSTM and tanh nodes are then concatenated via the concat layer, before connecting to the output layer. The learning rate is initialized at 0.01, and the optimization algorithm is set to AdaGrad. The Lasagne library is used to create the architecture [30]. It abstracts the layers and encapsulates the mathematical computations of the neural network. Lasagne is in turn built on Theano, a library that uses the GPU or CPU to accelerate computations [31].
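The dropout rescaling and the gradient clipping mentioned above can be illustrated with a short sketch. It is written in plain Python rather than with Lasagne, and it assumes element-wise clipping to the range [-100, 100]; the function names are hypothetical.

```python
import random

def dropout(values, p, rng):
    """Zero each value with probability p and rescale survivors by 1 / (1 - p).

    With p = 0 the layer is effectively turned off and values pass through.
    """
    if p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

def clip_gradients(gradients, limit):
    """Clamp every gradient component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]

rng = random.Random(0)
inputs = [1.0, 2.0, 3.0, 4.0]
training_output = dropout(inputs, p=0.5, rng=rng)    # each entry is 0 or doubled
validation_output = dropout(inputs, p=0.0, rng=rng)  # unchanged
clipped = clip_gradients([150.0, -250.0, 3.0], limit=100.0)
```

Rescaling the surviving inputs keeps the expected value of each input the same during training as when dropout is disabled.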
Semantic information: Each document is tokenized using the PTBTokenizer in the Stanford Natural Language Processing library. The GloVe vectors and SentiWordNet scores are then retrieved for the tokens. To compare the effect of having semantic and sentiment information against not having them, we compare against one-hot encoding of the tokens. GloVe performs unsupervised learning on documents to obtain word-word co-occurrence statistics and represents them in vectors [19]. As a result, words that are similar in meaning are represented by vectors that are a small distance from each other. Co-occurrence probabilities can be modelled in the general form using equation (1):
$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$ (1)
where $w \in \mathbb{R}^d$ are the word vectors, $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors, and $P_{ij}$ is the probability that word $j$ appears in the context of word $i$.
The word vectors are computed by minimizing the weighted least squares regression model $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$, where $V$ is the size of the vocabulary, $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $f$ is the weighting function, $b$ is the bias for $w$, and $\tilde{b}$ is the bias for $\tilde{w}$. As a neural network is used to solve this model, different random initializations would yield different results. To achieve consistent performance, separate context word vectors $\tilde{w}$ are trained on the same neural network but with different random initializations.
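To make the weighted least squares objective concrete, the sketch below evaluates it for a tiny co-occurrence matrix. It assumes the weighting function from the GloVe paper [19], $f(x) = (x / x_{max})^{\alpha}$ for $x < x_{max}$ and 1 otherwise, with $x_{max} = 100$ and $\alpha = 0.75$; these defaults, the toy matrix and the function names are not taken from this document. Pairs with zero co-occurrence are skipped because their weight is zero.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting: grows with the count, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, w, w_ctx, b, b_ctx):
    """Weighted squared error between vector dot products and log counts."""
    cost = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # zero counts have zero weight and undefined log
                dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
                error = dot + b[i] + b_ctx[j] - math.log(x_ij)
                cost += weight(x_ij) * error ** 2
    return cost

# Toy vocabulary of two words that co-occur 100 times with each other.
X = [[0.0, 100.0], [100.0, 0.0]]
zero_vectors = [[0.0, 0.0], [0.0, 0.0]]
cost_untrained = glove_cost(X, zero_vectors, zero_vectors, [0.0, 0.0], [0.0, 0.0])
cost_fitted = glove_cost(
    X, zero_vectors, zero_vectors, [math.log(100.0)] * 2, [0.0, 0.0]
)
```

In the second call the biases alone reproduce the log counts, so the cost drops to zero; in practice the word vectors and biases are learned jointly.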
To measure distance in text classification, cosine distance is often used. Cosine distance is unaffected by the magnitude of the vector, unlike Euclidean distance. In the bag-of-words and tf·idf models, for instance, the magnitude can be doubled by appending a duplicate of a document to itself (Fig. 8). The document with double the magnitude has exactly the same content as the original document, albeit duplicated, and thus should be regarded as similar.
The cosine distance between two vectors $u$ and $v$ is defined as $d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. To illustrate that words with similar meanings have closer vectors, we compute the cosine distance between "water", "ice", "learn" and "educate" (Table 1). The distance between "water" and "ice" and between "learn" and "educate" is lower than for the other pairs.
We used pre-trained 300-dimensional GloVe vectors trained on Wikipedia 2014 and Gigaword 5 [32]. We also tested the 50-dimensional vectors, but found them ineffective. This set of pre-trained vectors contains 6 billion tokens, of which 400 thousand form the lower-case vocabulary. The rest of the tokens are punctuation, numbers, dates, email addresses and others. It is 989 MiB uncompressed.
For one-hot encoding, a dictionary is constructed by listing the unique words in all the documents. A word is then represented by a vector of dimension equal to the size of the dictionary. A word that is at the $n$th position in the dictionary is represented by a vector with one at the $n$th element and zero at all other elements.
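The difference between the two distance measures can be checked numerically on the Fig. 8 example. The sketch below uses bag-of-words counts over the two-word dictionary (A, B) and also includes a small one-hot helper matching the encoding described above; the function names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def one_hot(word, dictionary):
    """Vector with one at the word's dictionary position, zero elsewhere."""
    return [1 if entry == word else 0 for entry in dictionary]

# Bag-of-words counts over the dictionary (A, B).
ba = [1, 1]     # document "B A"
bab = [1, 2]    # document "B A B"
baba = [2, 2]   # document "B A B A"

euclid_to_bab = euclidean_distance(baba, bab)
euclid_to_ba = euclidean_distance(baba, ba)
cosine_to_bab = cosine_distance(baba, bab)
cosine_to_ba = cosine_distance(baba, ba)
```

Euclidean distance places "B A B A" closer to "B A B", whereas cosine distance places it closest to "B A", since the two count vectors point in the same direction.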
Sentiment information: We are interested in classifying emotionally charged documents. Intuitively, additional sentiment information should help classification. We used SentiWordNet to provide positivity, negativity and objectivity scores. For our smaller dataset, we also had 12 healthy participants rate the valence and arousal (approximating the circumplex model of affect) of the documents.
WordNet is a database that groups words into sets of synonyms [33]. Similarity between words can be inferred from their links to the synonym sets and the number of linkages between the synonym sets. SentiWordNet is built on top of WordNet and its similarity relationship [34]. It adds positivity, negativity and objectivity scores to each synonym set. The subjectivity of a word can thus be found by identifying the synonym set it belongs to and looking up the scores of that set. The list of synonym sets for a word such as "good" contains several meanings, such as goodness and product, and parts of speech, such as noun (n), adjective (a), adjective satellite (s) and adverb (r). Of these synonym sets, SentiWordNet labels a subset with sentiment scores; the synonym sets in this subset are carefully chosen and then scored manually. A random walk algorithm that visits the links between synonym sets propagates these scores to other synonym sets.
We encountered an issue obtaining the SentiWordNet scores. A word can have many meanings and thus belong to multiple synonym sets in WordNet. Disambiguating among the different synonym sets is hard, even by hand. As a shortcut, we obtained positivity, negativity and objectivity scores by taking the average score over all synonym sets of a word.
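The averaging shortcut can be sketched as follows. The synonym-set scores in the example are invented placeholders, not values from SentiWordNet, and the function name and the (positivity, negativity, objectivity) tuple layout are assumptions for illustration.

```python
def average_scores(synset_scores):
    """Average (positivity, negativity, objectivity) over a word's synonym sets.

    Returns None when the word has no synonym sets, i.e. the lookup failed.
    """
    if not synset_scores:
        return None
    n = len(synset_scores)
    return tuple(sum(scores[k] for scores in synset_scores) / n for k in range(3))

# Hypothetical scores for four synonym sets of one word (placeholder values).
toy_synsets = [
    (0.5, 0.0, 0.5),
    (0.75, 0.0, 0.25),
    (0.0, 0.25, 0.75),
    (0.75, 0.0, 0.25),
]
word_scores = average_scores(toy_synsets)
```

Because each synonym set's three scores sum to one, the averaged scores also sum to one, so the result remains a valid positivity, negativity and objectivity triple.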

Circumplex model of affect:
The circumplex model of affect proposes that all affective states emerge from cognitive interpretations of neural sensations from two independent neurophysiological systems [35]. One system causes valence, while the other causes activation. It places emotions in a two-dimensional circular space, with one dimension indicating the level of valence and another the level of arousal (Fig. 9). The neutral point is at the centre of the circular space.
To simplify survey design, we collected valence and arousal ratings separately, with each scale ranging from -5 to 5. This approximation avoids having to design a custom input that presents the circular scale of the model.

UCI sentiment labelled sentences
The University of California, Irvine Sentiment Labelled Sentences dataset consists of positive and negative labelled sentences taken from three websites: imdb.com, amazon.com and yelp.com [36]. For each website, there are 500 positive sentences (labelled as 1) and 500 negative sentences (labelled as 0).
Parameters: Documents were tokenized into words using the PTBTokenizer. To obtain document vectors, the tokens were stemmed using the Porter stemmer, and then the tf·idf statistic was computed. To obtain one-hot vectors, the tokens were similarly stemmed. The parameters used for the classifiers were:
K-nearest neighbour: 10 nearest neighbours
Passive aggressive: trained for 50 iterations
SGD: trained for 50 iterations
LSTM: trained for 6 iterations
Spell correction for pre-trained GloVe and SentiWordNet: This dataset has not been pre-processed and contains several misspellings and missing spaces around punctuation. Missing spaces result in the PTBTokenizer not splitting some text into separate tokens. Incorrect tokenization and misspelling cause lookups of the pre-trained GloVe vectors and the SentiWordNet scores to fail. We used Aspell 0.60.6.1 in bad-spellers mode and Algorithm 1 to correct the problems. A similar procedure is repeated to retrieve SentiWordNet scores, with the lookup in pre-trained GloVe substituted by a lookup in SentiWordNet.

Results
Fourfold cross-validation is used to measure the accuracy. To reduce the effects of random initialization on accuracy, each algorithm is run twice, and the average accuracy is reported in Table 2. The sentences were classified into two classes.
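The evaluation protocol can be sketched as a generic k-fold loop. The code below is an assumption-laden illustration: the shuffling, the fold assignment and the `train_and_score` callback (which would train a classifier on the training indices and return its accuracy on the held-out indices) are hypothetical and do not reproduce the experimental code.

```python
import random

def k_fold_indices(n_items, k, seed):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_items, k, train_and_score, seed=0):
    """Average the score over k folds, holding out one fold at a time."""
    scores = []
    for test_idx in k_fold_indices(n_items, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in held_out]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)

folds = k_fold_indices(12, k=4, seed=0)

# Placeholder scorer: reports the fraction of items used for training.
train_fraction = cross_validate(12, 4, lambda train, test: len(train) / 12)
```

Repeating the call with different seeds and averaging the results corresponds to running each algorithm more than once, and setting k equal to the number of items gives leave-one-out cross-validation.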

Discussion
The best traditional classifier for this dataset is the ridge regression classifier, giving an accuracy of 80.7%. Using LSTM with one-hot vectors matched that accuracy at 80.6%, showing that the LSTM managed to find patterns in the very sparse sequence of vectors. Using LSTM with SentiWordNet scores achieved 69.4% accuracy, better than pure chance of 50%, showing that SentiWordNet scores do contain information. Using LSTM with GloVe vectors resulted in better accuracy, at 85.3%, than the best traditional classifier. Comparing with one-hot vectors, we conclude that the semantic information that GloVe vectors contain helped classification. The addition of SentiWordNet scores further improved accuracy to 85.8%. We can thus infer that semantic and sentiment information improved our classifier.

Notes dataset
Four classes of 20 notes each are used in the experiment, totalling 80 notes. The 20 suicide notes and 20 hoax notes were obtained from Cincinnati Hospital Medical Centre. The hospital de-identified the notes to protect the identity of the patients and the deceased before allowing us access to them. We consider suicide notes to be written by people who died in their suicide attempt, and hoax notes to be written by people who did not. The 20 positive and 20 neutral notes were chosen based on ratings by 12 undergraduates in a pre-study. This dataset was pre-processed to correct spelling, punctuation and spacing inconsistencies.
In addition to the notes, we collected valence-arousal (VA) ratings for each note. Twelve healthy participants were each asked to rate a subset of 20 notes on the valence and arousal scales. The notes were arranged in random order. Each scale ranges from -5 to 5 to simplify data input, yielding an approximation of the circumplex model. The subsets were selected such that each note would have 3 ratings (12 participants × 20 notes = 80 notes × 3 ratings).

Parameters
Documents were tokenized into words using the PTBTokenizer. To obtain document vectors, the tokens were stemmed using the Porter stemmer, and then the tf·idf statistic was computed. To obtain one-hot vectors, the tokens were similarly stemmed. The parameters used for the classifiers were:
K-nearest neighbour: 10 nearest neighbours
Passive aggressive: trained for 50 iterations
SGD: trained for 50 iterations
LSTM: trained for 28 iterations

Results
Given the small dataset, accuracy fluctuated a lot with k-fold validation. Thus, we used leave-one-out cross-validation to measure the accuracy of the classifiers. To reduce the effects of random initialization, each algorithm is run 3 times and the average accuracy is reported in Tables 3 and 4. The notes were classified into four classes.

Discussion
The best performing traditional classifier on our dataset is the Naïve Bayes Bernoulli classifier, with an accuracy of 71.3%.
In the LSTM using SentiWordNet scores, it can be seen that emotional scores alone, without any word vectors, do carry enough information to classify with 38.3% accuracy, better than pure chance of 25%. The addition of SentiWordNet scores to the LSTM using GloVe vectors increases accuracy to 72.1%. Further addition of valence-arousal ratings did not improve accuracy, suggesting that SentiWordNet and valence-arousal provided similar information. Interestingly, using just valence-arousal ratings with GloVe vectors yields a better result of 73.8%. A simpler network might have enabled the neural network to learn to leverage the information from the ratings better. Presenting the valence-arousal ratings first and appending SentiWordNet scores both increased accuracy, with the former being more effective. The accuracies of 74.2% and 72.1% both outperformed the best traditional classifier.
However, presenting the valence-arousal ratings last reduced accuracy instead (from 70.4% to 67.5%, and from 72.1% to 68.8%). Looking at the value produced by the loss function as training progresses, we observed that the loss decreases more slowly and plateaus at a higher value than when the valence-arousal ratings are presented first. We suspect this is because the 300-dimensional GloVe vectors are 3 time steps further away from the output, so the LSTM had to assign several nodes to the role of shutting off the input gates of other nodes to preserve the values in the memory cells. This left fewer nodes for learning the pattern between the input and the output.
Presenting the valence-arousal ratings first and appending SentiWordNet scores at the same time achieved the best accuracy, yielding 75%. The intuition is that our documents are emotionally charged, and thus attaching emotional scores moves documents from different classes further from each other, making classification easier.

Conclusion
This research aims to improve classification beyond what classic text classification algorithms offer. It achieved the goal by breaking the tradition of treating a document as a vector. Instead, each document was represented as a sequence of vectors. The main novelty of our work is to use word-based and sentiment-based encoding for deep learning-based text classification. Doing so preserved the position of words in the document, while giving the flexibility of incorporating semantic information from GloVe vectors, valence-arousal ratings and sentiment information from SentiWordNet. Solving the problem might involve using non-homogeneous nodes in a layer, using skip-layers, or changing the LSTM memory block. We would also like to investigate what other kinds of data can be incorporated into our model.
The main drawback of the proposed system is the sparse representation of the words. Sometimes, this comes with information loss and poor performance. Our method can be improved by utilizing a more powerful sentiment vectorization method and more advanced classifiers.

Fig. 1 Demonstration of the proposed solution. We add a sentiment vector to the word vector

Fig. 3 a A fully connected RNN used as the foundation of the proposed method. b The proposed model uses RNN as an infinitely deep feedforward neural network

Fig. 7 LSTM architecture used in the experiment, illustrated with the recurrent connections unrolled into an indefinitely deep feedforward network

Fig. 8 Euclidean distance would consider the document containing the words B A B A as closer to the document B A B than to B A. However, intuitively, appending a document to itself does not fundamentally change it, and thus B A B A should be more similar to B A than to B A B. Cosine distance (the angular difference between the vectors) conforms to this intuition

Table 1 Cosine distance between GloVe vectors of water, ice, learn and educate

Table 2 Classification accuracy on UCI sentiment labelled sentences