The proposed SA approach is based on Text Categorization (TC). TC is the task of assigning predefined categories to text documents based on the analysis of their content. In the special case of SA, we are interested in classifying a text that expresses opinions as positive, negative or neutral.
So far, several methods based on machine learning have been defined to automate this task [24]. Using a set of documents with assigned class labels, such methods are able to create a model that can predict the classes of arbitrary documents. More formally, let D = {d1, ⋯, dn} be a set of documents to be classified and C = {c1, ⋯, cm} a set of classes; TC aims to define a classification function Φ such that [25]:
$$ \Phi : D\times C\to \left\{T,F\right\},\quad \left({d}_i,{c}_j\right)\mapsto \Phi \left({d}_i,{c}_j\right) $$
(1)
which assigns the value T (true) if di ∈ D is classified in cj ∈ C, F (false) otherwise. In the case of SA, three classes are considered, corresponding to a positive, negative or neutral sentiment.
Apart from some generic n-gram-based approaches [26], most TC methods are language dependent, which means that several trained models are needed to classify texts expressed in different languages. The proposed tool is able to classify messages in English and Italian. For this purpose, two trained classifiers (based on the same model) have been provided, and a Language Detection step has been added to switch from one to the other. This step is described in Section 3.1 along with the other preprocessing steps, including Word Vectorization.
Once the preprocessing steps have been performed, the input document is processed by the right instance (English or Italian) of the classifier, which assigns it a single class together with a confidence score. The proposed classifier is based on a Hierarchical Attention Network (HAN) [11], described in Section 3.2, and has been pre-trained on a bilingual dataset as described in Section 4. To improve the trained models even during system operation, the system allows CRM operators to provide feedback on the classifier output, i.e., to propose an alternative class when the inferred one is considered incorrect. The feedback is then used for model improvement as described in Section 3.3.
It should be noted that, as for the other methods analyzed in Section 2, the proposed TC-based approach assigns polarity to a whole document (customer request), i.e., if a document shows different polarities (e.g. with respect to multiple aspects), only the dominant one is detected.
Text preprocessing
The preprocessing step aims to perform preliminary operations on the input documents and transform them into a format that can be accepted by the classifier. In the proposed approach this includes text segmentation, language detection and word vectorization. During Text Segmentation, the input document is split into a sequence of tokens: syntactically atomic linguistic units representing words and punctuation. It is a relatively simple task performed with regular expressions, as sketched below.
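A minimal sketch of such a regex-based tokenizer follows; the actual pattern used by the system is not reported here, so the expression is purely illustrative:

```python
import re

# Illustrative pattern only: a "word" is a maximal run of word characters
# (including accented letters, as needed for Italian); any other
# non-whitespace character (punctuation) becomes a separate token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split a document into a sequence of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Il servizio è ottimo, grazie!"))
# ['Il', 'servizio', 'è', 'ottimo', ',', 'grazie', '!']
```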
Then, a Language Detection (LD) step is performed to identify which natural language a text is written in. Most approaches to this problem treat LD as a special case of text categorization. There are several software libraries for LD, based on different statistical methods. Among these, our system adopts LangID, a Python module based on a naïve Bayesian classifier capable of distinguishing between 97 languages. According to [27], this library achieves better accuracy than comparable approaches.
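Typical usage of the library is sketched below; restricting the candidate set to the two languages supported by the system is our illustrative choice, not a detail taken from the text above:

```python
import langid

# Restrict the detector to the two languages handled by the classifiers;
# this typically improves reliability on short CRM messages.
langid.set_languages(['en', 'it'])

lang, score = langid.classify("Non sono soddisfatto del servizio ricevuto.")
print(lang, score)  # e.g. 'it' with a log-probability score
```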
Once the document language has been inferred, a Document Vectorization (DV) step is required to obtain a vector representation of the documents. In DV, each document is assigned the values of a fixed, common set of attributes and is therefore represented by a vector of its attribute values, with the number of vector elements being the same for all documents [28]. A simple and widely used vector text representation is the Bag Of Words (BOW), where each document d is represented by a vector \( d=\left({\mathcal{W}}_1,\dots, {\mathcal{W}}_{\left|T\right|}\right) \) where T is the set of terms that appear at least once in the training documents and each \( {\mathcal{W}}_i\in \left[0,1\right] \) quantifies how much the term ti ∈ T contributes to the semantics of d (e.g. word count, term frequency, term frequency–inverse document frequency).
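As a concrete illustration (not part of the proposed pipeline, which relies on embeddings as discussed below), a tf-idf BOW representation can be obtained with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: each document becomes a |T|-dimensional vector whose
# entries are tf-idf weights (in [0, 1] with the default l2 norm).
corpus = [
    "great product and fast delivery",
    "terrible support and slow delivery",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # shape: (n_documents, |T|)

print(vectorizer.get_feature_names_out())  # the term set T
print(X.toarray().round(2))                # one BOW vector per document
```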
The main limitation of BOW is that it disregards context, grammar and even word order. These limitations can be overcome with more refined context-sensitive approaches like Word Embeddings (WEs). With WEs, individual words are represented by dense vectors that project them into a continuous vector space. The position of a word's vector within that space is learned from the training documents and is based on the words that surround the word where it is used.
WEs are able to capture semantic similarities between words: words with similar meanings have close vector representations. In [29] it has been shown that semantic and syntactic patterns can be reproduced using vector arithmetic: e.g., subtracting the vector representation of the word “Man” from the vector representation of “Brother” and then adding the vector representation of “Woman”, we obtain a vector close to the representation of “Sister”. Once WEs are learned, a document can be represented by aggregating (e.g. adding or averaging) the word vectors of the included terms to obtain a single vector representation [30].
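The analogy can be reproduced with any set of pre-trained vectors; the sketch below uses spaCy's English vectors (the en_core_web_md package is our assumption) and plain cosine similarity:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # any spaCy model shipping word vectors

def vec(word):
    return nlp.vocab[word].vector

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# brother - man + woman should land near the vector for "sister"
result = vec("brother") - vec("man") + vec("woman")
print(cosine(result, vec("sister")))  # typically among the highest scores
```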
In the proposed approach, a WE aggregation strategy based on a neural network, closely coupled with the classification step, was adopted; it is discussed in the following section. For our purposes we have used the pre-generated WEs provided by spaCy, a Python tool for NLP. Such WEs were obtained by training a Convolutional Neural Network (CNN) on large natural-language text corpora. In particular, the Universal Dependencies and WikiNER corpora [31] were used for Italian, while the OntoNotes [32] and Common Crawl corpora were used for English.
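Accessing these vectors is straightforward; the package names below are assumptions, since the specific spaCy models are not stated above. Note that doc.vector illustrates the simple averaging baseline, whereas the proposed approach aggregates word vectors with the HAN described in the next section:

```python
import spacy

# Assumed package names for spaCy models that ship word vectors.
nlp_en = spacy.load("en_core_web_md")
nlp_it = spacy.load("it_core_news_md")

doc = nlp_en("The support team solved my issue quickly.")
print(doc[1].vector.shape)  # the WE of a single token
print(doc.vector.shape)     # average of token vectors: a naive document vector
```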
The classification model
Once the WEs were made available for the training documents belonging to each target language, two classifiers (for English and Italian) were trained to assess the degree to which each new document belongs to one of the three classes that identify the sentiment polarity. As anticipated, we have adopted the HAN model, which represents an end-to-end solution based on stacked recurrent neural networks that integrates both the WE aggregation and classification steps.
A Recurrent Neural Network (RNN) is a neural model in which the connections between nodes form a directed graph along a temporal sequence. The basic premise of RNNs is to parse items that form an input sequence (such as the WEs of tokens that form a text), one after the other, updating a hidden state vector to represent the context of the previous input. The hidden state (memory) is used, together with the input vectors, to classify each token in light of its context. Starting from an initial hidden state h0 (generally null), for each WE wi with i ∈ {1, …, n} which composes a text to be analyzed, a new hidden state is generated according to the following equation [33]:
$$ {h}_i=\tanh \left(b+W{h}_{i-1}+U{w}_i\right) $$
(2)
where the parameters b (bias vector), U (input-to-hidden connections) and W (hidden-to-hidden connections) are learned by the RNN on the training set with algorithms based on gradient descent, as explained in detail in [33]. When the last word wn is consumed, the last hidden state hn (which summarizes the whole text) is used for classifying the text according to the following equation:
$$ class=\mathrm{softmax}\left(c+V{h}_n\right) $$
(3)
where the parameters c (bias vector) and V (hidden-to-output connections) are also learned on the training set, and softmax is a function that normalizes the output into a probability distribution over the set of classes, emphasizing the largest values [34].
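The following sketch traces the computation of Eqs. (2) and (3) with toy dimensions and random (i.e. untrained) parameters, purely to make the data flow explicit:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Toy dimensions: 50-d word embeddings, 32-d hidden state, 3 classes.
d_emb, d_hid, n_classes = 50, 32, 3
rng = np.random.default_rng(0)

# In a trained RNN these parameters are learned by gradient descent;
# here they are random, purely to trace the computation.
W = rng.normal(size=(d_hid, d_hid))      # hidden-to-hidden connections
U = rng.normal(size=(d_hid, d_emb))      # input-to-hidden connections
b = np.zeros(d_hid)                      # hidden bias
V = rng.normal(size=(n_classes, d_hid))  # hidden-to-output connections
c = np.zeros(n_classes)                  # output bias

words = rng.normal(size=(10, d_emb))     # a 10-token document as WEs

h = np.zeros(d_hid)                      # h0: the initial (null) hidden state
for w in words:
    h = np.tanh(b + W @ h + U @ w)       # Eq. (2)

probs = softmax(c + V @ h)               # Eq. (3)
print(probs.round(3), probs.argmax())    # class probabilities and prediction
```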
The sequential nature of RNNs is often not sufficient to characterize natural language. In fact, in some simple sentences the order of words is not very important, while in complex ones the relationships between distant words are often more important than those between neighbors. According to [11], a better representation of the text can be obtained by introducing attention mechanisms, i.e. the output is not a function of the final hidden state but rather a function of all hidden states (global attention) or a subset of them (local attention). In the same paper a mixed local/global attention mechanism was proposed that mimics the structure of the document, following the intuition that the parts of a document are not equally relevant for a specific classification task.
This model uses a bidirectional RNN (where hidden states depend on both previous and subsequent states) at the word level, with a local attention mechanism to extract the words that are important for the meaning of each sentence, and then aggregates the representations of those words to form a single vector for each sentence. Then, the same process is applied globally to aggregate the sentence vectors so as to obtain a single document vector that is used for classification. In [11] it has been shown that such a network outperforms other text classification methods by a substantial margin. For this reason, we decided to adopt it for sentiment classification.
Figure 1 shows the HAN architecture. Given a document consisting of m sentences, \( {\mathcal{W}}_{ij} \) indicates the WE which represents the j-th word of the i-th sentence with i ∈ (1, …, m) and j ∈ (1, …, n). For each sentence, the HAN generates a sentence vector Si through the following steps.
- Word Encoding: the hidden state hij is calculated for each word \( {\mathcal{W}}_{ij} \) in order to summarize the information of the whole i-th sentence. It is made of two components: \( {\overrightarrow{h}}_{ij} \), dependent on the previous states and calculated according to (2), and a specular \( {\overleftarrow{h}}_{ij} \) component dependent on the subsequent states. A gating mechanism is used to regulate the flow of information according to [35].
- Word Attention: a sentence vector si is calculated for each sentence as the weighted sum of the hidden states hij with j ∈ (1, …, n), where the weights αij are intended to identify the most informative words of a sentence, as follows:
$$ {s}_i=\sum \limits_{j=1}^n{\alpha}_{ij}{h}_{ij};\kern0.75em {\alpha}_{ij}=\mathrm{softmax}\left({u}^T\tanh \left(c+V{h}_{ij}\right)\right). $$
(4)
Hence, the importance αij of a word wij is measured as the normalized similarity between the output of the j-th unit of the i-th RNN, calculated according to Eq. (3), and a “context vector” u learned during the training process as described in [11].
Then the process is iterated at the sentence level, with a Sentence Encoding step aimed at obtaining the hidden states hi from the corresponding vectors si with i ∈ (1, …, m), followed by a Sentence Attention step aimed at generating the document vector v as the weighted sum of the sentence vectors. The vector v is a high-level representation of the whole document and is used as input for the Classification step, which is performed according to Eq. (3) with v replacing hn. The class corresponding to the highest value of the obtained probability distribution is then returned as the classification output, and the related probability as the corresponding confidence score.
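To make the attention step concrete, the sketch below computes Eq. (4) for a single sentence with toy dimensions and random (untrained) parameters; in the full model the same machinery is repeated at the sentence level:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_hid = 32
rng = np.random.default_rng(1)

# Hidden states h_i1 .. h_in of one sentence of 8 words (in the real model
# each state concatenates the forward and backward RNN components).
H = rng.normal(size=(8, d_hid))

# Attention parameters, learned during training in the real model.
V = rng.normal(size=(d_hid, d_hid))
c = np.zeros(d_hid)
u = rng.normal(size=d_hid)         # the "context vector" u

scores = np.tanh(H @ V.T + c) @ u  # u^T tanh(c + V h_ij), one score per word
alpha = softmax(scores)            # word importance weights, Eq. (4)
s = alpha @ H                      # sentence vector s_i as a weighted sum

print(alpha.round(3))              # which words the sentence vector attends to
print(s.shape)                     # (32,): a single vector for the sentence
```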
Retraining for model improvement
The classifier described in the previous sections is used as part of a corporate CRM aimed at analyzing and managing customer requests. For each incoming message, the system assigns a sentiment label and a confidence value obtained as output of the softmax function applied by the last layer of the classifier. Such meta-information helps improve the processing of the communication. Nevertheless, in some cases, the system-assigned labels may be incorrect. To deal with these cases, the system offers CRM operators the possibility of proposing alternative labels, which are subsequently used to improve the model.
Improving a trained neural model is a task that presents several hurdles related to the so-called stability/plasticity dilemma, a well-known constraint in artificial and biological neural systems [36]. Ideally, neural models should be plastic enough to learn new things or adapt to an evolving environment, but stable enough to preserve important information over time. Unfortunately, training algorithms are usually extremely plastic, being designed to quickly converge from an initial, generally random, representation towards a new one useful for solving a problem. This leads to the so-called catastrophic forgetting issue: if, after its original training is finished, a network is exposed to new information, then the originally learned information will typically be greatly disrupted or lost [37].
The main consequence of catastrophic forgetting is the degradation of the overall performance of a trained network when it is retrained with new examples. The problem has been analyzed by several researchers [38], and some solutions have been proposed so far, such as fine-tuning (retraining an existing network with a low learning rate) [39], progressive networks (adding new nodes to an existing network to learn from new examples without affecting other nodes) [40], transfer learning (training a new network on the output of an existing network plus new examples) [41], and synaptic consolidation (reducing the plasticity of connections that are vital for previously learned tasks) [42].
Among these, we selected a simple but powerful method based on sweep rehearsal [37]. Feedback from CRM operators (including customer requests and proposed sentiment labels) is collected in a feedback repository. After a customizable timeframe (usually a few days of system operation), the classifier is retrained with the samples from the feedback repository plus random samples selected from the original training set (in [37] it has been shown that excellent results are obtained by just adding three old samples for each new one). Unlike random rehearsal, in sweep rehearsal the training buffer is dynamic, which means that the items of the old training set are randomly re-chosen for each training epoch. This allows more previously learned items to be exposed during training without specifically retraining any previously learned item to criterion. After retraining, the feedback repository is emptied and its content is inserted into the training set (so the new samples become consolidated).
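A minimal sketch of the buffer construction is given below; the function name sweep_rehearsal_epochs and the three-to-one default ratio are our own naming and parameter choices, with model.fit standing in for one retraining pass of the HAN classifier:

```python
import random

def sweep_rehearsal_epochs(new_items, old_training_set, n_epochs,
                           old_per_new=3):
    """Yield one training buffer per epoch.

    Every buffer contains all operator-feedback items plus a fresh random
    draw of old items (three per new item, following [37]); redrawing the
    old items at every epoch is what distinguishes sweep rehearsal from
    random rehearsal, where the buffer is fixed.
    """
    k = min(old_per_new * len(new_items), len(old_training_set))
    for _ in range(n_epochs):
        buffer = list(new_items) + random.sample(old_training_set, k)
        random.shuffle(buffer)
        yield buffer

# Hypothetical usage, with `model.fit` standing in for one retraining
# epoch of the HAN classifier at a low learning rate:
# for buffer in sweep_rehearsal_epochs(feedback, training_set, n_epochs=10):
#     model.fit(buffer)
```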