Hatred and trolling detection transliteration framework using hierarchical LSTM in code-mixed social media text

This paper describes the use of a self-learning Hierarchical LSTM (HLSTM) technique for classifying hatred and trolling content in code-mixed social media data. HLSTM-based learning is a novel architecture inspired by neural learning models. The proposed HLSTM model is trained to identify hatred and trolling words in social media content and is equipped with a self-learning and prediction mechanism for annotating hatred words in the transliteration domain. The Hindi–English data are labelled as Hindi, English, and hatred for classification. Word-embedding and character-embedding features are used to represent the words of a sentence so that hatred words can be detected. The method helps recognize the context of a hatred word by mining the user's intention in using that word in the sentence. Extensive experiments suggest that the HLSTM-based classification model achieves an accuracy of 97.49% when compared against baseline models such as BLSTM, CRF, LR, SVM, Random Forest and Decision Tree, especially when the social media data contain hatred and trolling words.


Introduction
Suspicious online users hide their real identities and use their social media accounts to post deceptive messages and inciting comments and tweets that spread hatred in society. Hate content detection is the automated task of extracting hatred-based words from social media content. Hateful content can be used to misinform people and can thus result in violent incidents. With online hateful content culminating in gruesome scenarios such as the Rohingya issue in India, the Article 307 issue, mob violence by anti-social elements, and terrorist shootings, there is a dire need to understand the dynamics of user interaction that facilitate the spread of such content. It has also been observed that content generated by hateful users tends to spread faster than content generated by normal users [1]. An intelligent tool for detecting hate speech or hatred content on social media in the Indian context is therefore needed. The internet is currently the most widely used medium for communication and information exchange, and the language used over it is not limited to one; people communicate in mixed languages. The following sentence illustrates the mixed-language format, with hatred and trolling terms that can provoke violence in society.
Sentence 1: @-inke last 5 years ke work ko dekh lijiy nafrat ho jayegi aapko inse. Desh mein julus nilkegi aur public khule aam patthro se maaregi inko. Phir se dhamkana hoga. Aise logo se Nafrat karo.
Sentence 1 is code-mixed content posted on social media that incites hatred. It translates as '@.. look at the past 5 years' work of these people; one will start hating them. A rally will be organized and the public will throw stones at them. Again, there is a need to threaten them. Hate these people. Kill them.' Annotation is used to make the scenario easier to understand. Hindi words are tagged as H, English words are tagged as E, and English or Hindi words denoting hatred are tagged as E/HT and H/HT, respectively. The words "last", "work" and "public" are tagged as E. The words "Hate" and "Kill" are annotated as E/HT. The words "Desh", "inke", "dekh" and "julus" are tagged as H, and words like "nafrat", "Patthar mareenge" and "dhamaka" are annotated as H/HT. The extensive use of social platforms makes such content available in multilingual form, which is challenging to process.
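The four-way tagging scheme (H, E, H/HT, E/HT) can be illustrated with a minimal lookup-based sketch. The lexicons below are hypothetical toy sets chosen for illustration, and a dictionary lookup is only a stand-in for the learned HLSTM tagger; it is not the method proposed in this paper.

```python
# Toy lexicons: hypothetical, for illustration only (not the paper's resources).
ENGLISH_WORDS = {"last", "years", "work", "public"}
ENGLISH_HATE = {"hate", "kill"}
HINDI_HATE = {"nafrat", "dhamkana", "maaregi"}

def tag_token(token):
    """Return one of the four tags H, E, H/HT, E/HT for a single token."""
    word = token.lower().strip(".,!?@-")
    if word in ENGLISH_HATE:
        return "E/HT"
    if word in HINDI_HATE:
        return "H/HT"
    if word in ENGLISH_WORDS:
        return "E"
    # Simplifying assumption: any remaining Roman-script token is Hindi.
    return "H"

def tag_sentence(sentence):
    """Tag every whitespace-separated token of a code-mixed sentence."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]

tagged = tag_sentence("Desh mein public nafrat karo")
```

A lookup of this kind cannot resolve words that are valid in both languages, which is the ambiguity that motivates the context-based model.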
HLSTM (Hierarchical Long Short-Term Memory) is a novel classification technique that can be applied to extract language information from code-mixed data. The HLSTM-based learning model is an extension of the existing LSTM model. A neural learning approach is the primary option for processing user-generated text and helps in extracting its language information. Mixing languages to express an opinion is very common in the Indian context and is widespread on social media. The main challenge in processing and analyzing these data is that transliteration allows a word to be written in multiple variants. The work presented in this paper points out the confusion that arises among languages that are semantically and linguistically related. Human beings are inclined to use their native language for communication [2], and the same native language is used on social media for expressing opinions. Native languages give users who are not proficient in English the freedom to express themselves easily.
This paper presents a novel framework that uses word-level context information to predict hatred content. In the dataset used for evaluation, the hatred words are tagged as E/HT. In many applications such scenarios need to be normalized and processed for better understanding. Question answering, sentiment analysis and text classification are potential use cases for context analysis based on the scenarios discussed.
Extracting the ambiguity of words belonging to the English language is a challenge; however, HLSTM learning helps in analyzing the context of a word. The model discussed in this paper is based on recent research in computer vision and NLP tasks [1,3]. The HLSTM learning technique is applied and tested on various parameters.
A text analysis and embedding technique [4] is applied for processing. Character embedding and word embedding models are used together for ambiguity extraction, and the model is trained to detect hatred and trolling words [5].
The contribution and deliverables of this work are as follows:
The remainder of the paper is organized as follows. The literature review is presented in Section "Literature review". The proposed methodology is given in Section "Proposed work". The dataset is described in Section "Dataset". Section "Evaluations" provides the experimental discussion and evaluation measures. Finally, the overall inferences are given in Section "Conclusions".

Literature review
This section surveys related work on language transliteration, code-mixing and ambiguity detection in code-mixed data. Section "Language transliteration" describes the state of the art in transliteration, focusing on machine learning techniques, because transliterated text is commonly used on social media sites for expressing opinions. Section "Code-mixing" covers research developments in code-mixing, a frequent pattern in which users write posts in two languages, one being English and the other the transliterated form of a native language. Section "Hatred and trolling detection" summarizes work on identifying hatred and trolling words in written sentences. The proposal presented in this paper addresses a research gap: identifying hatred and trolling terms in social media posts in the Indian context, written in English and transliterated Hindi.

Language transliteration
Transliteration is an active research area in text analysis. Language mining is the foremost task in textual content processing [6]. The paper [7] describes the application of the CRF (Conditional Random Field) method to bigram processing. The paper [8] focuses on the use of LR (Logistic Regression) with a probability distribution function in the code-mixed domain. The paper [9] discusses the applicability of dictionaries to the normalization of transliterated terms.
The paper [10] presents the creation of a linguistic resource for English and Gujarati. Language is identified by translating into the native language using transliteration. The work mainly concerns the transliteration of Gujarati for identifying the language used in code-mixed text.
A mixed-script language identification task was conducted for Indian languages [11], for which machine learning techniques using an SVM (Support Vector Machine) classifier [12] were proposed. Classification and related machine learning techniques for English-Hindi [13][14][15] were also addressed. This task gives emerging researchers the opportunity to improve their understanding of the transliteration field [16]. Various emotion identification models based on learning approaches [17] have been described for language mining [18][19][20].

Code-mixing
The papers [12,21] describe code-mixing patterns in text content. Entity mining in code-mixed data [22] is discussed with the use of embedding techniques [23]. The paper [24] focuses on communication media where short forms are used and presents work on extracting their meaning; it also points out the use of regional dialects in communication and handles the identification of their contextual meaning. The papers [25,26] present the state of the art in language identification. The paper [27] describes the use of an MNN (Multi-Layer Neural Network) along with LSTM (Long Short-Term Memory) for ambiguity minimization in mixed-script textual data.
The paper [28] details the challenges that code-mixed data add to sentiment analysis. The authors use BLSTM (Bi-directional Long Short-Term Memory) sentence generation and selection with a neural classification model to classify code-mixed text into predefined sentiment classes. This BLSTM-based classification approach overcomes the nuances of detecting sentiment in code-mixed data.

Hatred and trolling detection
The work presented in [29] describes an approach for removing ambiguity around hatred terms with the aid of learning models, mostly in the intrinsic language domain, to find the effective contextual meaning. This section models the ambiguity problem in code-mixed data using an embedding technique [30]. Embedding models are widely used to find ambiguous words that are common to both languages, a common research issue [31] in multilingual datasets [32] used for NER (Named Entity Recognition) [32] extraction in the transliterated domain.
Semantic similarity identification is handled in the paper [33] for analyzing two concepts in the domain of NLP. A method based on WordNet 3.1 is used for determining the similarity. The work overcomes the ambiguities found in social media text by using a feature selection technique to improve semantic similarity. The findings suggest that identifying similarity or ambiguity between concepts or words depends on selecting balanced rather than unbalanced features.
The paper [34] details evaluation measures for semantic representation, including the shortest path for context measurement. It addresses ambiguity removal by using synonymy through knowledge-based lexicons. Both a knowledge-based approach and a statistical approach are used for semantic representation. The knowledge-based approach uses dictionaries and thesauri along with ontologies, whereas the statistical approach uses word frequencies to identify semantic relations among words.
Detecting ambiguous hatred and trolling words is a challenging issue, and LSTM and BLSTM [30,31] are nowadays incorporated for effective results. Code-mixing gives users the flexibility to mix more than one language to express their thoughts. To process such code-mixed text, identifying the language of each word is important. The main research issue is to propose a technique for identifying the language of hatred words in Hindi-English code-mixed data. The focus needs to be on word-level language identification for hatred words based on the user's intention in using that word in a code-mixed environment. Word-level intent identification based on the presence of hatred words is thus an open issue in the transliterated domain. The study also indicates that a hierarchical approach to learning can lead to better prediction of hatred, trolling and ambiguous words in a code-mixed environment [35][36][37]. The HLSTM model has an architecture similar to the LSTM, but it avoids 0/1 boundary detection for detecting ambiguity or the presence of hatred terms; instead, a hierarchical policy gradient method gradually learns a policy to select better actions from inside the phrase for each word in a code-mixed environment. The next section describes the proposed HLSTM-based model for identifying hatred and trolling words in code-mixed social media text.

Proposed work
The work discussed in this section extends the work presented in the paper [38] on scarce language lexicons. Research in transliteration is explored with the use of an embedding framework, and language extraction is a promising area to explore in the transliterated environment. The HLSTM model was selected for the proposed work because it has an architecture similar to the LSTM but, instead of 0/1 boundary detection, uses a policy gradient method that gradually learns a policy to select better actions from inside the phrase for each word. The HLSTM determines a structured representation of a sentence and thus identifies hatred words more effectively than the LSTM, which is well suited to classifying, processing and predicting time-series-based ambiguity of known duration. In the next section, the architecture of the artificial LSTM network model is explained.

Design principles
Generating information from language is a challenging task; we therefore propose a set of design principles for evaluating the proposed algorithm. The design guidelines are as follows:
a. The document must contain words from two different languages.
b. A single script must be used for writing the content. Here the selected script is Roman.
c. Scenarios are based on the fact that, in India, code-mixing is done between two languages, one of which is English.
d. The comment length must be between 2 and 15 words.
e. Semantic and syntactic rules are applied for validating the mixing points of the languages.
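Principles a to d are mechanical checks, so they can be sketched as a small filter. The function name, the Roman-script pattern and the tag format below are assumptions made for illustration; principle e (semantic and syntactic validation of mixing points) is not covered by this sketch.

```python
import re

# Hypothetical pattern for Roman-script tokens, allowing digits and common punctuation.
ROMAN_TOKEN = re.compile(r"^[A-Za-z0-9@#'.,!?-]+$")

def satisfies_design_principles(tokens, lang_tags):
    """Check principles a-d for one comment.

    tokens    : list of words in the comment
    lang_tags : parallel list of tags such as 'H', 'E', 'H/HT', 'E/HT'
    """
    if len(tokens) != len(lang_tags):
        raise ValueError("tokens and lang_tags must have the same length")
    # The base language is the part of the tag before any '/' suffix.
    languages = {tag.split("/")[0] for tag in lang_tags}
    two_languages = {"H", "E"} <= languages                      # principles a and c
    roman_only = all(ROMAN_TOKEN.match(t) for t in tokens)       # principle b
    length_ok = 2 <= len(tokens) <= 15                           # principle d
    return two_languages and roman_only and length_ok

ok = satisfies_design_principles(
    ["is", "actor", "ko", "maara", "karo"], ["H", "E", "H", "H", "H"]
)
```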
The proposed model is depicted in Fig. 1. It follows the training sequence of HLSTM [39] at word level and applies an embedding technique. Figure 2 depicts the architecture of the proposed HLSTM learning model for detecting hatred terms. The model takes as input a code-mixed sentence consisting of a sequence of words $(w_1, \ldots, w_n)$, which is converted into tokens for further processing. An English word may also be used in a Hindi context; for example, the word dust can be read as दुस्ट or as डस्ट. This ambiguity needs to be resolved before hatred terms are identified in the code-mixed sentence. This is done using the BLSTM learning model, where embedding is applied to extract the correct contextual meaning of the ambiguous words. The HLSTM model depicted in Fig. 2 is based on vector representation, which captures the contextual meaning of a word with regard to the previous and next words in the sentence sequence. The following vectors are generated: (i) the forward character vector $\overrightarrow{C}$, (ii) the backward character vector $\overleftarrow{C}$, (iii) the word embedding $e(w_i)$, and (iv) the PoS embedding $e(p_i)$. The transformation $w_e$ is applied to the softmax function $g$. The computation is depicted in Eq. 1. Embeddings are necessary for computation because words need to be converted into numbers. Similar words have similar vectors, while dissimilar words have differing vectors. For example, 'gussa' and 'gussail' can be described in a similar context, in contrast to the pair 'gussa' and 'dust'. This representation technique is the baseline of the algorithm used for training and testing.
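The idea that similar words receive similar vectors can be illustrated with a deliberately simple stand-in. The sketch below uses character-bigram count vectors and cosine similarity; this is an assumption made for illustration and is not the learned embedding of the proposed model.

```python
import math
from collections import Counter

def char_bigrams(word):
    """Represent a word as a bag of character bigrams (a crude, unlearned vector)."""
    word = word.lower()
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim_related = cosine(char_bigrams("gussa"), char_bigrams("gussail"))
sim_unrelated = cosine(char_bigrams("gussa"), char_bigrams("dust"))
```

With these toy vectors the pair 'gussa'/'gussail' scores higher than the pair 'gussa'/'dust', which mirrors the qualitative behaviour described for the embeddings.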
The classifier feature training parameters are depicted in Table 3.
The model is trained to predict the language to which each word belongs. The applied tagging technique is used to distinguish the context [40,41]. The trained model is an example of learning applied to text analysis. The softmax activation function supports classification based on the embedding parameters e(p). Classification is done on the basis of pre-trained tagging constraints. Tables 1 and 2 list sample hatred words used for training and classification.
The embedding technique uses the input to extract features associated with the words, and directional mapping enhances the embedding features. Using softmax as the activation function over the concatenated vectors thus helps in resolving the ambiguity.
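The concatenate-then-softmax step can be sketched as follows. The vector sizes, the random weights and the tag order are assumptions chosen only to make the example runnable; in the proposed model these quantities are learned.

```python
import math
import random

TAGS = ["H", "E", "H/HT", "E/HT"]  # assumed output label set

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fwd_char, bwd_char, word_emb, pos_emb, weights, bias):
    """Concatenate the four feature vectors and apply a linear layer plus softmax."""
    x = fwd_char + bwd_char + word_emb + pos_emb  # list concatenation
    scores = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(scores)

rng = random.Random(0)
dims = (3, 3, 4, 2)  # hypothetical sizes: forward char, backward char, word, PoS
vectors = [[rng.uniform(-1, 1) for _ in range(d)] for d in dims]
weights = [[rng.uniform(-1, 1) for _ in range(sum(dims))] for _ in TAGS]
bias = [0.0] * len(TAGS)

probs = classify(*vectors, weights, bias)
predicted = TAGS[probs.index(max(probs))]
```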
The terms that occur before and after the pivot term, forming the (i + 1) term and the (j + 1) term, are used as word features. Consider the example sentence (Main to maar daalunga): the language of each word is extracted and the words are tagged accordingly. The learning feature of the model classifies the hatred words as E/A, E/HT or H/HT. The tagging scenario is depicted as follows:

Embedding: word
The embedding mechanism supports the extraction of feature vectors. The purpose of embedding is to generate computational vectors [42,43]. Consider the documents below, which illustrate the embedding technique.

Doc. A: "agar tum is actor ke dushman ho to is actor ko maara karo".
Doc. B: "agar tum is actor ke dushman nahi ho to is actor ko maara naa karo".
Doc. A contains 11 features and Doc. B contains 13 features. The features are computed on the basis of [44]; for Doc. A they can be expressed as {is, agar, tum, actor, dushman, ke, ho, ko, to, maara, karo}. The skip-gram methodology supports embedding [45] and is illustrated in Fig. 3. The input word $T_0$ is provided to the classifier to find the next word in the sequence based on log probability [46], as depicted in Eq. 1. This log probability facilitates computation in terms of distributed dimensions.
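The feature counts for the two documents can be checked by treating each distinct word as one feature, and skip-gram training pairs can be enumerated from the same token sequence. The window size of 2 is an assumption made for illustration.

```python
DOC_A = "agar tum is actor ke dushman ho to is actor ko maara karo"
DOC_B = "agar tum is actor ke dushman nahi ho to is actor ko maara naa karo"

def vocabulary(doc):
    """Distinct words of a document, each counted as one feature."""
    return sorted(set(doc.split()))

def skipgram_pairs(doc, window=2):
    """Enumerate (target, context) pairs within a symmetric window."""
    tokens = doc.split()
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

vocab_a, vocab_b = vocabulary(DOC_A), vocabulary(DOC_B)
pairs_a = skipgram_pairs(DOC_A)
```

Doc. B differs from Doc. A only by the two negation words "nahi" and "naa", which accounts for the two additional features.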
In the above equation, $P(T_j \mid T_k)$ is the word probability and $V'$ is the output vector. Equation 4 denotes the significance of character embedding, used especially for English words that have been tagged as E/A. It is used to create feature vectors based on character representation for words tagged as ambiguous English. Consider the word "main", tagged as E/A. It can be read phonetically as मैं in Hindi and मेन in English. Log scaling is computed for English words to further optimize the context probability [47].
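For reference, the textbook skip-gram formulation is written below using the symbols $T_j$, $T_k$ and $V'$ from the text. The window size $c$, the sequence length $T$, the vocabulary size $W$ and the input vector $V$ are standard skip-gram notation introduced here as assumptions; the exact form used in this work may differ.

```latex
% Average log probability over a sequence of T words with context window c
\frac{1}{T} \sum_{k=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P\left(T_{k+j} \mid T_k\right)

% Softmax definition of the word probability, with output vectors V' and input vectors V
P\left(T_j \mid T_k\right) =
  \frac{\exp\left({V'_{T_j}}^{\top} V_{T_k}\right)}
       {\sum_{w=1}^{W} \exp\left({V'_{w}}^{\top} V_{T_k}\right)}
```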
A probability distribution is applied for word identification based on pre-defined classification. For example, the word "to", represented as $\text{Word}_E$, has a phonetically similar Hindi counterpart $\text{Word}_H$, represented as तो. This illustrates that there are $2^N$ possible labelings for any sentence containing N words. Dynamic programming is suitable for computing these possibilities. Context capturing is applied to retrieve all hatred terms used in the sentence. Training for word representation is imparted in the learning phase to forecast the probable words. The learning model represents the words of a sentence as a sequence of terms $(W_1, W_2, W_3, \ldots, W_T)$ [48][49][50]. This approach requires little memory and is highly scalable [51,52]. HLSTM-based learning is therefore beneficial for finding ambiguity and hatred terms, and it can be used as a novel tool for solving problems in NLP.
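The gap between enumerating all $2^N$ labelings and using dynamic programming can be made concrete with a small sketch. The Viterbi-style recursion and the toy log-scores below are assumptions for illustration; the paper does not specify which dynamic programming scheme is used.

```python
from itertools import product

LANGS = ("H", "E")

def brute_force(emit, trans):
    """Score all 2^N label sequences; emit[i][lang] and trans[(a, b)] are log-scores."""
    best = None
    for seq in product(LANGS, repeat=len(emit)):
        score = sum(emit[i][lang] for i, lang in enumerate(seq))
        score += sum(trans[(a, b)] for a, b in zip(seq, seq[1:]))
        if best is None or score > best[0]:
            best = (score, seq)
    return best

def viterbi(emit, trans):
    """Find the same best sequence in O(N * |LANGS|^2) time by dynamic programming."""
    best = {lang: (emit[0][lang], (lang,)) for lang in LANGS}
    for i in range(1, len(emit)):
        new = {}
        for cur in LANGS:
            score, path = max(
                ((best[prev][0] + trans[(prev, cur)], best[prev][1]) for prev in LANGS),
                key=lambda t: t[0],
            )
            new[cur] = (score + emit[i][cur], path + (cur,))
        best = new
    return max(best.values(), key=lambda t: t[0])

# Toy log-scores (made up) for the four words of "Main to maar daalunga".
emit = [
    {"H": -0.4, "E": -1.1},
    {"H": -0.7, "E": -0.7},
    {"H": -0.1, "E": -2.3},
    {"H": -0.1, "E": -2.5},
]
trans = {("H", "H"): -0.2, ("H", "E"): -1.0, ("E", "H"): -1.0, ("E", "E"): -0.3}

bf_score, bf_path = brute_force(emit, trans)
dp_score, dp_path = viterbi(emit, trans)
```

Both functions return the same labeling; the brute-force search visits 16 sequences for four words, whereas the dynamic programming version keeps only one best partial path per language at each position.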

Dataset
The dataset used for training and testing comes from [53] and is built from social media content. The details are presented in Table 4. The dataset contains English-Roman Hindi code-mixed data. Monolingual English, romanized Hindi and other Indian-language text messages are all prevalent on social media in the Indian context; here we concentrate only on code-mixed English-Hindi in order to extract hatred and trolling terms used on social media. The corpus is mostly a bilingual mix. When two languages are blended, it is important to know how they are mixed, that is, which language is mixed in what manner and in what ratio. This leads to the notions of MI (Multilingual Index) and CMI (Code Mixing Index).
The dataset described in Table 4 is a standard dataset used by many researchers for evaluating their hypotheses. This is important for finding patterns and validating the results. The WordNet resource [54] is further used for ambiguity identification and normalization; it is used specifically for analyzing Indian languages. The idea behind this resource is to retrieve the most frequently used words in a sentence so that the model can learn the frequency of the words used.

Evaluations
This section presents the evaluation measures applied to the above dataset. Two evaluation measures are used. First, a statistical measure assesses the similarity of words. Second, a context evaluation measure based on the proposed HLSTM model is compared against other state-of-the-art machine learning models. The statistical measures are used to find the similarity of words that appear in code-mixed data with different spelling variations; for example, the word khauf can be transliterated as khof, khaof, khaoph and so on. The statistical evaluation is presented in Section "Experimental results", and Section "Context retrieval evaluation" describes the evaluation measures for context identification, which find ambiguous hatred terms in code-mixed data based on the left and right context of a word with regard to the entire sentence.
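The similarity between spelling variants can be illustrated with a normalized edit-distance score. This is an assumed stand-in chosen for illustration; it is not the Conf_Score used in the classifier.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance scaled by the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest

variants = ["khof", "khaof", "khaoph"]
scores = {v: similarity("khauf", v) for v in variants}
```

Under this measure each transliteration variant of khauf scores well above an unrelated word such as dust, so a threshold on the score could group the variants together.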

Experimental results
The evaluation measures selected for testing the hypothesis are based on statistical evaluation and on the learning imparted to the machine through HLSTM. The following sections detail the evaluation standards used.
MI (Multilingual Index) [9]: MI quantifies the language distribution based on the tagging mechanism. The multilingual index of the dataset is measured using Eq. 5. Here the symbol $K$ denotes the language specification for the word $P_j$.
CMI (Code Mixing Index) [9]: CMI quantifies the distribution of the language used most in the dataset. The mixing index is computed using Eq. 6. In Eq. 6, $\sum_{i=1}^{n} w_i$ denotes the summation over the languages used, and $\max\{w_i\}$ represents the words of the language with maximum availability in the dataset. The notations $n$ and $u$ describe the tokens and the tagging mechanism. Equation 6 is further normalized to a scale between 0 and 1, as depicted in Eq. 7. In Eq. 7, $\max(w_i)$ describes the labeled words. The CMI value computed with this equation gives the mixing patterns in the data that are passed to the machine for further processing (see Table 5).

Source      Training   Testing   Training   Testing
WhatsApp    883        376       3929       903
Twitter     1387       273       25,749     4027
Instagram   782        343       18,742     3879
Facebook    1372       489       24,632     3423
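Both indices can be sketched from their commonly cited definitions: the multilingual index $(1 - \sum_j p_j^2) / ((k - 1) \sum_j p_j^2)$ over $k$ languages with token proportions $p_j$, and the code mixing index $100 \cdot (1 - \max\{w_i\} / (n - u))$ for $n$ tokens of which $u$ are language-independent. These forms are assumed here and may differ in detail from Eqs. 5 to 7; the tag "U" for language-independent tokens is a hypothetical label.

```python
from collections import Counter

def multilingual_index(lang_tags):
    """M-index: 0 for monolingual text, 1 for perfectly balanced languages."""
    counts = Counter(lang_tags)
    k = len(counts)
    if k < 2:
        return 0.0
    total = sum(counts.values())
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)

def code_mixing_index(tags, neutral="U"):
    """CMI on a 0-100 scale; divide by 100 for a 0-1 normalization."""
    n = len(tags)
    u = sum(1 for t in tags if t == neutral)
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t != neutral)
    return 100.0 * (1 - max(counts.values()) / (n - u))

tags = ["H", "E", "H", "H", "U"]  # toy utterance: three Hindi, one English, one neutral token
cmi = code_mixing_index(tags)
mi = multilingual_index([t for t in tags if t != "U"])
```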
The evaluation based on the statistical measure is computed using Eq. 8. Token similarity is measured on the basis of the Conf_Score used in the classifier. Figures 4 and 5 show the similarity values obtained for the parameters defined at word level and sentence level, respectively.
The evaluation uses BLSTM learning applied to the hierarchical model as its base. The dataset [53] contains textual information posted on social media platforms with defined parameters, as illustrated in Table 6.
[Table fragment: [58] 3423; MSIR [59] 8734]
The training data examples for the defined parameters are listed as follows: The defined tagging parameters are used to make the system learn the HLSTM technique for processing the results. Each word in the data is identified on the basis of these classification parameters in order to predict the presence of hatred words. Table 4 details this mechanism and shows data from [55][56][57]. The embedding features used for further normalizing the process are defined in Table 7.
Tables 8, 9 and 10 detail the testing of the proposal using the context-finding features.
The above table illustrates the embedding parameters in terms of the N-gram model. The accuracy for the various N-gram parameters is reported for Twitter, Facebook, WhatsApp and Instagram (see Table 11 and Fig. 6).
The word clouds depicted in Figs. 7 and 8 provide a visual inference of the trained model. The results show clear separation between the words used in Hindi and English, suggesting correct use of the labeling parameters.

Context retrieval evaluation
Context evaluation is based on finding the ambiguity of words used in a code-mixed sentence using the left and right contexts of the pivot word. The evaluation follows a self-learning approach [60]. The basic mechanism for extracting the contextual meaning rests on the condition that the words to the left and right of the pivot word must belong to two different languages [61][62][63]. A statistical approach based on set intersection is applied for annotating the words. The context word is retrieved on the basis of the WX notation technique [58]. The tree representation of the learning model for finding the context is represented as follows: The model uses the grapheme model (GM) and the phoneme model (PM) for representation. The estimation of ambiguity for context retrieval is jointly modeled using Eq. 9.
In Eq. 9, the symbol S denotes the word score, and the two coefficients indexed 1 and 2 are the learning weights given to the model. The scores of GM and PM are estimated on the basis of probability. Table 12 describes the accuracy. Figure 9 shows the results obtained for the developed HLSTM model. TP signifies the number of hatred words. FP denotes the number of English words wrongly detected as ambiguous. TN signifies the number of hatred words in English. FN denotes the number of words incorrectly detected as English. Figure 10 depicts the matrix parameters for computing various dimensions of the result.
The dataset [60] has been further used as a baseline to measure the accuracy. The feature extraction and classifiers are used to predict the correct tag in regard to sentence context. The HLSTM gives better precision and BLSTM gives better recall. Figure 11 provides the graphical representation of the result.
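Precision, recall and accuracy follow from the TP, FP, TN and FN counts in the standard way. The counts in the sketch below are made-up numbers for illustration and are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=80, fp=10, tn=95, fn=15)
```

A model can trade precision against recall, which is why the two are reported separately when HLSTM and BLSTM are compared.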

Conclusions
The objective of this work is defined in terms of HLSTM-based classification. The HLSTM is an extension of the BLSTM learning model. The HLSTM-based alignment technique is useful for tagging, and the developed HLSTM model is used for classification on the basis of a voting technique. The main logic is to find multilingual features for predicting the language to which a word belongs. Context finding using the embedding approach suits the model for extracting ambiguity in sentences. The experiments were organized with the lingual features of the languages in mind. State-of-the-art techniques in ambiguity detection were studied and compared with the developed approach. The developed technique is scalable and usable for intent retrieval where the user's intention in using a hatred word in a sentence is not clear. Intent identification helps in understanding the various language models for extracting the context. Intent retrieval in code-mixed sentences helps in solving many linguistic problems related to polysemy. The system is scalable and flexible enough to carry out experiments on other shades of hatred identification, such as sarcasm and misinformation.

Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.