1 Introduction

Every day, social media generates a large amount of textual data, and it has been demonstrated that this data can be useful in a variety of fields, including healthcare [1], psychology [2], politics [4], crisis management [5], and marketing [6]. Natural Language Processing (NLP) systems can play a significant role in these fields, improving the effectiveness and efficiency of field-specific tasks by exploiting the unprecedented stream of textual data generated on social media platforms [1, 2, 7]. However, these systems typically perform worse on social media text because of its noisy nature [8], caused by users’ propensity to employ nonstandard terms [9, 10]. NLP models trained on clean or curated data face significant challenges when applied to such noisy, inconsistently formatted text. A potential solution to this problem is lexical normalization [11]. The W-NUT 2021 Multilingual Lexical Normalization (MultiLexNorm) shared task [12] assesses lexical normalization systems on 12 social media datasets covering 11 languages, including 2 datasets for code-switching. The evaluation comprises both intrinsic and extrinsic components; the latter is quantified using dependency parsing on the normalized data.

Several types of lexical errors found in social media text lead to tokenization problems in most NLP systems. One lexical error is the typographical error, or the mistyping of user input (e.g., bidge → bridge). Another common error is the missing apostrophe, which is usually omitted by social media users (e.g., im → i'm). There are also “split” errors, where one word is split into several words, and merge errors, where several words are joined into one, both of which usually happen unintentionally (e.g., pre order → preorder, screen shot → screenshot). A further error is the phrasal abbreviation, in which the user takes the first letters of several words to create a single word that represents all of them (e.g., lol → laughing out loud). Another is the repetition error, often used by social media users to emphasize a word or feeling by repeating letters (e.g., soooo → so). Another type of error is shortening, which occurs when a user writes a word without its vowels or without the final syllable or letter (e.g., pls → please, rmx → remix). The last error is the transformation error, which occurs when the user replaces letters with similar ones or drops word endings (e.g., hackd → hacked, wateva → whatever) [13]. Table 1 shows examples of common lexical error types; the main observation is that in the majority of error types, nonstandard words are altered by adding, deleting, or replacing characters.

Table 1 Common lexical error types found in social media text [13]

Recent lexical normalization methods have operated at the word level [14], sub-word level [15], syllable level [16], or sentence level [17]. Based on the previous observation, we focus on character-level information for word-level lexical normalization. To do so, we investigated a method to translate a sequence of characters into a sequence of words. In machine translation, a sequence of input tokens is converted into a sequence of output tokens, where the length of the input sequence may differ from the length of the output sequence and the relations between input and output tokens vary among one-to-one, one-to-many, many-to-one, and many-to-many. This problem is known as alignment [18]. The same alignment problem applies to character-to-word translation for the lexical normalization task: a sequence of characters can be translated into one word (e.g., bidge → bridge) or into multiple words (e.g., lol → laughing out loud).

Fortunately, there is a large body of research on machine translation with remarkable results. We chose the state-of-the-art transformer [19] to build a character-to-word translation model. One alternative would have been a character-based word representation, in which translation is performed at the word or sub-word level; however, such a representation discards character-level relational information. Rather than word-to-word or subword-to-subword translation, our transformer-based model aligns and translates a sequence of characters into a sequence of words. The transformer encoder accepts input words tokenized into a sequence of character tokens, and the transformer decoder accepts word tokens together with the encoded characters and outputs a sequence of word tokens, with one small tweak: an extra token is added to the training labels to classify the input, a token with value (oov) for out-of-vocabulary samples and a token with value (iv) for in-vocabulary samples. In this way, the model uses token prediction for both translating and classifying the input character sequence. The translation method acts as a denoising autoencoder and tries to construct one or more words from a given sequence of characters. The main contribution of this study is a lexical normalization generative model based on character-to-word translation. We also demonstrate how a generative model can be used for classification via special tokens.

2 Related Research

A lot of research has been conducted on the lexical normalization task with different approaches. Most previous work took a two-phase approach: first, it generated a set of normalization candidates, and then it selected the best candidate as the normalization result. One of the first attempts at the two-phase approach used a statistical machine translation model for character-level translation to generate candidates in the first phase and a domain-specific language model for prediction in the second phase [20]. Compared to word-level translation, character-level translation is more robust to unseen words. Another model first generated a set of normalization candidates for words, employed a Support Vector Machine (SVM) for unnormalized word detection and classification, and then used an n-gram lookup for candidate selection [21]. Another character-focused model used two-step statistical machine translation [22]: it first detects non-standard words using conditional random fields (CRF) sequence labeling, then translates the character sequence to a phonetic sequence, and finally translates this new sequence to words; segmenting words based on their phonetic meaning, symbol, or pronunciation improves the alignment of characters with parts of a word. Another promising two-phase approach is based on contextual graph random walks [23]. The model builds a bipartite graph with contexts on one side and unnormalized and normalized words on the other; the edges carry the co-occurrence counts of a word and its context, and Markov random walks are used to locate pairs of unnormalized and normalized words that can be considered normalization equivalents. In contrast to these supervised methods, [24] proposed an unsupervised statistical model for text normalization. They used edit distance, the longest common subsequence, and word-pair co-occurrence as features to define a log-linear model describing the relationship between normalized and unnormalized words, and they trained it using a novel sequential Monte Carlo algorithm. Another approach used a Hidden Markov Model (HMM) [25], in which each standard and nonstandard word was first divided into orthographic syllable segments; to do this, characters were mapped to sounds, and the word-to-word transition was handled with a standard HMM. The HMM has four layers in total: the nonstandard words make up the first layer, followed by the nonstandard word syllables in the second layer, the standard word syllables in the third layer, and the standard words in the last layer. In [26], the MoNoise model was introduced. It used a range of techniques, such as Word2Vec, Aspell, and a feature-based Random Forest, for candidate generation and ranking. The model obtained state-of-the-art results with an F1-score of 86.39%. One other unsupervised method first generated a set of candidate morphological variants and a similarity graph over them, clustered the morphological variants, and then searched for the minimum edit distance within the clusters to obtain the normalized version [27]. Most of the models mentioned required hand-crafted features and did not provide a way to handle unseen abbreviations or other forms of word substitution used on social media platforms.

One of the earliest attempts to employ a deep neural network to tackle the lexical normalization (LN) problem was [15], which demonstrated the importance of a pre-trained language model and extended Bidirectional Encoder Representations from Transformers (BERT) with word-piece tokenization and word-piece alignment. Its performance on the LexNorm2015 dataset yielded an F1-score of 79.28%. [28] uses mBART [23] to translate a full sentence containing unnormalized tokens into a normalized one. The main advantage of using mBART is that it can scale to multiple languages without increasing computational demands. The model obtained an average Error Reduction Rate (ERR) across languages of 10.65, where ERR is a proposed evaluation metric that can be used to compare the performance of systems across multiple datasets [12]. ÚFAL [29] achieved 67.3% ERR on the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 [12] by training the multilingual byte-level generative language model ByT5 [30] on synthetic data and subsequently fine-tuning it. Unlike the previous research, the aim of this study is to find a small yet generative model to solve the LN task. We investigate character-to-word translation as a remedy for lexical normalization, since individual word normalizations frequently operate independently of one another, while normalization relies heavily on the characters of the word.

3 Background

A typical neural machine translation system consists of an encoder that converts the input sequence to a fixed-length representation, a decoder that generates the output sequence from the encoded fixed-length representation, and an attention layer that addresses word alignment and improves translation performance. The first proposed sequence-to-sequence model used Long Short-Term Memory (LSTM) [31] for both the encoder and decoder [32], and the first attention mechanism was additive attention [18]. It computes the attention score using a single-hidden-layer feedforward neural network with a softmax function, then generates the requisite fixed-length vector via a weighted sum over all input positions; this fixed-length vector becomes the input to the decoder at a particular time step. Multiplicative attention is another attention technique [33]. Like additive attention, it uses all encoder outputs together with the decoder's current state, and it computes the score using simple matrix multiplication.

While both attention mechanisms are comparable in complexity, multiplicative attention is faster in practice thanks to well-optimized matrix multiplication libraries, and it is more space-efficient. Following this work, the transformer model [19] was presented; it is based on multi-head scaled dot-product self-attention. The transformer allows parallelization and avoids vanishing and exploding gradients for long sequences, since it processes the full sequence at once rather than word by word.

The main building blocks of the transformer are the multi-head scaled dot product self-attention, position encoding, position-wise feed-forward network, residual connection, and layer normalization. In the following section, we discuss those building blocks in more detail.

3.1 Scaled Dot Product Self-attention

The closest analogue of self-attention is found in information retrieval systems, where a user query is matched against a set of keys to retrieve the best-matching data. Similarly, the self-attention function maps a query vector against a set of key vectors, then applies the softmax function to obtain a score that is applied to a value vector.

Figure 1 shows scaled dot-product self-attention in vectorized form, attending over a whole sequence at once using matrix multiplication, with masking employed to prevent the attention layer from attending to future output positions. Equation (1) summarizes the operations of the scaled dot-product self-attention layer.

Fig. 1

Dot product self-attention given query Q and key K and value V

$$\mathrm{Attention}\,\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{Q{K}^{\mathrm{T}}}{\sqrt{{d}_{k}}}\right)V.$$
(1)

For a set of queries \(Q\), each query is mapped against the set of keys \(K\); the product is scaled by the square root of the key dimension \(\sqrt{{d}_{k}}\), and a softmax is applied to obtain the score. The score is then applied to the value matrix \(V\). The \(Q\), \(K\), and \(V\) matrices are created from the input data, each with its own trainable weights, and the scaling is used to reduce performance degradation when training on longer sequences. The idea of multi-head attention is to use multiple self-attention heads with different weights so that the output carries more information. As shown in Fig. 2, multi-head attention is the concatenation of several scaled dot-product self-attention heads. Equation (2) extends Eq. (1) by performing several self-attention operations and concatenating their outputs into a single representation.

Fig. 2

Multi-head scaled dot product self-attention by concatenating multiple attentions heads

$$\mathrm{Multi\,Head} \, \left(Q,K,V\right)=\mathrm{Concat}\left({\mathrm{head}}_{1},\dots ,{\mathrm{head}}_{h}\right){W}^{O}$$
$$\mathrm{where} \,{\mathrm{head}}_{i}=\mathrm{Attention}(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V})$$
(2)

Each \({\mathrm{head}}_{i}\) is computed using Eq. (2), and all heads are concatenated and multiplied by the weight matrix \({W}^{O}\).
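For illustration, the following is a minimal NumPy sketch of scaled dot-product attention and its multi-head extension as defined in Eqs. (1) and (2); the function names, shapes, and random weights are purely illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq. (1): softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block future positions
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, num_heads):
    """Eq. (2): project per head, attend, concatenate, project with W^O."""
    heads = [scaled_dot_product_attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])
             for i in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ Wo

# Tiny shape check with random projections (d_model=16, 4 heads of size 4).
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 5, 16))                       # (batch, seq_len, d_model)
proj = lambda: [rng.normal(size=(16, 4)) for _ in range(4)]
Wq, Wk, Wv = proj(), proj(), proj()
Wo = rng.normal(size=(16, 16))
print(multi_head_attention(x, x, x, Wq, Wk, Wv, Wo, 4).shape)  # (1, 5, 16)
```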

3.2 Position Encoding

In recurrent neural networks, the position of a token affects the generated output because of the recurrent nature of the network.

The transformer uses positional encoding as a replacement for recurrence, as follows:

$${\mathrm{PE}}_{\left(\mathrm{pos},2i\right)}=\mathrm{sin}\left(\frac{\mathrm{pos}}{{10000}^{\frac{2i}{{d}_{\mathrm{model}}}}}\right),$$
(3)
$${\mathrm{PE}}_{\left(\mathrm{pos},2i+1\right)}=\mathrm{cos}\left(\frac{\mathrm{pos}}{{10000}^{\frac{2i}{{d}_{\mathrm{model}}}}}\right),$$
(4)

where \(\mathrm{pos}\) is the position, \(i\) is the element index, and \({d}_{\mathrm{model}}\) is the dimension of the embeddings. \({\mathrm{PE}}_{\left(\mathrm{pos},2i\right)}\) in Eq. (3) is used to compute elements at even indices, and \({\mathrm{PE}}_{\left(\mathrm{pos},2i+1\right)}\) in Eq. (4) is used to compute elements at odd indices.
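A minimal sketch of this sinusoidal encoding, assuming an even embedding dimension, could look as follows; it is illustrative rather than the paper's actual code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from Eqs. (3) and (4); assumes even d_model."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even indices, Eq. (3)
    pe[:, 1::2] = np.cos(angle)                       # odd indices, Eq. (4)
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)
```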

3.3 Position-Wise Feed-Forward Network

The attention layer is followed by a fully connected feed-forward network applied independently to each token position.

$$\mathrm{FFN}\left(x\right)=\mathrm{max}\left(0,x{W}_{1}+{b}_{1}\right){W}_{2}+{b}_{2}.$$
(5)

Equation (5) defines a two-layer fully connected feed-forward network with Rectified Linear Unit (ReLU) activation [34]. \({W}_{1}\) and \({b}_{1}\) are the first-layer weights and bias, the max function implements the ReLU activation, and \({W}_{2}\) and \({b}_{2}\) are the second-layer weights and bias.
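Expressed as a short illustrative sketch (the weight shapes are assumptions, not values from the paper), Eq. (5) is simply:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Eq. (5): max(0, xW1 + b1) W2 + b2, applied to every position of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```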

3.4 Residual Connection and Layer Normalization

Each sub-layer in the transformer is wrapped with a residual connection, which adds the sub-layer's input to its output, followed by layer normalization [35].

$${\mu }^{l}=\frac{1}{H}\sum_{i=1}^{H}{a}_{i}^{l},$$
(6)
$${\sigma }^{l}=\sqrt{\frac{1}{H} \sum_{i=1}^{H}{({a}_{i}^{l}- {\mu }^{l})}^{2}}.$$
(7)

In Eqs. (6) and (7), \(H\) is the size of the layer to be normalized and \({a}_{i}^{l}\) is the \(i\)-th element of layer \(l\); Eq. (6) computes the mean, and Eq. (7) uses the mean to compute the standard deviation with which the layer is normalized.
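The following is a minimal sketch of layer normalization based on Eqs. (6) and (7); the trainable gain and bias parameters used in practice are omitted for brevity, and the epsilon term is an assumption for numerical stability.

```python
import numpy as np

def layer_norm(a, eps=1e-6):
    """Eqs. (6) and (7): normalize a layer of size H by its mean and std."""
    mu = a.mean(axis=-1, keepdims=True)        # Eq. (6)
    sigma = a.std(axis=-1, keepdims=True)      # Eq. (7)
    return (a - mu) / (sigma + eps)
```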

4 Dataset and Baseline

This section covers the dataset used for this task and the assessment methods in more detail. The experiments were conducted on the MultiLexNorm shared task data [12], which comprises 12 datasets: Croatian, Danish, Dutch, English, German, Italian, Serbian, Slovenian, Spanish, Turkish, Indonesian-English, and Turkish-German. Table 2 summarizes the characteristics of these datasets and gives examples of noisy words from the training data together with their canonical forms.

Table 2 The characteristics of the MultiLexNorm datasets, such as word count, word splitting and merging, capitalization correction, and percentage of normalized words, along with instances of noise in each language and their canonical forms

For several languages in the dataset, words are split or merged (1-N/N-1 column) and capitalization (Caps column) is corrected. The dataset includes tweets for all languages; however, some languages also include data from other sources. The Danish data also contains texts from Arto, Denmark's first significant social media platform [36], while the Dutch data also includes texts from forums and SMS messages [37].

The organizers of the W-NUT workshop propose two evaluation methods: intrinsic, at the word level, and extrinsic, on downstream task performance. Two baselines are given: (1) Leave-As-Is (LAI): the result is identical to the initial raw input; no normalization is carried out. (2) Most-Frequent-Replacement (MFR): uses the most common replacement observed in the training data and returns the input word if the training data does not contain it.
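For clarity, a minimal sketch of the MFR baseline as described above is shown below; the helper names and the toy data are illustrative, not part of the shared task code.

```python
from collections import Counter, defaultdict

def build_mfr(train_pairs):
    """Most-frequent-replacement: map each raw word to the replacement
    it most often receives in the training data."""
    counts = defaultdict(Counter)
    for raw, gold in train_pairs:
        counts[raw][gold] += 1
    return {raw: c.most_common(1)[0][0] for raw, c in counts.items()}

def mfr_normalize(word, table):
    return table.get(word, word)   # unseen words are returned unchanged

table = build_mfr([("im", "i'm"), ("im", "i'm"), ("im", "im"), ("u", "you")])
print(mfr_normalize("im", table), mfr_normalize("hello", table))  # i'm hello
```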

The error reduction rate (ERR) introduced in [12] serves as the intrinsic assessment.

$$\mathrm{ERR}=\frac{\mathrm{TP} - \mathrm{Baseline\,acc.}(\mathrm{LAI})}{1.0-\mathrm{Baseline\,acc.}(\mathrm{LAI})}.$$
(8)

In Eq. (8), \(\mathrm{TP}\) denotes the true positives and \(\mathrm{Baseline\,acc.}(\mathrm{LAI})\) is the accuracy of the leave-as-is baseline. An ERR value of 0.0 indicates that a system always retains the raw words, whereas an ERR of exactly 1.0 indicates that a system is faultless. The ERR is negative if the system normalizes more words to an incorrect form than to the proper canonical form.
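The sketch below illustrates Eq. (8) on word-aligned toy data, assuming that the true-positive term corresponds to word-level accuracy as in the shared task evaluation; the official "normEval" script should be used for any reported results.

```python
def error_reduction_rate(raw, gold, pred):
    """ERR from Eq. (8): improvement over the leave-as-is (LAI) baseline,
    normalized by the room for improvement the baseline leaves."""
    assert len(raw) == len(gold) == len(pred)
    n = len(gold)
    system_acc = sum(p == g for p, g in zip(pred, gold)) / n
    lai_acc = sum(r == g for r, g in zip(raw, gold)) / n   # baseline accuracy
    return (system_acc - lai_acc) / (1.0 - lai_acc)

# Toy example: one of the two noisy words is normalized correctly.
raw  = ["im", "soooo", "happy"]
gold = ["i'm", "so", "happy"]
pred = ["i'm", "soooo", "happy"]
print(error_reduction_rate(raw, gold, pred))  # 0.5
```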

MultiLexNorm also provides a secondary evaluation that measures the effect of lexical normalization on downstream applications. A dependency parser is trained on both raw and canonical data to assess whether using normalized data is more performant than using the original data; the results show that lexical normalization enhances performance on this task [12].

5 Model

The objective of this study is to design a simple yet effective generative neural machine translation model. The model leverages the vanilla transformer-based sequence-to-sequence architecture introduced in [19] for the neural machine translation problem, but with custom preprocessing and tokenization for both input and output.

As shown in Fig. 3, the input to the model is one or more words tokenized into a sequence of characters, and the output is a sequence of generated words that is a normalized version of the input. We also include the actual raw word in the input to form a feature-rich representation, giving the model the opportunity to establish useful relationships between the raw input and its characters as needed.

Fig. 3

Example of processed input and the output of the proposed system

To do this, we augment the input with two types of data: the actual input words and a separation token <org> that separates the character sequence from the actual input. So, the input sequence [l, o, l] becomes [l, o, l, <org>, lol]. To create a class-aware model, the model learns to predict one extra token alongside the output. This extra token is one of two tokens: the <oov> token for words that need to be normalized and the <iv> token for standard words.

During training, the extra token is added to the output sequence label before the end-of-sequence token <eos>, based on the gold output. For example, the model is trained to map the input sequence [l, o, l] to [laughing, out, loud, <oov>] and the input sequence [l, a, u, g, h, i, n, g, <space>, o, u, t, <space>, l, o, u, d] to [laughing, out, loud, <iv>]. In this way, the model learns the difference between out-of-vocabulary (OOV) and in-vocabulary (IV) input. The inference engine then looks at the last token to decide whether to use the normalized output, if the predicted token is <oov>, or to return the input unchanged, if the token is <iv>. During the training phase, we train a single translation model for each language, with all models consisting of two transformer layers and an embedding size of 512.
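The following is a minimal sketch of this input/output construction, assuming a whitespace-tokenized training pair and using raw-equals-gold as the in-vocabulary check; the token names follow the paper, but the helper itself is illustrative.

```python
def build_example(raw: str, gold: str):
    """Build a (source, target) token pair for the character-to-word model.

    Source: character sequence of the raw word(s), an <org> separator,
    and the raw word(s) themselves.
    Target: gold word sequence plus a class token (<iv> if raw already
    equals gold, otherwise <oov>) followed by <eos>.
    """
    chars = ["<space>" if c == " " else c for c in raw]
    source = chars + ["<org>"] + raw.split()
    class_token = "<iv>" if raw == gold else "<oov>"
    target = gold.split() + [class_token, "<eos>"]
    return source, target

print(build_example("lol", "laughing out loud"))
# (['l', 'o', 'l', '<org>', 'lol'],
#  ['laughing', 'out', 'loud', '<oov>', '<eos>'])
print(build_example("laughing out loud", "laughing out loud")[1])
# ['laughing', 'out', 'loud', '<iv>', '<eos>']
```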

As shown in Fig. 4, the translation model receives a processed input sequence and output sequence: the input sequence is passed to the encoder, which generates a fixed representation of the input; the output sequence, together with the encoded input, is fed to the decoder, which then attempts to generate tokens sequentially to match the given output.

Fig. 4

Illustration of training phase of the proposed system

During the inference phase, the model receives a sequence of words to normalize. The language of the given words is automatically detected using a language-detection library; we utilized the “langdetect” package for this task. The detected language is used to select and load the trained language model, which then normalizes the input word by word. Similarly to the training phase, the model skips normalization of emails, websites, usernames, and hashtags and returns them as-is.

As illustrated in Fig. 5, the trained model generates the output using greedy decoding. The output is expected to end with a class token, either <oov> or <iv>. If the <oov> token appears in the output, the input is considered normalized: we strip the <oov> token from the string and return the rest of the prediction. Otherwise, we return the input as-is.
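A minimal sketch of this inference engine is shown below. The `langdetect.detect` call is the real library API; the `models` dictionary, the `greedy_decode` method, and the `skip_pattern` regex are hypothetical placeholders for the per-language models and filtering rules described in the text.

```python
from langdetect import detect  # pip install langdetect

def normalize_sentence(words, models, skip_pattern):
    """Pick a per-language model, translate each word's character sequence
    greedily, and keep the raw word unless the decoder ends with <oov>."""
    lang = detect(" ".join(words))          # e.g. 'en', 'da', 'tr'
    model = models[lang]                    # one trained model per language
    normalized = []
    for word in words:
        if skip_pattern.match(word):        # emails, URLs, @users, #hashtags
            normalized.append(word)
            continue
        output = model.greedy_decode(list(word) + ["<org>", word])
        if output and output[-1] == "<oov>":
            normalized.extend(output[:-1])  # use the generated normalization
        else:
            normalized.append(word)         # <iv>: keep the original word
    return normalized
```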

Fig. 5

The inference phase of a trained language model

6 Results and Evaluation

6.1 Training

We train one model for each of the 12 languages. Training is done on a personal laptop with a 12th-generation Core i5 CPU and an RTX 1650 GPU. As shown in Table 3, all language models are trained for 5 epochs with an embedding size of 512, 8 attention heads, 2 encoder layers, and 2 decoder layers.

Table 3 Training parameters for all language models
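As a rough illustration of this configuration, the sketch below instantiates a comparable encoder–decoder transformer in PyTorch; the framework choice, the vocabulary sizes, and everything else beyond the hyperparameters listed in Table 3 are assumptions for illustration only.

```python
import torch.nn as nn

# Hyperparameters from Table 3; vocabulary sizes are placeholders.
model = nn.Transformer(
    d_model=512,            # embedding size
    nhead=8,                # attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    batch_first=True,
)
src_embed = nn.Embedding(num_embeddings=200, embedding_dim=512)    # character vocab (placeholder)
tgt_embed = nn.Embedding(num_embeddings=30000, embedding_dim=512)  # word vocab (placeholder)
generator = nn.Linear(512, 30000)                                  # decoder states -> word logits
```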

The batch-size and layer-count hyperparameter tuning is summarized in the Ablation Study section. The embedding size was selected based on the available hardware capacity, and we manually selected the minimum number of epochs based on the final testing results. The configuration was chosen based on its performance on the English dataset using n-fold cross-validation and then fixed for the other languages. Some languages have separate validation data provided by the dataset providers; we used this validation data to monitor the model's training progress, and for the languages without validation data we used a subset of the testing data for this purpose. The model performs translation at the word level, but not all words in the training data are used for training; the model ignores emails, websites, usernames, and hashtags. Table 4 shows the actual number of sample lines selected for training as well as the sizes of the source and target vocabularies.

Table 4 The vocabulary size of the selected training samples from the original samples

The selected vocabulary size is the size obtained after removing emails, websites, usernames, and hashtags. While some of these items could be normalized, we believe that normalizing them would not affect downstream NLP tasks. This filtering results in slightly smaller training samples as well as smaller source and target vocabularies, which reduces training time without affecting the final output.
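One simple way to implement this filtering is with regular expressions; the patterns below are illustrative approximations, not the exact rules used in our pipeline.

```python
import re

# Rough patterns for tokens that are excluded from training and returned as-is.
SKIP_PATTERNS = [
    re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),    # emails
    re.compile(r"^(https?://|www\.)\S+$", re.I),  # websites
    re.compile(r"^@\w+$"),                        # usernames
    re.compile(r"^#\w+$"),                        # hashtags
]

def should_skip(token: str) -> bool:
    """True if the token is an email, URL, username, or hashtag."""
    return any(p.match(token) for p in SKIP_PATTERNS)

words = ["check", "http://t.co/abc", "@user", "#nlproc", "pls"]
print([w for w in words if not should_skip(w)])  # ['check', 'pls']
```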

Figure 6 shows the learning curve, represented as the loss over the five training epochs, for all language models.

Fig. 6

The change in training loss for each language model over the five training epochs

The decoder is trained to output a sequence of normalized words ending with one of the classification tokens <oov> or <iv>, and the loss reported in Fig. 6 is computed on the full output sequence; even when the <iv> classification is a true positive, the produced normalized word(s) can be wrong. An important part of the system is therefore the inference engine, a conventional conditional statement: it ignores the output sequence when the classification token is <iv> and returns the original input. The final test is performed on the output of the inference engine, not of the transformer model.

Figure 7 presents the training time for each language. The learning progress is affected by the dataset size and the diversity of symbols in each language.

Fig. 7

The training time for each language model in minutes

6.2 Intrinsic Evaluation

As previously mentioned, for the intrinsic assessment we use the Error Reduction Rate (ERR), which measures word-level accuracy normalized to the dataset's replacement rate.

The ERR macro-averaged across all datasets is used to establish the final ranking. The results were generated using the “normEval” script provided in the dataset repository.

Table 5 shows the evaluation terms generated for each language by the evaluation script. The first row shows the LAI baseline accuracy, and the second and third rows show the accuracy and the ERR value produced by the script.

Table 5 Evaluation terms generated by the “normEval” script when executed on each language model

Table 6 shows the results of the intrinsic evaluation on MultiLexNorm. The table compares the proposed model on each language with multiple teams from the workshop challenge, including the winning team.

Table 6 The results of the intrinsic evaluation on MultiLexNorm

The proposed model outperforms the baselines by a significant margin. The model achieved an average ERR across languages of 88.41, with a distinct model trained for each language; that is an improvement of 21.11 percentage points over the second-best performance of 67.3. Except for Croatian and Serbian, our model achieves the best results across all datasets.

7 Extrinsic Evaluation

Extrinsic evaluation aims to measure the effect of the trained model on downstream tasks. The workshop challenge provides seven social-media treebanks to be normalized with the trained model for the dependency parsing task, using parsers trained on standard text.

The MaChAmp parser [38] is used to train the dependency parsing models on standard-language treebanks from Universal Dependencies 2.8 [39]. The labeled attachment score (LAS) metric is used to evaluate the dependency parsing task.

In Table 7, we show the results of our best model on the dependency parsing downstream task from the workshop challenge for each treebank. Our model achieves the best performance on all treebanks and on the overall macro-average.

Table 7 Extrinsic evaluation results in LAS score

7.1 Ablation Study

To gain a better understanding of the model's behavior, several ablation experiments were conducted to evaluate the effectiveness of multiple training settings. Due to hardware limitations, we focused our experiments mainly on the preprocessing of the input data, the batch size, and the number of layers. Table 8 shows the intrinsic evaluation of these ablation experiments.

Table 8 Ablation study results

Input preprocessing: We tried using the raw character sequence as input without our extra tokens, but we found that adding the non-standard word itself to the training input character sequence led to significantly better performance for all language models, raising the average ERR from 75.44 to 86.94. Since this preprocessing proved necessary, we did not conduct any further experiments without it.

Batch size: We tried batch sizes of 8 and 32; the smaller batch size considerably improved results in all languages, raising the average ERR from 86.76 to 88.41 in our best model.

Number of layers: We tried two and three layers for both the encoder and decoder, and we found that the two-layer architecture achieved good results in a shorter time.

Both the batch size and the number of layers have small effects on overall model performance compared to the preprocessing. By providing more information in the input, the model has a better chance of collecting evidence to support its prediction decisions.

7.2 Discussion

As shown in Tables 6 and 7, the performance gap between the benchmark systems and our measured accuracy is considerable.

In all previous research, the models were trained at the character level, subword level, word level, syllable level, or full-sentence level. We argue that none of these data representations provides enough features for the learning process to extract sufficient patterns. We used a different approach.

First, we transform each input word into a sequence of characters, a special token <org>, and a token for the word itself, all in one sequence that represents the model's input. The main idea is to supply the input with as many facts as possible and rely on the feature-selection-like behavior of the transformer's attention mechanism to generate feature-rich encoded information from the encoder. The <org> token summarizes the character sequence in the presence of the full-word token. The model can use the character sequence and/or the character-summary token <org> and/or the full-word token to make decisions. This provides the input with information about how the word is formed and how it relates to its characters.

Second, we force the decoder to describe the output as OOV or IV in the final token. This gives the decoder the chance to use the feature-rich encoded information to learn any pattern that helps classify the input as OOV or IV, along with generating the normalized version of the input. This mimics multitask learning using the same output layer, with the weights updated accordingly.

We believe that reshaping the inputs to maximize the number of features fed to the model, and reshaping the outputs to encourage the model to search for more information in the inputs, greatly improves the overall system's performance.

8 Conclusion

This study shows that the lexical normalization task, as observed from the common lexical error types, can be viewed as a character-to-word alignment problem, and we used attention-based transformers to solve it. The proposed model is an end-to-end system trained from scratch without any pretrained language model. The normalization process requires two steps: detecting the text to be normalized, and then applying the normalization to generate the normalized text. Transformers with attention mechanisms can both detect and generate in a single process; the classification is performed with an extra label that determines the nature of the input, and the generated output is used only if the input word is classified as OOV. We used an encoder–decoder transformer model to perform character-to-word translation and solve the lexical normalization task in an end-to-end system. The character-to-word translation improves performance on unseen OOV words, with minimal feature engineering and a relatively small parameter count that speeds up processing.

Future research should investigate the possibility of combining our input and output transformation paradigm with a generative language model, such as LLaMA [40], for the lexical normalization task. Such a model could be used to generate a full normalized sentence without the inference engine used in this research.