Introduction

For many companies, customer feedback is a valuable source of information for determining customer satisfaction and opinions about products and services. This feedback enables companies to respond to complaints and requests, which is essential for their continued success. With a growing customer base, the amount of generated feedback can become unmanageable, and manually analyzing the data becomes very time-consuming and, in some cases, almost impossible. Additionally, feedback is often not addressed directly to the company but is found on various online platforms and social media, where its relevance to the company can only be determined with great effort. These platforms are becoming increasingly important as communication channels between customers and companies and usually contain vast amounts of feedback in the form of unstructured text data such as posts and tweets. Therefore, automated and intelligent data processing and analysis have become necessary to quickly and efficiently analyze large amounts of customer feedback and extract valuable insights that can be used to improve products and services.

In order to stay up-to-date with customer feedback, companies need to identify relevant feedback and then analyze it. One approach that has been gaining popularity in recent years is sentiment analysis. This method uses natural language processing (NLP) to automatically analyze text and determine whether the text carries positive, neutral, or negative opinions [1].

The aim of this paper is to develop models that can automatically classify customer feedback data according to their relevance and polarity. To achieve this objective, different machine learning and deep learning models were employed and compared in terms of their performance.

For this work, the GermEval 2017 [2] dataset was chosen to conduct different experiments. The GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback workshop was held to focus on the automatic processing of German language customer feedback, e.g., tweets about Deutsche Bahn, a German railroad company. The shared task was divided into four subtasks.

Subtask A: Relevance Classification - This subtask is a binary classification problem and focuses on determining whether a feedback document concerns Deutsche Bahn or not.

Subtask B: Document-Level Polarity - This subtask is a multi-class classification problem. According to their polarity (sentiment), the documents should be classified into three categories (positive, negative, or neutral).

Subtask C: Aspect-Level Polarity - For this subtask, all aspects contained in feedback must be identified. Each aspect should then be classified as positive or negative.

Subtask D: Opinion Target Extraction - The goal of the final subtask is to identify and extract all opinions in a document.

Only the first two subtasks (A and B) are considered in this work. The majority of the systems used utilize transformer-based [3] models, which are fairly new and have shown remarkable results in many tasks across multiple languages. The goal is to investigate whether the subtasks can benefit from these models and whether the micro-averaged F1-Score can be improved compared to the published scores. In addition, some of the language models are further pre-trained in a second phase using unlabeled domain-specific data with the aim of achieving domain adaptation. The obtained models are then used to examine their performance on the subtasks. The main contributions of this paper can be summarized as follows:

  • A comparative study of different approaches and models for the classification of customer feedback data.

  • An analysis of the effectiveness of domain adaptation and the performance of models after pre-training on domain-specific data.

  • A discussion on the challenges and limitations of applying deep learning approaches to such tasks.

The remainder of this paper is structured as follows. Section “Related Work” describes related work, in particular, the used datasets and the results of the GermEval 2017 shared task. Section “Proposed Method” details the proposed approach and describes the conducted experiments. Section “Results and Discussion” discusses the obtained results and the limitations of sentiment analysis. Finally, Sect. “Conclusion” presents the conclusion derived from this work.

Related Work

Text classification is a common problem in NLP and arises in a wide range of applications, such as sentiment analysis, question answering, and topic labeling [4].

Over the years, numerous machine learning approaches have been proposed and applied to tackle the task of sentiment analysis. Earlier non-deep-learning methods, such as support vector machines (SVM) [5], naïve Bayes (NB) [6], and maximum entropy, were widely used and considered the state of the art at the time [7]. These are usually paired with text representation approaches such as bag-of-words (BOW) or term frequency-inverse document frequency (TF-IDF) [8]. With the rise of deep neural networks, new approaches have been developed, starting with the introduction of word embedding models such as Word2Vec [9], global vectors (GloVe) [10], and fastText [11], which create word vectors with the goal of placing similar words close to each other in a vector space. Later, new approaches were developed, such as embeddings from language models (ELMo) [12], a deep contextualized word representation model that outperforms Word2Vec. Afterwards, Google introduced the first transformer-based model, BERT (bidirectional encoder representations from transformers) [13], which achieved remarkable results in many tasks and started the trend of large transformer-based models. These models are usually pre-trained on large-scale unlabeled task-independent corpora to learn universal language representations. After BERT, models such as RoBERTa [14] and ELECTRA [15] were introduced as improvements over BERT by incorporating new pre-training methods (see Sect. 3.2).

Although such transformer-based models outperform previous approaches in various NLP tasks, they might struggle when a task corpus is overly focused on a specific domain [16]. In this context, Gururangan et al. [17] investigated whether additional pre-training on domain-specific data can be beneficial. The authors propose domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) and conduct experiments on eight classification tasks from four domains to verify the effectiveness of these approaches. For DAPT, unlabeled domain-specific corpora are used, whereas TAPT utilizes the unlabeled task data. The achieved results show that these approaches can improve pre-trained language models and that the best performance is reached when the two are used in combination. Adapting transformer-based models to a specific domain has already shown success with models such as BioBERT [18], which was initialized from general BERT, further pre-trained on biomedical text such as PubMed data, and thereby able to improve the results on biomedical tasks. One drawback of these approaches is that even though the model is adapted to the domain, the vocabulary does not contain domain-specific words. This leads to such words being split into multiple sub-words, which can hinder the model's learning and degrade its performance. As a possible solution, Souza et al. [19] applied a language-specific adaptation by using a language-specific vocabulary, generated over the target-language text, to train a Brazilian Portuguese model. The model was initialized with multilingual BERT (mBERT), further pre-trained on Brazilian Portuguese text, and was able to improve the performance on a variety of tasks.

GermEval 2017 Data

The data were collected from different internet sources, including social media, microblogs, news, and Q&A sites, over the span of one year (May 2015 - June 2016) and were annotated afterward [2]. The obtained dataset consists of around 26,000 annotated documents, which were randomly split into 80 % training, 10 % development, and 10 % test data. Additional data were collected from November 2016 to January 2017 to create a further test set. The first test set was called synchronic because it originated from the first data collection, whereas the second was created later and was therefore named diachronic. The number of documents in each split is shown in Table 1.

Table 1 Number of documents in each split

For the subtasks, data are available in two file formats: tab-separated values (TSV) and extensible markup language (XML). For this work, only the TSV format is used, which contains the following tab-separated fields: document ID (URL), document text, relevance (true or false), document-level polarity (neutral, positive or negative).
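To make the data handling concrete, the following sketch shows one possible way to load such a TSV split with pandas. The column names and the file name are illustrative assumptions that merely mirror the fields listed above; they are not taken from the official GermEval tooling.

```python
import pandas as pd

# Hypothetical loader for a GermEval 2017 TSV split; the column names and the
# file name are assumptions that mirror the fields described above.
COLUMNS = ["url", "text", "relevance", "sentiment"]

def load_split(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, sep="\t", names=COLUMNS, header=None,
                     quoting=3, dtype=str)  # quoting=3 (QUOTE_NONE) keeps raw tweet text intact
    return df.dropna(subset=["text"])

train_df = load_split("train_v1.4.tsv")  # file name is an assumption
print(train_df["relevance"].value_counts())
print(train_df["sentiment"].value_counts())
```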

Tables 2 and 3 show the distribution of each class in the different data splits for the two subtasks.

Table 2 Relevance distribution in subtask A data
Table 3 Sentiment distribution in subtask B data

Table 4 describes different corpus statistics of the dataset: the count of unique unigrams, bigrams, and trigrams as well as the mean length of the text documents calculated on preprocessed and lowercased data. The applied preprocessing techniques are discussed in Sect. 3.1.

Table 4 GermEval 2017 corpus statistics
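The reported statistics can be approximated with a few lines of Python, as sketched below; whitespace tokenization of the already preprocessed, lowercased documents is an assumption and may differ from the exact counting procedure used for Table 4.

```python
from itertools import islice

def ngrams(tokens, n):
    # Yield consecutive n-grams from a token list.
    return zip(*(islice(tokens, i, None) for i in range(n)))

def corpus_statistics(texts):
    # texts: iterable of preprocessed, lowercased documents (whitespace-tokenized here).
    unique = {1: set(), 2: set(), 3: set()}
    lengths = []
    for doc in texts:
        tokens = doc.split()
        lengths.append(len(tokens))
        for n in unique:
            unique[n].update(ngrams(tokens, n))
    return {
        "unigrams": len(unique[1]),
        "bigrams": len(unique[2]),
        "trigrams": len(unique[3]),
        "mean_length": sum(lengths) / max(len(lengths), 1),
    }

print(corpus_statistics(["die bahn ist verspätet", "der zug fällt aus"]))
```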

Examples from the training dataset for subtask A and subtask B are shown below in Table 5.

Table 5 Examples for document relevance and sentiment

For this work, additional unlabeled German tweets were collected from Twitter with the goal of continuing the pre-training of one of the language models described in Sect. 3.2 using masked language modeling. Similar to the original data, all collected tweets contain the term “bahn” and originate from the period between January 2017 and October 2021. In German, the search term can also refer to words that are not associated with trains or railroads, implying that a portion of the collected data is noisy and does not necessarily belong to the domain of the task. These data were crawled using snscrape and consist of 1,199,280 tweets. Table 6 lists the statistics of the dataset, which are obtained in the same way as those in Table 4.

Table 6 Corpus statistics of the unlabeled tweets collected using the search term “bahn”
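A minimal sketch of how such a crawl can be performed with snscrape is given below. The query string, language filter, date range, and the cap on the number of tweets are assumptions rather than the configuration actually used for this work, and the attribute holding the tweet text may be named differently in other snscrape versions.

```python
import snscrape.modules.twitter as sntwitter

# Illustrative crawl; query, language filter, and cap are assumptions.
query = "bahn lang:de since:2017-01-01 until:2021-10-31"

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 10_000:  # small cap for demonstration; the full crawl yielded ~1.2M tweets
        break
    tweets.append(tweet.content)  # `content` holds the tweet text in the snscrape version assumed here

print(len(tweets))
```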

GermEval 2017 Results

The GermEval 2017 [2] shared task witnessed the participation of 8 teams that used a variety of approaches. All the teams participated in subtask B and 5 of them in subtask A.

Before applying these approaches, the majority of the teams performed thorough data preprocessing, in which, e.g., URLs, hashtags, handles, and emojis were either removed or replaced with special tokens. Additionally, some teams used lemmatizers, part-of-speech taggers, and spell checkers. Punctuation characters were removed by some teams and kept by others, as was capitalization.

To evaluate how well these systems perform on the independent test sets, the micro-averaged F1-Score was used. The F1-Score, short for \(F_{\beta =1}\), is generally defined as follows:

$$\begin{aligned} F_{\beta } = \frac{\left( \beta ^{2}+1\right) \times \text{ Precision } \times \text{ Recall } }{\beta ^{2} \times \text{ Precision } + \text{ Recall } } \end{aligned}$$
(1)

The parameter \(\beta\) is used to control the balance of recall and precision [20]. When using \(\beta = 1\), recall and precision are equally balanced and the formula simplifies to:

$$\begin{aligned} F_{1} = \frac{2 \times \text{ Precision } \times \text{ Recall } }{ \text{ Precision } + \text{ Recall } } \end{aligned}$$
(2)

Since this measure is usually used for binary classification problems and the second subtask is a multi-class problem, micro-averaging of the scores is needed, which aggregates the individual per-document decisions across all classes to compute the average score [21]. This averaging method gives equal weight to each classification decision, so the performance on a large class has a higher impact on the result than that on a small class. The micro-averaged results can thus be seen as a measure of the effectiveness on the large classes, which is preferable when dealing with imbalanced class distributions, as is the case for the GermEval 2017 datasets. The best results from GermEval 2017 as well as from other publications are reported in Table 7.
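For reference, the micro-averaged F1-Score can be computed with scikit-learn as in the short example below; the labels are purely illustrative. In a single-label multi-class setting, the micro-averaged F1-Score coincides with accuracy.

```python
from sklearn.metrics import f1_score

# Micro-averaging aggregates the per-document decisions over all classes,
# so large classes dominate the score (illustrative labels only).
y_true = ["negative", "neutral", "neutral", "positive", "negative"]
y_pred = ["negative", "neutral", "negative", "positive", "neutral"]

print(f1_score(y_true, y_pred, average="micro"))  # 0.6
```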

The winners of the first subtask [22] on the synchronic test set as well as both subtasks on the diachronic set used word and character n-grams for text representation, in combination with feature selection based on information gain and L1-regularization (Lasso) [23]. In addition, they used adaptive synthetic sampling [24] to compensate for imbalances in the class distribution. They performed classification using XGBoost [25], which is a specific implementation of regularized gradient boosted trees.

Naderalvojoud et al. [26] participated only in the second subtask and achieved the best score on the synchronic test set. They developed a model that utilizes three different German sentiment lexicons that are built using the translation of English lexicons, such as SentiWordNet [27] and SentiSpin [28]. In this system, a deep recurrent neural network (RNN) was used to learn contextual sentiment weights and thus to change the polarity of terms depending on the context of their use.

In order to solve subtasks A and B, Hövelmann & Friedrich [29] developed different models and systems. Their best model was based on a fastText classifier [30] enhanced with pre-trained word vectors and reached the second-best score in both subtasks. Furthermore, a gradient boosted trees (GBT) classifier was developed, which was trained on bag-of-words features combined with linguistic inquiry and word count (LIWC) [31] features. In addition, other models were implemented that used Word2Vec embeddings in combination with classifiers such as GBTs or a feedforward multilayer perceptron (MLP).

Sidarenka [32] developed three systems for the second subtask. The first was an SVM classifier trained on a variety of features, including character-level features, word-level features, part-of-speech features, and lexicon features. The second was a bidirectional long short-term memory (Bi-LSTM) network [33] trained using word embeddings. The last system combined the two into an ensemble and achieved the third-best score among the 8 participants.

Other participants tried other approaches such as using different lexica or other word embeddings like GloVe instead of Word2Vec or fastText. In addition, some teams used classifiers like conditional random field (CRF) or a stacked learner [34], which is an ensemble-based method that uses several base classifiers from scikit-learn [35] and a multilayer perceptron as a meta-classifier to combine the predictions of the base classifiers.

After the GermEval 2017 shared task, [36] conducted experiments using a lexicon-based Bi-LSTM model that yielded slightly better results on the sentiment analysis subtask. These results were then outperformed by Biesialska et al. [37], who proposed a transformer-based sentiment analysis (TSA) approach that leverages ELMo contextual embeddings in a model based on the transformer architecture. This approach achieves better results than all reported results for this subtask. In regard to the first subtask, Parcheta et al. [38] experimented with multiple text encoding techniques, such as byte pair encoding (BPE) [39], GloVe, and BERT. To generate the BERT embeddings, they used a small multilingual model trained on 104 different languages. The generated embeddings were then used with different architectures, such as a convolutional neural network (CNN) [40], RNN, LSTM, and gated recurrent units (GRU) [41]. The variety of embeddings and architectures resulted in numerous models that performed better than the winning systems of GermEval 2017. The best model uses a combination of BERT and BPE text encodings with a single-layer CNN implementation. In a recent work [42], the authors re-evaluated the GermEval 2017 subtasks using pre-trained language models and achieved the best reported scores for both tasks using the German BERT model bert-base-german-dbmdz-uncased.

Table 7 Best results on the synchronic and diachronic test sets for subtask A on relevance classification and subtask B published in GermEval 2017 and in other works after the competition

Proposed Method

In this section, the proposed method used to conduct the experiments is described, as illustrated in Fig. 1. This includes the data preprocessing and the variety of models that are used.

Fig. 1 Outline of the proposed method

In the first part of this work, transformer-based language models are fine-tuned on the previously mentioned downstream tasks to investigate how they compare to earlier systems. Then, one of the pre-trained models is further trained using masked language modeling on unlabeled domain data, unlabeled task data, and combinations of both to adapt the language models to the specific domain of the downstream tasks. These models are then experimented with and compared to previous results.

Preprocessing

Before training models and making predictions, the raw data need to be preprocessed to remove the noise in the text. First, duplicates and empty text documents are removed, and punctuation marks are deleted. As an exception, repetitions of question marks, exclamation points, and periods are replaced by the terms “strongquestion”, “strongexclamation”, and “annoyeddots”. Furthermore, URLs and numbers are replaced by the terms “URL” and “number”, whereas other numerical tokens such as money amounts, dates, and times are replaced by “money” and “dates”. Since many documents originate from Twitter, usernames are replaced by “twitterusername”, except for the usernames related to Deutsche Bahn such as @DB_Bahn, @Bahnansagen, or @Bahn_Info, which are pooled by replacing them with “dbusername”. Additionally, hashtags mentioned in tweets are modified by removing the “#” character. Words like “S-Bahn” and “S Bahn” are also combined to the term “sbahn”. Moreover, before all punctuation marks are removed, the emoticons “:(” and “:-(” are replaced by the token “sadsmiley”; “:)”, “:-)”, “;-)”, “:-))” and “:D” by “happysmiley”; and “:-D” and “XD” by “laughingsmiley”. For all other possible emoticons, the term “emote” is used. Finally, whitespace and unicode characters like emojis are removed. Except for the fastText model and the uncased models, no lowercase folding is applied, since all other models are trained on cased data. Removing stop words and replacing German umlauts (“ä”, “Ä”, “ö”, “Ö”, “ü” and “Ü”) as well as ligatures (e.g. “ß”) were briefly tested but did not show any improvements and were not used. Table 8 shows an example of a document before and after applying the mentioned preprocessing techniques. For the domain adaptation using masked language modeling, no preprocessing was applied to the unlabeled data.

Table 8 Example of a document before and after the preprocessing
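The function below is a rough sketch of the described preprocessing. The concrete regular expressions, their ordering, and the omission of some rules (money and date tokens, the generic “emote” placeholder) are simplifications; the original pipeline may differ in detail.

```python
import re

# Twitter handles of Deutsche Bahn that are pooled into a single token.
DB_HANDLES = {"@DB_Bahn", "@Bahnansagen", "@Bahn_Info"}

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", "URL", text)
    for handle in DB_HANDLES:
        text = text.replace(handle, "dbusername")
    text = re.sub(r"@\w+", "twitterusername", text)       # remaining usernames
    text = re.sub(r"#(\w+)", r"\1", text)                 # drop the '#' of hashtags
    text = re.sub(r"[sS][- ][bB]ahn", "sbahn", text)
    text = re.sub(r"\?{2,}", " strongquestion ", text)    # handled before punctuation removal
    text = re.sub(r"!{2,}", " strongexclamation ", text)
    text = re.sub(r"\.{2,}", " annoyeddots ", text)
    text = re.sub(r":-?\(", " sadsmiley ", text)
    text = re.sub(r"(:-?\)+|;-\)|:D)", " happysmiley ", text)
    text = re.sub(r"(:-D|XD)", " laughingsmiley ", text)
    text = re.sub(r"\d+", "number", text)
    text = re.sub(r"[^\w\s]", " ", text)                  # remaining punctuation, emojis, etc.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("@DB_Bahn Die S-Bahn fällt schon wieder aus!!! :( https://t.co/xyz"))
```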

System Description

In an attempt to reach high scores in the first two subtasks, different systems and models were used. For the additional pre-training and fine-tuning of the transformer-based models, an NVIDIA A100 GPU was used. The pre-training time usually varied between two and three hours, whereas the fine-tuning took around 8 minutes for the base models and 20 minutes for the large ones.

As a baseline, a fastText classifier [30] was trained on the preprocessed text. This classifier is constructed with the goal of predicting a class or label instead of a word, which is the case when an unsupervised algorithm like continuous bag-of-words (CBOW) is used to generate word embeddings. In addition to word embeddings, fastText also uses character-level n-grams, which makes it capable of handling morphologically rich languages like German as well as rare or unseen words. Deviating from the default configuration, the dimensionality of the word vectors was set to 50 and the learning rate was initialized with 0.1. For the loss computation, softmax was used, and the number of word n-grams was set to 4. The classifier was then trained for 20 epochs. To ensure the reproducibility of the results, the number of used threads was set to 1. As an additional model, these parameters and the collected tweets were used to generate word embeddings, which were then utilized with the fastText classifier.
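A possible configuration of this baseline with the official fastText Python bindings is sketched below; the training file name and the `__label__` prefix format are assumptions about how the preprocessed data would be prepared.

```python
import fasttext

# Baseline classifier with the hyperparameters reported above. "train.txt" is
# assumed to contain one preprocessed document per line, prefixed with
# "__label__<class>" as required by fastText's supervised mode.
model = fasttext.train_supervised(
    input="train.txt",
    dim=50,          # dimensionality of the word vectors
    lr=0.1,          # initial learning rate
    loss="softmax",
    wordNgrams=4,
    epoch=20,
    thread=1,        # single thread for reproducibility
)

# For the additional variant, word vectors trained on the collected tweets
# could be supplied via the `pretrainedVectors` argument (assumption).
labels, probs = model.predict("die bahn ist schon wieder verspätet")
print(labels, probs)
```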

To further improve the results, transformer-based models were used, starting with BERT [13]. This model was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT language model can be fine-tuned with only one additional output layer to build models that achieve remarkable results on different tasks. The authors published two models: BERT\(_{Base}\) and BERT\(_{Large}\). Details of these models are shown in Table 9. In addition, they also released a multilingual model, which was trained on cased data in 104 languages.

These models were pre-trained using two objectives: masked language modeling and next sentence prediction. In the first, instead of predicting every next token, the model only predicts a percentage of randomly “masked” words in a sentence. The second is a binary classification task in which the model predicts whether the second sentence actually follows the first. Due to the computational expense of training such models from scratch, all used models were already pre-trained by other organizations and made publicly available. The first German model was published by the German company Deepset AI. It was trained on a German Wikipedia dump, court decisions, and news articles. The Digitale Bibliothek Münchener Digitalisierungszentrum (DBMDZ) released two additional German models, cased and uncased, which were trained on a German Wikipedia dump, the European Union Bookshop corpus, OpenSubtitles, and web crawls. The two teams joined forces and released two new BERT models (GBERT\(_{Base}\) and GBERT\(_{Large}\)) [43], which outperform the previously released models and were trained on four different datasets: the German portion of the Open Super-large Crawled ALMAnaCH coRpus (OSCAR) [44], a German Wikipedia dump, The Open Parallel Corpus (OPUS) [45], and Open Legal Data [46]. Additional information about these models and the other models used in the experiments is listed in Table 10.
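As a small illustration of the masked language modeling objective, the fill-mask pipeline below queries one of the publicly available German BERT checkpoints; the checkpoint name and the example sentence are assumptions and are not tied to the experiments in this work.

```python
from transformers import pipeline

# Masked language modeling illustration with a public German BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

for prediction in fill_mask("Die [MASK] ist heute wieder verspätet."):
    print(prediction["token_str"], round(prediction["score"], 3))
```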

After the release of BERT, a Robustly Optimized BERT pre-training approach (RoBERTa) [14] was introduced. It enhances BERT by changing the pre-training procedure: training the model longer, on more data and longer sequences, removing the next sentence prediction objective, and using improved hyperparameters. For this work, GottBERT [47] was utilized, a German RoBERTa\(_{Base}\) model trained on the German part of the OSCAR data.

Table 9 Details of the different types of transformer-based models

In addition to these models, the transformer-based model XLM RoBERTa [48] is used, which is a cross-lingual language model that was trained on 2.5TB of data across 100 languages. XLM RoBERTa outperforms previous multilingual approaches by incorporating more training data and languages, including low-resource languages. Although multiple XLM RoBERTa models are available, only the large model was considered for this work.

Another tested model is ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [15]. This model introduced a different approach to language pre-training, where it uses another task called replaced token detection (RTD). Instead of masking the input by replacing some words with the token “[MASK]” as in BERT, ELECTRA corrupts the input tokens by replacing them with synthetically generated tokens. It then trains the model to distinguish between “real” and “fake” input data. This is achieved using a discriminator that classifies the tokens and a generator which provides plausible fake tokens. Both transformer-based components are trained jointly. In addition to the two BERT models, Chan et al. [43] also released two German ELECTRA models: a base model and a large one. In their benchmarking, the large model reached state-of-the-art performance in three downstream tasks.

To adapt the language models to the task domain, multiple experiments were conducted with GBERT\(_{Large}\). These experiments were based on continued pre-training using masked language modeling on the collected tweets and on combinations of the unlabeled task data and parts of the unlabeled tweets. Additional experiments were conducted by expanding the vocabulary of the pre-trained model with around 20k new words from the unlabeled tweets, where the embeddings of the new tokens are initialized randomly. Moreover, in some experiments 30 % of the tokens were masked during masked language modeling instead of 15 %, since a recent study [49] found that masking more than 15 % of the tokens can be beneficial in some cases.
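These two modifications, vocabulary expansion and a higher masking rate, can be sketched with the Transformers library as follows; the checkpoint name is assumed to be the public GBERT large model, and the short word list merely stands in for the roughly 20k words mined from the tweets.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Assumed public checkpoint for GBERT large.
checkpoint = "deepset/gbert-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Stand-in for the ~20k domain words extracted from the unlabeled tweets;
# the embeddings of the new tokens are initialized randomly by resizing.
new_words = ["schienenersatzverkehr", "zugausfall"]  # illustrative only
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Mask 30 % of the tokens instead of the default 15 % during continued
# masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.3)
```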

Table 10 Details about the used pre-trained models

For the baseline classifier, fastText (version 0.9.1) was used. The training and fine-tuning of transformer-based models were conducted using the Transformers [50] library by Hugging Face. For all models, the same hyperparameters were used (see Table 11). In addition, the continued pre-training of the models on the unlabeled domain and task data was performed for 5 epochs using the same hyperparameters, except for the maximum sequence length, which was set to 512.

Table 11 Hyperparameters used for the transformer-based models
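A minimal fine-tuning sketch with the HuggingFace Trainer is given below. The hyperparameter values are placeholders rather than the ones from Table 11, and the two in-memory documents merely stand in for the actual GermEval splits.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "deepset/gbert-large"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Tiny in-memory stand-in for the subtask B training split (illustrative only).
train_ds = Dataset.from_dict({
    "text": ["Die Bahn ist mal wieder zu spät.", "Super Service im ICE!"],
    "label": [0, 2],
})
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                                max_length=256), batched=True)

args = TrainingArguments(
    output_dir="gbert-germeval",
    num_train_epochs=3,              # placeholder; actual values are listed in Table 11
    per_device_train_batch_size=16,  # placeholder
    learning_rate=2e-5,              # placeholder
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```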

Results and Discussion

Table 12 shows the cross-validation results for both subtasks on the training set. All average scores were obtained using 5-fold cross-validation and are complemented by their respective standard deviations.

Table 12 Results of the average score and standard deviation using 5-fold cross-validation on the training set for subtask A on relevance classification and subtask B on sentiment detection

To analyze the results of the cross-validation and assess how the systems perform against the fastText model, which can be considered a strong baseline, a one-sided Wilcoxon rank-sum test [52] was used. The tests were conducted using the statistical programming language R (version 4.1.2) [53] with the significance level \(\alpha = 0.05\).

Since all p-values of the conducted tests are below the significance level, the null hypothesis can be rejected, which indicates that all systems outperform the fastText model.
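The tests in this work were carried out in R; for illustration, an equivalent one-sided rank-sum test can be expressed in Python with SciPy (version 1.7 or later for the `alternative` argument). The fold scores below are illustrative, not the reported values.

```python
from scipy.stats import ranksums

# One-sided Wilcoxon rank-sum test: does the transformer model score higher
# than the fastText baseline across the 5 cross-validation folds?
gbert_scores = [0.905, 0.901, 0.908, 0.903, 0.906]      # illustrative fold scores
fasttext_scores = [0.885, 0.882, 0.889, 0.884, 0.887]   # illustrative fold scores

stat, p_value = ranksums(gbert_scores, fasttext_scores, alternative="greater")
print(p_value, p_value < 0.05)  # reject H0 at alpha = 0.05 if True
```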

Table 13 shows the results obtained on the test datasets for both subtasks using systems trained on the training and development sets. Based on the scores, all transformer-based systems outperform the baseline model. For the first subtask, the off-the-shelf GBERT\(_{Large}\) improved the micro-averaged F1-Score obtained by the winning system of GermEval 2017 by about +5.6 percentage points on the synchronic test set and by +4.7 percentage points on the diachronic test set. It also outperforms the best previously reported scores [38] by +0.2 and +0.5 percentage points, respectively. The scores of this model were further slightly improved after a second pre-training phase using additional domain data, with the best scores reaching 96.1 % and 95.9 %. Among the off-the-shelf models, GBERT\(_{Large}\) and GELECTRA\(_{Large}\) also reached the best scores on the second subtask. GBERT\(_{Large}\) improves upon the best GermEval systems by a margin of +8.8 percentage points on the first test set, whereas GELECTRA\(_{Large}\) reached a +9.3 percentage point improvement on the second one. Compared to the best score in [42], these models reached scores that are +3.0 percentage points better on the synchronic set and +4.3 percentage points better on the diachronic set. Similar to the results of the first subtask, the highest scores (85.1 % and 85.3 %) were reached when continued pre-training on the domain and task data was applied.

These results indicate that continuing to pre-train language models on domain-specific unlabeled data as well as on the task data usually improves results, as was the case for GBERT\(_{Large}\) (see Table 13). Increasing the masking percentage can also improve the results, but not in every case. Furthermore, adding new domain words to the vocabulary of the pre-trained model did not show any improvements and sometimes led to slightly worse results than the off-the-shelf pre-trained models, which may be a consequence of the random initialization of the new embeddings. Although the continued pre-training showed some improvements, it would probably be even more beneficial with higher-quality data that has been properly selected and filtered.

It is also noteworthy that, despite being a multilingual model, XLM RoBERTa outperforms most of the German base-sized models on both subtasks.

Table 13 Results on the synchronic and diachronic test sets of the different systems trained on the training and development datasets for subtask A on relevance classification and subtask B on sentiment detection, as well as comparison with results from other publications

Although the transformer-based models showed that they can reach remarkable results, one of their disadvantages is that using them is computationally expensive and can require specific hardware such as GPUs, which can become costly when used in production at scale. Moreover, training these models takes considerably more time compared to models such as fastText, which usually needs only a couple of minutes on a single CPU. Besides these disadvantages, there are also some challenges and limitations when dealing with sentiment analysis in social networks. Determining the sentiment of tweets, for example, where the text is usually coupled with hashtags, emojis, and links, can be very difficult [54]. Additionally, analyzing a textual expression from a semantic point of view can be crucial to detecting the underlying sentiment [55]. This is usually not taken into consideration in sentiment analysis, where a sentence is taken just as it is, which can result in wrong interpretations. Furthermore, using a word or phrase that entails an intentional deviation from its literal definition can hinder the detection of the sentiment that is actually expressed. This is the case with sarcasm and irony, which are usually difficult to recognize, not only for machines but also for humans. This difficulty can result in poor performance even for state-of-the-art systems. In addition, collecting data can be challenging, since searching for a specific term can result in collecting unwanted data. For example, the term bahn does not only refer to the train in German but can also refer to a track or anything that can be laid in straight lines, such as Laufbahn (Engl. running track) [2]. This problem can be avoided if the data are carefully filtered beforehand. Another limitation that needs to be considered is the possibility that the provided annotations are not entirely correct, so that some predictions are counted as wrong even though they are actually correct. An example of this case is shown in Table 14, where the document can be considered positive but was mistakenly annotated as negative, and the prediction is therefore treated as a wrong classification.

Table 14 Example for a poorly annotated document

Conclusion

Transformer-based architectures such as BERT, ELECTRA, RoBERTa, and XLM RoBERTa were used in this work for the first two subtasks of GermEval 2017. They showed a remarkable improvement over the results reached when the competition was held, as well as over those reported by subsequent works. Moreover, the conducted experiments revealed that adapting these pre-trained models to the domain using unlabeled task and domain data can further improve the achieved results.

The findings show that continuing to pre-train language models on task data and domain-specific unlabeled data is a concept worth considering whenever the initial language model data are not specific to the intended use case. Such improvements are the reason why these models have gained massive popularity in recent years across different NLP downstream tasks. This success is due to the improved context understanding that heavily benefits from pre-training huge language models on enormous amounts of data. Although the results showed that using domain-specific data leads to improvements over off-the-shelf models, an extended analysis of the used data is needed. Such an analysis can help determine how to best use domain-specific data for additional pre-training and how the quality of the data can improve the models.

In addition to the already mentioned models, other transformer-based models such as ALBERT, XLNet, DeBERTa, and T5 have also been released. Unfortunately, for most of these, no German pre-trained models are available. In future work, these models can be tested to investigate how they compare with the reported systems, assuming that more German models are released. On the other hand, there is already evidence that multilingual and novel transformer architectures can perform similarly well to language-specific and domain-specific language models under certain conditions. Thus, a comparison of novel multilingual architectures, such as modular transformers [56], with the results of domain-adapted models is also a useful direction for future work.