Introduction

For many companies, customer feedback is a valuable source of information for determining customer satisfaction and opinions about products and services. This feedback enables companies to respond to complaints and requests, which is essential for their continued success. With a growing customer base, the amount of generated feedback can become unmanageable, and manually analyzing the data becomes very time-consuming and, in some cases, almost impossible. Additionally, feedback is often not addressed directly to the company but is found on various online platforms and social media, where its relevance to the company can only be determined with great effort. These platforms are becoming increasingly important as communication channels between customers and companies and usually contain vast amounts of feedback in the form of unstructured text data such as posts and tweets. Therefore, automated and intelligent data processing and analysis have become necessary to quickly and efficiently analyze large amounts of customer feedback and extract valuable insights that can be used to improve products and services.

In order to stay up-to-date with customer feedback, companies need to identify relevant feedback and then analyze it. One approach that has been gaining popularity in recent years is sentiment analysis. This method uses natural language processing (NLP) to automatically analyze text and determine whether the text carries positive, neutral, or negative opinions [1].

The aim of this paper is to develop models that can automatically classify customer feedback data according to their relevance and polarity. To achieve this objective, different machine learning and deep learning models were employed and compared in terms of their performance.

For this work, the GermEval 2017 [2] dataset was chosen to conduct different experiments. The GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback workshop was held to focus on the automatic processing of German language customer feedback, e.g., tweets about Deutsche Bahn, a German railroad company. The shared task was divided into four subtasks.

Subtask A: Relevance Classification - This subtask is a binary classification problem and focuses on determining whether a feedback document concerns Deutsche Bahn or not.

Subtask B: Document-Level Polarity - This subtask is a multi-class classification problem. According to their polarity (sentiment), the documents should be classified into three categories (positive, negative, or neutral).

Subtask C: Aspect-Level Polarity - For this subtask, all aspects contained in feedback must be identified. Each aspect should then be classified as positive or negative.

Subtask D: Opinion Target Extraction - The goal of the final subtask is to identify and extract all opinions in a document.

Only the first two subtasks (A and B) are considered in this work. The majority of the systems used utilize transformer-based [3] models, which are fairly new and have shown remarkable results in many tasks across multiple languages. The goal is to investigate whether the subtasks can benefit from these models and whether the micro-averaged F1-Score can be improved compared to the published scores. In addition, some of the language models are further pre-trained in a second phase using unlabeled domain-specific data with the aim of achieving domain adaptation. The obtained models are then used to examine their performance on the subtasks. The main contributions of this paper can be summarized as follows:

  • A comparative study of different approaches and models for the classification of customer feedback data.

  • An analysis of the effectiveness of domain adaptation and the performance of models after pre-training on domain-specific data.

  • A discussion on the challenges and limitations of applying deep learning approaches to such tasks.

The remainder of this paper is structured as follows. Section “Related Work” describes related work, in particular, the used datasets and the results of the GermEval 2017 shared task. Section “Proposed Method” details the proposed approach and describes the conducted experiments. Section “Results and Discussion” discusses the obtained results and the limitations of sentiment analysis. Finally, Sect. “Conclusion” presents the conclusion derived from this work.

Related Work

Text classification is a common problem in NLP and arises in a wide range of applications, such as sentiment analysis, question answering, and topic labeling [4].

Over the years, numerous machine learning approaches have been proposed and applied to tackle the task of sentiment analysis. Earlier non-deep-learning methods, such as support vector machines (SVM) [5], naïve Bayes (NB) [6], and maximum entropy, were widely used and considered the state of the art at the time [7]. These are usually paired with text representation approaches such as bag-of-words (BOW) or term frequency-inverse document frequency (TF-IDF) [8]. With the rise of deep neural networks, new approaches have been developed, starting with the introduction of word embedding models such as Word2Vec [9], global vectors (GloVe) [10], and fastText [11], which create word vectors with the goal of placing similar words close to each other in a vector space. Later, new approaches were developed, such as embeddings from language models (ELMo) [12], a deep contextualized word representation model that outperforms Word2Vec. Afterwards, Google introduced the first transformer-based model, BERT (bidirectional encoder representations from transformers) [13], which achieved remarkable results in many tasks and started the trend of large transformer-based models. These models are usually pre-trained on large-scale unlabeled task-independent corpora to learn universal language representations. After BERT, models such as RoBERTa [14] and ELECTRA [15] were introduced as improvements over BERT by incorporating new pre-training methods (see Sect. 3.2).

Although such transformer-based models outperform previous approaches in various NLP tasks, they might struggle when a task corpus is overly focused on a specific domain [16]. In this context, Gururangan et al. [17] investigated whether additional pre-training on domain-specific data can be beneficial. The authors propose domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) and conduct experiments on eight classification tasks from four domains to verify the effectiveness of these approaches. For DAPT, unlabeled domain-specific corpora are used, whereas TAPT utilizes the unlabeled task data. The achieved results show that these approaches can improve pre-trained language models and that the best performance is reached when the two are used in combination. Adapting transformer-based models to a specific domain has already shown success with models such as BioBERT [18], which was initialized from general BERT, further pre-trained on biomedical text such as PubMed data, and thereby able to improve the results on biomedical tasks. One drawback of these approaches is that even though the model is adapted to the domain, the vocabulary does not contain domain-specific words. This leads to such words being split into multiple sub-words, which can hinder the model's learning and degrade its performance. As a possible solution, Souza et al. [19] applied a language-specific adaptation by using a language-specific vocabulary, generated over the target-language text, to train a Brazilian Portuguese model. The model was initialized with multilingual BERT (mBERT), further pre-trained on Brazilian Portuguese text, and was able to improve the performance on a variety of tasks.

GermEval 2017 Data

The data were collected from different internet sources, including social media, microblogs, news, and Q&A sites, over the span of one year (May 2015 - June 2016) and were annotated afterward [2]. The obtained dataset consists of around 26,000 annotated documents, which were randomly split into 80 % training, 10 % development, and 10 % test data. Additional data were collected from November 2016 to January 2017 to create a further test set. The first test set was called synchronic because it originated from the first data collection, whereas the second was created later and was therefore named diachronic. The number of documents in each split is shown in Table 1.

Table 1 Number of documents in each split

For the subtasks, data are available in two file formats: tab-separated values (TSV) and extensible markup language (XML). For this work, only the TSV format is used, which contains the following tab-separated fields: document ID (URL), document text, relevance (true or false), document-level polarity (neutral, positive or negative).
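To make the data handling concrete, the following sketch shows one possible way to load such a TSV split with pandas. The column names and the file name are illustrative assumptions that merely mirror the fields listed above; they are not taken from the official GermEval tooling.

```python
import pandas as pd

# Hypothetical loader for a GermEval 2017 TSV split; the column names and the
# file name are assumptions that mirror the fields described above.
COLUMNS = ["url", "text", "relevance", "sentiment"]

def load_split(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, sep="\t", names=COLUMNS, header=None,
                     quoting=3, dtype=str)  # quoting=3 (QUOTE_NONE) keeps raw tweet text intact
    return df.dropna(subset=["text"])

train_df = load_split("train_v1.4.tsv")  # file name is an assumption
print(train_df["relevance"].value_counts())
print(train_df["sentiment"].value_counts())
```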

Tables 2 and 3 show the distribution of each class in the different data splits for the two subtasks.

Table 2 Relevance distribution in subtask A data
Table 3 Sentiment distribution in subtask B data

Table 4 describes different corpus statistics of the dataset: the count of unique unigrams, bigrams, and trigrams as well as the mean length of the text documents calculated on preprocessed and lowercased data. The applied preprocessing techniques are discussed in Sect. 3.1.

Table 4 GermEval 2017 corpus statistics
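The reported statistics can be approximated with a few lines of Python, as sketched below; whitespace tokenization of the already preprocessed, lowercased documents is an assumption and may differ from the exact counting procedure used for Table 4.

```python
from itertools import islice

def ngrams(tokens, n):
    # Yield consecutive n-grams from a token list.
    return zip(*(islice(tokens, i, None) for i in range(n)))

def corpus_statistics(texts):
    # texts: iterable of preprocessed, lowercased documents (whitespace-tokenized here).
    unique = {1: set(), 2: set(), 3: set()}
    lengths = []
    for doc in texts:
        tokens = doc.split()
        lengths.append(len(tokens))
        for n in unique:
            unique[n].update(ngrams(tokens, n))
    return {
        "unigrams": len(unique[1]),
        "bigrams": len(unique[2]),
        "trigrams": len(unique[3]),
        "mean_length": sum(lengths) / max(len(lengths), 1),
    }

print(corpus_statistics(["die bahn ist verspätet", "der zug fällt aus"]))
```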

Examples from the training dataset for subtask A and subtask B are shown below in Table 5.

Table 5 Examples for document relevance and sentiment

For this work, additional unlabeled German tweets were collected from Twitter with the goal of continuing the pre-training of one of the language models described in Sect. 3.2 using masked language modeling. Similar to the original data, all collected tweets contain the term “bahn” and originate from the period between January 2017 and October 2021. In German, the search term can also refer to words that are not associated with trains or railroads, implying that a portion of the collected data is noisy and does not necessarily belong to the domain of the task. These data were crawled using snscrape and consist of 1,199,280 tweets. Table 6 lists the statistics of the dataset, which are obtained in the same way as those in Table 4.

Table 6 Corpus statistics of the unlabeled tweets collected using the search term “bahn”
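A minimal sketch of how such a crawl can be performed with snscrape is given below. The query string, language filter, date range, and the cap on the number of tweets are assumptions rather than the configuration actually used for this work, and the attribute holding the tweet text may be named differently in other snscrape versions.

```python
import snscrape.modules.twitter as sntwitter

# Illustrative crawl; query, language filter, and cap are assumptions.
query = "bahn lang:de since:2017-01-01 until:2021-10-31"

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 10_000:  # small cap for demonstration; the full crawl yielded ~1.2M tweets
        break
    tweets.append(tweet.content)  # `content` holds the tweet text in the snscrape version assumed here

print(len(tweets))
```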

GermEval 2017 Results

The GermEval 2017 [2] shared task witnessed the participation of 8 teams that used a variety of approaches. All the teams participated in subtask B and 5 of them in subtask A.

Before applying these approaches, the majority of the teams performed thorough data preprocessing, in which, e.g., URLs, hashtags, handles, and emojis were either removed or replaced with special tokens. Additionally, some teams used lemmatizers, part-of-speech taggers, and spell checkers. Punctuation characters were removed by some teams and kept by others, as was capitalization.

To evaluate how well these systems perform on the independent test sets, the micro-averaged F1-Score was used. The F1-Score, short for \(F_{\beta =1}\), is generally defined as follows:

$$\begin{aligned} F_{\beta } = \frac{\left( \beta ^{2}+1\right) \times \text{ Precision } \times \text{ Recall } }{\beta ^{2} \times \text{ Precision } + \text{ Recall } } \end{aligned}$$
(1)

The parameter \(\beta\) is used to control the balance of recall and precision [20]. When using \(\beta = 1\), recall and precision are equally balanced and the formula simplifies to:

$$\begin{aligned} F_{1} = \frac{2 \times \text{ Precision } \times \text{ Recall } }{ \text{ Precision } + \text{ Recall } } \end{aligned}$$
(2)

Since this measure is usually used for binary classification problems and the second subtask is a multi-class problem, micro-averaging of the scores is needed, which aggregates the individual per-document decisions across all classes to compute the average score [21]. This averaging method gives equal weight to each classification decision, so the performance on a large class has a higher impact on the result than that on a small class. The micro-averaged results can thus be seen as a measure of the effectiveness on the large classes, which is preferable when dealing with imbalanced class distributions, as is the case for the GermEval 2017 datasets. The best results from GermEval 2017 as well as from other publications are reported in Table 7.
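For reference, the micro-averaged F1-Score can be computed with scikit-learn as in the short example below; the labels are purely illustrative. In a single-label multi-class setting, the micro-averaged F1-Score coincides with accuracy.

```python
from sklearn.metrics import f1_score

# Micro-averaging aggregates the per-document decisions over all classes,
# so large classes dominate the score (illustrative labels only).
y_true = ["negative", "neutral", "neutral", "positive", "negative"]
y_pred = ["negative", "neutral", "negative", "positive", "neutral"]

print(f1_score(y_true, y_pred, average="micro"))  # 0.6
```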

The winners of the first subtask [22] on the synchronic test set as well as both subtasks on the diachronic set used word and character n-grams for text representation, in combination with feature selection based on information gain and L1-regularization (Lasso) [23]. In addition, they used adaptive synthetic sampling [24] to compensate for imbalances in the class distribution. They performed classification using XGBoost [25], which is a specific implementation of regularized gradient boosted trees.

Naderalvojoud et al. [26] participated only in the second subtask and achieved the best score on the synchronic test set. They developed a model that utilizes three different German sentiment lexicons that are built using the translation of English lexicons, such as SentiWordNet [27] and SentiSpin [28]. In this system, a deep recurrent neural network (RNN) was used to learn contextual sentiment weights and thus to change the polarity of terms depending on the context of their use.

In order to solve subtasks A and B, Hövelmann & Friedrich [29] developed different models and systems. Their best model was based on a fastText classifier [30] enhanced with pre-trained word vectors and reached the second-best score in both subtasks. Furthermore, a gradient boosted trees (GBT) classifier was developed, which was trained on bag-of-words features combined with linguistic inquiry and word count (LIWC) [31] features. In addition, other models were implemented that used Word2Vec embeddings in combination with classifiers such as GBTs or a feedforward multilayer perceptron (MLP).

Sidarenka [32] developed three systems for the second subtask. The first was an SVM classifier trained on a variety of features, including character-level features, word-level features, part-of-speech features, and lexicon features. The second was a bidirectional long short-term memory (Bi-LSTM) network [33] trained using word embeddings. The last system combined the two into an ensemble and achieved the third-best score among the 8 participants.

Other participants tried other approaches such as using different lexica or other word embeddings like GloVe instead of Word2Vec or fastText. In addition, some teams used classifiers like conditional random field (CRF) or a stacked learner [34], which is an ensemble-based method that uses several base classifiers from scikit-learn [35] and a multilayer perceptron as a meta-classifier to combine the predictions of the base classifiers.

After the GermEval 2017 shared task, [36] conducted experiments using a lexicon-based Bi-LSTM model that yielded slightly better results on the sentiment analysis subtask. These results were then outperformed by Biesialska et al. [37], who proposed a transformer-based sentiment analysis (TSA) approach that leverages ELMo contextual embeddings in a model based on the transformer architecture. This approach achieves better results than all reported results for this subtask. In regard to the first subtask, Parcheta et al. [38] experimented with multiple text encoding techniques, such as byte pair encoding (BPE) [39], GloVe, and BERT. To generate the BERT embeddings, they used a small multilingual model trained on 104 different languages. The generated embeddings were then used with different architectures, such as a convolutional neural network (CNN) [40], RNN, LSTM, and gated recurrent units (GRU) [41]. The variety of embeddings and architectures resulted in numerous models that performed better than the winning systems of GermEval 2017. The best model uses a combination of BERT and BPE text encodings with a single-layer CNN implementation. In a recent work [42], the authors re-evaluated the GermEval 2017 subtasks using pre-trained language models and achieved the best reported scores for both tasks using the German BERT model bert-base-german-dbmdz-uncased.

Table 7 Best results on the synchronic and diachronic test sets for subtask A on relevance classification and subtask B published in GermEval 2017 and in other works after the competition

Proposed Method

In this section, the proposed method used to conduct the experiments is described, as illustrated in Fig. 1. This includes the data preprocessing and the variety of models that are used.

Fig. 1 Outline of the proposed method

In the first part of this work, transformer-based language models are fine-tuned on the previously mentioned downstream tasks to investigate how they compare to earlier systems. Then, one of the pre-trained models is further trained using masked language modeling on unlabeled domain data, unlabeled task data, and combinations of both to adapt the language models to the specific domain of the downstream tasks. These models are then experimented with and compared to previous results.

Preprocessing

Before training models and making predictions, the raw data need to be preprocessed to remove the noise in the text. First, duplicates and empty text documents are removed, and punctuation marks are deleted. As an exception, repetitions of question marks, exclamation points, and periods are replaced by the terms “strongquestion”, “strongexclamation”, and “annoyeddots”. Furthermore, URLs and numbers are replaced by the terms “URL” and “number”, whereas other numerical tokens such as money amounts, dates, and times are replaced by “money” and “dates”. Since many documents originate from Twitter, usernames are replaced by “twitterusername”, except for the usernames related to Deutsche Bahn such as @DB_Bahn, @Bahnansagen, or @Bahn_Info, which are pooled by replacing them with “dbusername”. Additionally, hashtags mentioned in tweets are modified by removing the “#” character. Words like “S-Bahn” and “S Bahn” are also combined to the term “sbahn”. Moreover, before all punctuation marks are removed, the emoticons “:(” and “:-(” are replaced by the token “sadsmiley”; “:)”, “:-)”, “;-)”, “:-))” and “:D” by “happysmiley”; and “:-D” and “XD” by “laughingsmiley”. For all other possible emoticons, the term “emote” is used. Finally, whitespace and unicode characters like emojis are removed. Except for the fastText model and the uncased models, no lowercase folding is applied, since all other models are trained on cased data. Removing stop words and replacing German umlauts (“ä”, “Ä”, “ö”, “Ö”, “ü” and “Ü”) as well as ligatures (e.g. “ß”) were briefly tested but did not show any improvements and were not used. Table 8 shows an example of a document before and after applying the mentioned preprocessing techniques. For the domain adaptation using masked language modeling, no preprocessing was applied to the unlabeled data.

Table 8 Example of a document before and after the preprocessing
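The function below is a rough sketch of the described preprocessing. The concrete regular expressions, their ordering, and the omission of some rules (money and date tokens, the generic “emote” placeholder) are simplifications; the original pipeline may differ in detail.

```python
import re

# Twitter handles of Deutsche Bahn that are pooled into a single token.
DB_HANDLES = {"@DB_Bahn", "@Bahnansagen", "@Bahn_Info"}

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", "URL", text)
    for handle in DB_HANDLES:
        text = text.replace(handle, "dbusername")
    text = re.sub(r"@\w+", "twitterusername", text)       # remaining usernames
    text = re.sub(r"#(\w+)", r"\1", text)                 # drop the '#' of hashtags
    text = re.sub(r"[sS][- ][bB]ahn", "sbahn", text)
    text = re.sub(r"\?{2,}", " strongquestion ", text)    # handled before punctuation removal
    text = re.sub(r"!{2,}", " strongexclamation ", text)
    text = re.sub(r"\.{2,}", " annoyeddots ", text)
    text = re.sub(r":-?\(", " sadsmiley ", text)
    text = re.sub(r"(:-?\)+|;-\)|:D)", " happysmiley ", text)
    text = re.sub(r"(:-D|XD)", " laughingsmiley ", text)
    text = re.sub(r"\d+", "number", text)
    text = re.sub(r"[^\w\s]", " ", text)                  # remaining punctuation, emojis, etc.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("@DB_Bahn Die S-Bahn fällt schon wieder aus!!! :( https://t.co/xyz"))
```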

System Description

In an attempt to reach high scores in the first two subtasks, different systems and models were used. For the additional pre-training and fine-tuning of the transformer-based models, an NVIDIA A100 GPU was used. The pre-training time usually varied between two and three hours, whereas the fine-tuning took around 8 minutes for the base models and 20 minutes for the large ones.

As a baseline, a fastText classifier [30] was trained on the preprocessed text. This classifier is constructed with the goal of predicting a class or label instead of a word, which is the case when an unsupervised algorithm like continuous bag-of-words (CBOW) is used to generate word embeddings. In addition to word embeddings, fastText also uses character-level n-grams, which makes it capable of handling morphologically rich languages like German as well as rare or unseen words. Deviating from the default configuration, the dimensionality of the word vectors was set to 50 and the learning rate was initialized with 0.1. For the loss computation, softmax was used, and the number of word n-grams was set to 4. The classifier was then trained for 20 epochs. To ensure the reproducibility of the results, the number of used threads was set to 1. As an additional model, these parameters and the collected tweets were used to generate word embeddings, which were then utilized with the fastText classifier.
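A possible configuration of this baseline with the official fastText Python bindings is sketched below; the training file name and the `__label__` prefix format are assumptions about how the preprocessed data would be prepared.

```python
import fasttext

# Baseline classifier with the hyperparameters reported above. "train.txt" is
# assumed to contain one preprocessed document per line, prefixed with
# "__label__<class>" as required by fastText's supervised mode.
model = fasttext.train_supervised(
    input="train.txt",
    dim=50,          # dimensionality of the word vectors
    lr=0.1,          # initial learning rate
    loss="softmax",
    wordNgrams=4,
    epoch=20,
    thread=1,        # single thread for reproducibility
)

# For the additional variant, word vectors trained on the collected tweets
# could be supplied via the `pretrainedVectors` argument (assumption).
labels, probs = model.predict("die bahn ist schon wieder verspätet")
print(labels, probs)
```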

To further improve the results, transformer-based models were used, starting with BERT [13]. This model was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT language model can be fine-tuned with only one additional output layer to build models that achieve remarkable results on different tasks. The authors published two models: BERT\(_{Base}\) and BERT\(_{Large}\). Details of these models are shown in Table 9. In addition, they also released a multilingual model, which was trained on cased data in 104 languages.

These models were pre-trained using two objectives: masked language modeling and next sentence prediction. In the first, instead of predicting every next token, the model only predicts a percentage of randomly “masked” words in a sentence. The second is a binary classification task in which the model predicts whether the second sentence actually follows the first. Due to the computational expense of training such models from scratch, all used models were already pre-trained by other organizations and made publicly available. The first German model was published by the German company Deepset AI. It was trained on a German Wikipedia dump, court decisions, and news articles. The Digitale Bibliothek Münchener Digitalisierungszentrum (DBMDZ) released two additional German models, cased and uncased, which were trained on a German Wikipedia dump, the European Union Bookshop corpus, OpenSubtitles, and web crawls. The two teams joined forces and released two new BERT models (GBERT\(_{Base}\) and GBERT\(_{Large}\)) [43], which outperform the previously released models and were trained on four different datasets: the German portion of the Open Super-large Crawled ALMAnaCH coRpus (OSCAR) [44], a German Wikipedia dump, The Open Parallel Corpus (OPUS) [45], and Open Legal Data [46]. Additional information about these models and the other models used in the experiments is listed in Table 10.
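As a small illustration of the masked language modeling objective, the fill-mask pipeline below queries one of the publicly available German BERT checkpoints; the checkpoint name and the example sentence are assumptions and are not tied to the experiments in this work.

```python
from transformers import pipeline

# Masked language modeling illustration with a public German BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

for prediction in fill_mask("Die [MASK] ist heute wieder verspätet."):
    print(prediction["token_str"], round(prediction["score"], 3))
```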

After the release of BERT, a Robustly Optimized BERT pre-training approach (RoBERTa) [14] was introduced. It enhances BERT by changing the pre-training procedure: training the model longer, on more data and longer sequences, removing the next sentence prediction objective, and using improved hyperparameters. For this work, GottBERT [47] was utilized, a German RoBERTa\(_{Base}\) model trained on the German part of the OSCAR data.

Table 9 Details of the different types of transformer-based models

In addition to these models, the transformer-based model XLM RoBERTa [48] is used, which is a cross-lingual language model that was trained on 2.5TB of data across 100 languages. XLM RoBERTa outperforms previous multilingual approaches by incorporating more training data and languages, including low-resource languages. Although multiple XLM RoBERTa models are available, only the large model was considered for this work.

Another tested model is ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [15]. This model introduced a different approach to language pre-training, where it uses another task called replaced token detection (RTD). Instead of masking the input by replacing some words with the token “[MASK]” as in BERT, ELECTRA corrupts the input tokens by replacing them with synthetically generated tokens. It then trains the model to distinguish between “real” and “fake” input data. This is achieved using a discriminator that classifies the tokens and a generator which provides plausible fake tokens. Both transformer-based components are trained jointly. In addition to the two BERT models, Chan et al. [43] also released two German ELECTRA models: a base model and a large one. In their benchmarking, the large model reached state-of-the-art performance in three downstream tasks.

To adapt the language models to the task domain, multiple experiments were conducted with GBERT\(_{Large}\). These experiments were based on continued pre-training using masked language modeling on the collected tweets and on combinations of the unlabeled task data and parts of the unlabeled tweets. Additional experiments were conducted by expanding the vocabulary of the pre-trained model with around 20k new words from the unlabeled tweets, where the embeddings of the new tokens are initialized randomly. Moreover, in some experiments 30 % of the tokens were masked during masked language modeling instead of 15 %, since a recent study [49] found that masking more than 15 % of the tokens can be beneficial in some cases.
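These two modifications, vocabulary expansion and a higher masking rate, can be sketched with the Transformers library as follows; the checkpoint name is assumed to be the public GBERT large model, and the short word list merely stands in for the roughly 20k words mined from the tweets.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Assumed public checkpoint for GBERT large.
checkpoint = "deepset/gbert-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Stand-in for the ~20k domain words extracted from the unlabeled tweets;
# the embeddings of the new tokens are initialized randomly by resizing.
new_words = ["schienenersatzverkehr", "zugausfall"]  # illustrative only
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Mask 30 % of the tokens instead of the default 15 % during continued
# masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.3)
```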

Table 10 Details about the used pre-trained models

For the baseline classifier, fastText (version 0.9.1) was used. The training and fine-tuning of transformer-based models were conducted using the Transformers [50] library by Hugging Face. For all models, the same hyperparameters were used (see Table 11). In addition, the continued pre-training of the models on the unlabeled domain and task data was performed for 5 epochs using the same hyperparameters, except for the maximum sequence length, which was set to 512.

Table 11 Hyperparameters used for the transformer-based models
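A minimal fine-tuning sketch with the HuggingFace Trainer is given below. The hyperparameter values are placeholders rather than the ones from Table 11, and the two in-memory documents merely stand in for the actual GermEval splits.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "deepset/gbert-large"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Tiny in-memory stand-in for the subtask B training split (illustrative only).
train_ds = Dataset.from_dict({
    "text": ["Die Bahn ist mal wieder zu spät.", "Super Service im ICE!"],
    "label": [0, 2],
})
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                                max_length=256), batched=True)

args = TrainingArguments(
    output_dir="gbert-germeval",
    num_train_epochs=3,              # placeholder; actual values are listed in Table 11
    per_device_train_batch_size=16,  # placeholder
    learning_rate=2e-5,              # placeholder
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```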

Results and Discussion

Table 12 shows the cross-validation results for both subtasks on the training set. All average scores were obtained using 5-fold cross-validation and are complemented by their respective standard deviations.

Table 12 Results of the average score and standard deviation using 5-fold cross-validation on the training set for subtask A on relevance classification and subtask B on sentiment detection

To analyze the results of the cross-validation and assess how the systems perform against the fastText model, which can be considered a strong baseline, a one-sided Wilcoxon rank-sum test [52] was used. The tests were conducted using the statistical programming language R (version 4.1.2) [53] with the significance level \(\alpha = 0.05\).

Since all p-values of the conducted tests are below the significance level, the null hypothesis can be rejected, which indicates that all systems outperform the fastText model.
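The tests in this work were carried out in R; for illustration, an equivalent one-sided rank-sum test can be expressed in Python with SciPy (version 1.7 or later for the `alternative` argument). The fold scores below are illustrative, not the reported values.

```python
from scipy.stats import ranksums

# One-sided Wilcoxon rank-sum test: does the transformer model score higher
# than the fastText baseline across the 5 cross-validation folds?
gbert_scores = [0.905, 0.901, 0.908, 0.903, 0.906]      # illustrative fold scores
fasttext_scores = [0.885, 0.882, 0.889, 0.884, 0.887]   # illustrative fold scores

stat, p_value = ranksums(gbert_scores, fasttext_scores, alternative="greater")
print(p_value, p_value < 0.05)  # reject H0 at alpha = 0.05 if True
```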

Table 13 shows the results obtained on the test datasets for both subtasks using systems trained on the training and development sets. Based on the scores, all transformer-based systems outperform the baseline model. For the first subtask, the off-the-shelf GBERT\(_{Large}\) improved the micro-averaged F1-Score obtained by the winning system of GermEval 2017 by about +5.6 percentage points on the synchronic test set and by +4.7 percentage points on the diachronic test set. It also outperforms the best previously reported scores [38] by +0.2 and +0.5 percentage points, respectively. The scores of this model were further slightly improved after a second pre-training phase using additional domain data, with the best scores reaching 96.1 % and 95.9 %. Among the off-the-shelf models, GBERT\(_{Large}\) and GELECTRA\(_{Large}\) also reached the best scores on the second subtask. GBERT\(_{Large}\) improves upon the best GermEval systems by a margin of +8.8 percentage points on the first test set, whereas GELECTRA\(_{Large}\) reached a +9.3 percentage point improvement on the second one. Compared to the best score in [42], these models reached scores that are +3.0 percentage points better on the synchronic set and +4.3 percentage points better on the diachronic set. Similar to the results of the first subtask, the highest scores (85.1 % and 85.3 %) were reached when continued pre-training on the domain and task data was applied.

These results indicate that continuing to pre-train language models on domain-specific unlabeled data as well as on the task data usually improves results, as was the case for GBERT\(_{Large}\) (see Table 13). Increasing the masking percentage can also improve the results, but not in every case. Furthermore, adding new domain words to the vocabulary of the pre-trained model did not show any improvements and sometimes led to slightly worse results than the off-the-shelf pre-trained models, which may be a consequence of the random initialization of the new embeddings. Although the continued pre-training showed some improvements, it would probably be even more beneficial with higher-quality data that has been properly selected and filtered.

It is also noteworthy that, despite being a multilingual model, XLM RoBERTa outperforms most of the German base-sized models on both subtasks.

Table 13 Results on the synchronic and diachronic test sets of the different systems trained on the training and development datasets for subtask A on relevance classification and subtask B on sentiment detection, as well as comparison with results from other publications

Although the transformer-based models showed that they can reach remarkable results, one of their disadvantages is that using them is computationally expensive and can require specific hardware such as GPUs, which can become costly when used in production at scale. Moreover, training these models takes considerably more time compared to models such as fastText, which usually needs only a couple of minutes on a single CPU. Besides these disadvantages, there are also some challenges and limitations when dealing with sentiment analysis in social networks. Determining the sentiment of tweets, for example, where the text is usually coupled with hashtags, emojis, and links, can be very difficult [54]. Additionally, analyzing a textual expression from a semantic point of view can be crucial to detecting the underlying sentiment [55]. This is usually not taken into consideration in sentiment analysis, where a sentence is taken just as it is, which can result in wrong interpretations. Furthermore, using a word or phrase that entails an intentional deviation from its literal definition can hinder the detection of the sentiment that is actually expressed. This is the case with sarcasm and irony, which are usually difficult to recognize, not only for machines but also for humans. This difficulty can result in poor performance even for state-of-the-art systems. In addition, collecting data can be challenging, since searching for a specific term can result in collecting unwanted data. For example, the term bahn does not only refer to the train in German but can also refer to a track or anything that can be laid in straight lines, such as Laufbahn (Engl. running track) [2]. This problem can be avoided if the data are carefully filtered beforehand. Another limitation that needs to be considered is the possibility that the provided annotations are not entirely correct, so that some predictions are counted as wrong even though they are actually correct. An example of this case is shown in Table 14, where the document can be considered positive but was mistakenly annotated as negative, and the prediction is therefore treated as a wrong classification.

Table 14 Example for a poorly annotated document

Conclusion

Transformer-based architectures such as BERT, ELECTRA, RoBERTa, and XLM RoBERTa were used in this work for the first two subtasks of GermEval 2017. They showed a remarkable improvement over the results reached when the competition was held, as well as over those reported by subsequent works. Moreover, the conducted experiments revealed that adapting these pre-trained models to the domain using unlabeled task and domain data can further improve the achieved results.

The findings show that continuing to pre-train language models on task data and domain-specific unlabeled data is a concept worth considering whenever the initial language model data are not specific to the intended use case. Such improvements are the reason why these models have gained massive popularity in recent years across different NLP downstream tasks. This success is due to the improved context understanding that heavily benefits from pre-training huge language models on enormous amounts of data. Although the results showed that using domain-specific data leads to improvements over off-the-shelf models, an extended analysis of the used data is needed. Such an analysis can help determine how to best use domain-specific data for additional pre-training and how the quality of the data can improve the models.

In addition to the already mentioned models, other transformer-based models such as ALBERT, XLNet, DeBERTa, and T5 have also been released. Unfortunately, for most of these, no German pre-trained models are available. In future work, these models can be tested to investigate how they compare with the reported systems, assuming that more German models are released. On the other hand, there is already evidence that multilingual and novel transformer architectures can perform similarly well to language-specific and domain-specific language models under certain conditions. Thus, a comparison of novel multilingual architectures, such as modular transformers [56], with the results of domain-adapted models is also a useful direction for future work.