1 Introduction

Part-of-speech (POS) tagging is a standard component in many linguistic processing pipelines, so its performance is likely to affect all subsequent steps in the pipeline, such as morphological analysis or syntactic parsing. In the newswire domain, modern POS taggers reach accuracy scores beyond 97%, close to human performance (Manning 2011). For “non-standard” texts like social media or web texts, however, tagger performance is usually much lower. For the EmpiriST 2015 shared task dataset considered in this paper, Beißwenger et al. (2016) report accuracy scores of 80–82% for off-the-shelf taggers.

One important reason for this decline in accuracy is that datasets which are large enough to train a tagger are typically from the newswire domain. For social media and web texts, no large training sets are available. At the same time, these texts differ substantially from newswire text. They contain a lot of “bad” language (Eisenstein 2013) such as misspellings, phrasal abbreviations or intentional orthographical variations as well as phenomena like contractions or interaction words which are not covered by standard tagsets.

On a technical level, the problem can be traced back, at least to some extent, to out-of-vocabulary (“unknown”) words which do not occur in the training set. Giesbrecht and Evert (2009) observe that typical web texts contain, compared to newswire texts, more unknown words, and that tagger performance on unknown words is much lower. We make similar observations for the dataset considered in this paper.

One way to address this problem is to add small amounts of manually annotated in-domain data to existing (out-of-domain) training sets when training the tagger. For German, this approach has been explored by Horbach et al. (2014) and Neunerdt et al. (2014). The approach is appealing, as it is conceptually very simple, easy to implement and quite effective. Yet, it can only address part of the problem, as many words remain out-of-vocabulary. Another approach is to exploit distributional similarity information about unknown words. The underlying observation is that distributionally similar words tend to belong to the same lexical class, so POS information of out-of-vocabulary words can be derived from distributionally similar in-vocabulary words (Schütze 1995). Several approaches to POS tagging of various kinds of non-standard texts that exploit this idea have been proposed in the past few years. Gimpel et al. (2011) train a CRF-based tagger using features derived from a reduced co-occurrence matrix; Owoputi et al. (2013), Ritter et al. (2011) and Rehbein (2013) use clustering to derive features to train a discriminative tagger model. Prange et al. (2015) use distributional similarity information to learn a POS lexicon for out-of-vocabulary tokens, and combine it with a Hidden Markov Model (HMM) based tagger.

In this paper, we present an approach that is conceptually similar to that of Prange et al. (2015) but which uses distributional similarity information to estimate emission probabilities of the HMM, rather than deriving an external POS lexicon. Results on the EmpiriST 2015 shared task dataset (Beißwenger et al. 2016) show that our approach improves accuracy on out-of-vocabulary words by up to 5.8%; overall, we improve the state of the art by 0.4%, to 90.9% accuracy.

2 Model

We briefly present the underlying tagger model in Sect. 2.1 before presenting our distributional approach to estimating lexical probabilities for out-of-vocabulary tokens in Sect. 2.2. Section 2.3 describes the lookup procedure implemented by the tagger.

2.1 Baseline Model

We use a second-order Hidden Markov Model to implement our baseline tagger. To tag a given input sequence \(w_1\ldots w_n\) of words, we calculate

$$ \mathop {\mathrm {arg\,max}}_{t_1,\ldots ,t_n}\left[ \prod _{i=1}^n P(t_i\mid t_{i-1},t_{i-2}) P(w_i\mid t_i) \right] P(t_{n+1} \mid t_n) $$

where \(t_1 \ldots t_n\) are elements of the tagset and \(t_{{-}1}, t_0\) and \(t_{n+1}\) are additional tags marking the beginning and the end of the sequence, respectively.

Our implementation closely follows Brants (2000). Transition probabilities \(P(t_i\mid t_{i-1},t_{i-2})\) are computed as a linear combination of unigram, bigram and trigram probabilities, which are estimated from a tagged training corpus using maximum likelihood. For tokens that occur in the training corpus, we estimate emission probabilities \(P(w_i \mid t_i)\) using maximum likelihood; for out-of-vocabulary tokens, emission probabilities are estimated from the word’s suffix. Our implementation differs slightly from Brants (2000) in that we use, for purely practical reasons, a maximal suffix length of 5 instead of 10 when computing suffix distributions, and that we do not maintain separate suffix distributions for uppercase and lowercase words.
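
To make the estimation step concrete, the following minimal sketch shows how interpolated transition probabilities and the suffix fallback for emissions could be computed. It uses fixed interpolation weights and illustrative function names; it is not the implementation used in the paper, which, following Brants (2000), tunes the weights by deleted interpolation.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences, max_suffix_len=5):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    uni, bi, tri = Counter(), Counter(), Counter()
    emission = defaultdict(Counter)   # tag -> word counts
    suffix = defaultdict(Counter)     # suffix -> tag counts (no case split, as described above)
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent] + ["</s>"]
        for i in range(len(tags)):
            uni[tags[i]] += 1
            if i >= 1:
                bi[(tags[i-1], tags[i])] += 1
            if i >= 2:
                tri[(tags[i-2], tags[i-1], tags[i])] += 1
        for w, t in sent:
            emission[t][w] += 1
            for k in range(1, min(max_suffix_len, len(w)) + 1):
                suffix[w[-k:]][t] += 1

    n = sum(uni.values())
    l1, l2, l3 = 0.1, 0.3, 0.6        # fixed weights for illustration only

    def p_trans(t, t1, t2):
        p1 = uni[t] / n if n else 0.0
        p2 = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
        p3 = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    def p_emit(w, t):
        if emission[t][w]:
            return emission[t][w] / uni[t]          # maximum likelihood for known words
        for k in range(min(max_suffix_len, len(w)), 0, -1):
            s = w[-k:]
            if suffix[s]:                           # longest known suffix
                # convert P(t|suffix) into an emission score via Bayes;
                # the constant P(w) is irrelevant for the argmax
                return (suffix[s][t] / sum(suffix[s].values())) / max(uni[t] / n, 1e-12)
        return 1e-8

    return p_trans, p_emit
```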

2.2 Distributional Smoothing

We use a large, automatically POS-tagged corpus and estimate \(P(w\mid t)\) for an out-of-vocabulary word w by considering all contexts in which w occurs in that corpus, basing the estimate on the emission probabilities of all in-vocabulary words \(w'\) that occur in the same contexts as w. We set:

$$\begin{aligned} P(t\mid w) = \sum _{w'}\sum _{C} P(t\mid w') \, P(w'\mid C) \, P(C\mid w) \end{aligned}$$
(1)

where \(w'\) ranges over all in-vocabulary words in the manually annotated training corpus used to train the baseline model and C ranges over all n-grams consisting of the POS tags of the two words on either side of an unknown word w in the automatically tagged corpus. \(P(t\mid w')\) is the probability of a tag t of an in-vocabulary word \(w'\), \(P(w'\mid C)\) is the probability that \(w'\) occurs in a given context C and \(P(C\mid w)\) is the probability of context C given an out-of-vocabulary word w. The probabilities are estimated on the automatically tagged corpus using maximum likelihood. Following recommendations by Prange et al. (2015), we consider only contexts in which the two surrounding words are in-vocabulary; the idea is that in-vocabulary tokens are tagged with much higher precision and thus give us more reliable context information.
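
The following sketch illustrates how (1) can be computed from a large automatically tagged corpus. The concrete context definition used here, the tag pair of the immediate left and right neighbour with both neighbouring words required to be in-vocabulary, is our reading of the description above; all names are illustrative, and the code is not the paper’s implementation.

```python
from collections import Counter, defaultdict

def distributional_tag_probs(auto_tagged_sents, vocab, p_tag_given_word):
    """auto_tagged_sents: automatically tagged sentences as (word, tag) pairs;
    vocab: set of in-vocabulary words;
    p_tag_given_word: dict word -> {tag: P(t|w')} from the manually annotated data."""
    ctx_given_word = defaultdict(Counter)   # w -> context counts, for P(C|w)
    word_given_ctx = defaultdict(Counter)   # C -> in-vocabulary word counts, for P(w'|C)
    for sent in auto_tagged_sents:
        for i in range(1, len(sent) - 1):
            (lw, lt), (w, _), (rw, rt) = sent[i-1], sent[i], sent[i+1]
            if lw not in vocab or rw not in vocab:
                continue                    # keep only contexts tagged reliably
            ctx = (lt, rt)
            if w in vocab:
                word_given_ctx[ctx][w] += 1
            else:
                ctx_given_word[w][ctx] += 1

    def p_t_given_w(t, w):
        total_ctx = sum(ctx_given_word[w].values())
        if not total_ctx:
            return 0.0
        prob = 0.0
        for ctx, c_cnt in ctx_given_word[w].items():
            p_c_w = c_cnt / total_ctx                       # P(C|w)
            ctx_total = sum(word_given_ctx[ctx].values())
            if not ctx_total:
                continue
            for w2, w2_cnt in word_given_ctx[ctx].items():
                p_w2_c = w2_cnt / ctx_total                 # P(w'|C)
                prob += p_tag_given_word.get(w2, {}).get(t, 0.0) * p_w2_c * p_c_w
        return prob

    return p_t_given_w
```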

While using (1) to estimate emission probabilities of out-of-vocabulary tokens improves tagger performance beyond the baseline model, (1) is still somewhat noisy. We further improve tagger performance by combining (1) with a second distribution \(P(t\mid w)\) which estimates the probability of a tag t of an unknown word w based on the suffix of w. In principle, we could simply use the corresponding distribution of the baseline tagger, but it turns out that the following approach works much better:

$$\begin{aligned} P(t\mid w) = \sum _{w'}\sum _{s} P(t\mid w')\,P(w'\mid s)\,P(s\mid w) \end{aligned}$$
(2)

where s ranges over all possible suffixes. The distributions \(P(s\mid w)\) and \(P(w'\mid s)\) are estimated on the type level, i.e., \(P(s\mid w)=1\) if s is a suffix of w, 0 otherwise, and \(P(w' \mid s) = \frac{1}{n}\), where n is the number of types with suffix s.
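
A corresponding sketch of (2) is given below. Whether the same maximal suffix length as in the baseline tagger applies here is an assumption on our part, as are all names.

```python
from collections import defaultdict

def suffix_tag_probs(train_types, p_tag_given_word, max_suffix_len=5):
    """train_types: in-vocabulary word types; p_tag_given_word: dict word -> {tag: P(t|w')}."""
    types_by_suffix = defaultdict(set)
    for w2 in train_types:
        for k in range(1, min(max_suffix_len, len(w2)) + 1):
            types_by_suffix[w2[-k:]].add(w2)

    def p_t_given_w(t, w):
        prob = 0.0
        for k in range(1, min(max_suffix_len, len(w)) + 1):
            s = w[-k:]                              # P(s|w) = 1 for every suffix s of w
            types = types_by_suffix.get(s)
            if not types:
                continue
            p_w2_s = 1.0 / len(types)               # P(w'|s): uniform over types with suffix s
            for w2 in types:
                prob += p_tag_given_word[w2].get(t, 0.0) * p_w2_s
        return prob

    return p_t_given_w
```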

We combine (1) and (2) using multiplication, re-normalize the result and apply Bayes’ theorem to obtain the final emission probabilities \(P(w\mid t)\).
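
The combination step might then look as follows; the tag prior \(P(t)\) used for Bayes’ theorem and the fallback behaviour when both distributions assign zero mass are assumptions on our part.

```python
def combined_emission(w, tagset, p1, p2, p_tag):
    """p1, p2: functions (t, w) -> probability, implementing (1) and (2);
    p_tag: dict tag -> P(t), estimated on the training corpus."""
    joint = {t: p1(t, w) * p2(t, w) for t in tagset}    # combine (1) and (2) by multiplication
    z = sum(joint.values())
    if z == 0.0:
        return {}                                        # caller falls back to the suffix model
    # re-normalize and apply Bayes' theorem: P(w|t) is proportional to P(t|w) / P(t)
    return {t: (joint[t] / z) / p_tag[t] for t in tagset if p_tag.get(t, 0.0) > 0}
```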

2.3 Lookup

Our tagger implements the following lookup strategy: When reading in a token w, we first try to look up w in the lexicon; if that fails, we redo the lookup with w mapped to lower case; if that fails, we consult the distributional lexicon; as a fallback, we use the suffix lexicon of the baseline tagger.
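
Expressed as code, the cascade could be sketched as follows (all names are illustrative):

```python
def lookup(w, lexicon, dist_lexicon, suffix_model, tagset):
    """Returns a dict tag -> emission score for token w."""
    if w in lexicon:
        return lexicon[w]                  # 1. exact lexicon lookup
    if w.lower() in lexicon:
        return lexicon[w.lower()]          # 2. lowercased lookup
    if w in dist_lexicon:
        return dist_lexicon[w]             # 3. distributional smoothing (Sect. 2.2)
    return suffix_model(w, tagset)         # 4. suffix fallback of the baseline tagger
```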

We follow common practice and normalize all numerical expressions (sequences of digits) into a single token type. To improve tagger performance on social media texts, we additionally normalize all tokens beginning with an “@” or “#”.
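
A possible normalization step is sketched below; the placeholder strings and the exact regular expression for numbers are our assumptions, not those of the paper.

```python
import re

def normalize(token):
    if re.fullmatch(r"\d+", token):        # sequences of digits
        return "<NUM>"
    if token.startswith("@"):              # addressing terms in social media
        return "<AT>"
    if token.startswith("#"):              # hashtags
        return "<HASH>"
    return token
```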

3 Evaluation

We evaluate our approach on the dataset of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication and social media (Beißwenger et al. 2016) and use the two best-performing shared task systems as baselines for comparison.

3.1 Datasets

EmpiriST. This dataset has been provided by the EmpiriST 2015 shared task. It has been compiled from data samples considered representative of two types of corpus data. The CMC subset consists of selections of microposts from Twitter, a subset of the Dortmund Chat Corpus (Beißwenger 2013), threads from Wikipedia talk pages, WhatsApp interactions and blog comments. The Web subset consists of selections from websites and blogs covering various genres and topics like hobbies and travel, Wikipedia articles on topics like biology and botany, and Wikinews articles on topics like IT security and ecology. The dataset is split into two parts, one for training and one for testing. The CMC subset consists of 5109 tokens for training and 5234 tokens for testing; the Web subset consists of 4944 tokens for training and 7568 tokens for testing.

The dataset has been annotated using the “STTS IBK” tagset (Beißwenger et al. 2015), which is based on the STTS tagset (Schiller et al. 1999). STTS is the standard tagset for German. It distinguishes 11 parts of speech which are subdivided into 54 subcategories. STTS IBK adds 16 new tags for phenomena that occur frequently in social media texts, such as interaction words, addressing terms or contractions.

Schreibgebrauch. This dataset has been provided by Horbach et al. (2015) and has been used as additional in-domain training data by the best-performing system of the EmpiriST shared task (Prange et al. 2016). It consists of manual annotations of forum posts from the German online cooking community http://www.chefkoch.de, a subset of the Dortmund Chat Corpus and microposts from Twitter. In total, the annotated dataset consists of 34 173 tokens. Since the dataset has been annotated with a tagset that differs in some details from STTS IBK, Prange et al. (2016) re-annotated it so that it matches the annotation scheme and guidelines of the shared task. We use the re-annotated version in our experiments.

We also use the complete Chefkoch corpus from which the annotated subset was selected to estimate lexical probabilities of out-of-vocabulary tokens. The corpus contains 470M tokens and covers a relatively broad range of everyday topics.

TIGER. The TIGER corpus (Brants et al. 2004) is one of the standard corpora used for German POS tagging. It consists of 888 238 tokens which have been semi-automatically annotated with POS information, using the standard STTS tagset.

3.2 Experimental Setup

We train two different models: The TE model is trained on a combination of the TIGER corpus and the EmpiriST training set. The TES model additionally uses the Schreibgebrauch dataset. Since the two in-domain datasets are very small compared to TIGER, we follow Prange et al. (2016) and oversample them by a factor of 5. We automatically annotate the Chefkoch corpus using each of the two tagger models to estimate emission probabilities for out-of-vocabulary words as described in Sect. 2.2.

Fig. 1. Accuracy comparison for different configurations of our tagger and the two best performing shared task models on the EmpiriST test set.

3.3 Results

Figure 1 shows the results of our approach on the EmpiriST evaluation dataset. We consider two different configurations for each of our two models: TE/BL and TES/BL use the suffix-based emission probabilities of the baseline tagger for out-of-vocabulary tokens, while TE/DS and TES/DS use distributional smoothing. To put the results into perspective, we compare our models to two state-of-the-art approaches: UdS refers to the system of Prange et al. (2016), which performed best in the EmpiriST shared task. The tagger is based on a Hidden Markov Model trained on EmpiriST, Schreibgebrauch and TIGER and uses distributional information obtained from the Chefkoch corpus to automatically learn a POS dictionary. UDE refers to the system of Horsmann and Zesch (2016). The tagger is based on Conditional Random Fields (CRFs) trained on EmpiriST and TIGER and was the best system in the shared task that does not use any in-domain data beyond the training data provided by the shared task. In addition to standard features of a CRF-based tagger, the system uses word cluster information from Twitter messages, a POS lexicon and a morphological lexicon.

We compare our TE model to the UDE system and our TES model to the UdS system. Figure 1 shows that already our baseline configurations outperform the state of the art (except UdS on Web). This is particularly surprising when comparing TES to UdS on CMC, since both models are based on trigram HMMs trained on the same datasets. To some extent, the difference can be explained by our use of simple patterns for @- and #-expressions, but we note that even without these patterns our basic tagger still outperforms UdS on CMC by 0.2%.

We also see that distributional smoothing is effective across all four configurations. On the CMC subset, the performance gain is substantially larger for the TES model than for the TE model (+0.49 vs. +0.30). This is to be expected, since the emission probabilities are derived from an automatically annotated corpus, which is tagged with higher accuracy when the TES model is used. On the Web subset, the performance gains are even larger. Here, the gain is slightly smaller for the TES model (+0.62) than for the TE model (+0.75), which can be explained by the fact that the TES model already performs better than the TE model on out-of-vocabulary items; see Sect. 3.4 below for details.

Overall, our tagger improves on the state of the art: our best configuration (TES/DS) outperforms the previous best system by 0.42% accuracy.

3.4 Performance on Unknown Words

In a second experiment, we investigate the performance of our distributional smoothing approach in more detail. We split the test set into three parts—in-vocabulary tokens (IV), out-of-vocabulary tokens covered by our distributional smoothing approach (OOV/DS) and out-of-vocabulary tokens which do not occur in the Chefkoch corpus and are thus dealt with using suffix probabilities only (OOV/BL)—and measure accuracy of our models on these three subsets separately. Figure 2 shows, for each of the three subsets, the number of tokens in the subset, the performance of the DS models and the performance gain of the DS models over the corresponding BL models, for both TE and TES. We see that distributional smoothing is very effective and improves accuracy over the baseline by 7–8%, except for the TE model on the CMC subset where we obtain only a moderate improvement of approx. 3%. Overall, the improvement over the baseline is 5.1% (TE) and 5.8% (TES) on all out-of-vocabulary tokens.
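
For reference, the three-way split and the per-group accuracies could be computed as in the following sketch; the lexicon structures and names are illustrative and not the evaluation code used in the paper.

```python
def split_and_score(gold, predicted, train_vocab, chefkoch_vocab):
    """gold, predicted: parallel lists of (word, tag) pairs; returns accuracy per group."""
    hits = {"IV": [0, 0], "OOV/DS": [0, 0], "OOV/BL": [0, 0]}
    for (w, gold_tag), (_, pred_tag) in zip(gold, predicted):
        if w in train_vocab:
            group = "IV"                   # in-vocabulary token
        elif w in chefkoch_vocab:
            group = "OOV/DS"               # covered by distributional smoothing
        else:
            group = "OOV/BL"               # suffix probabilities only
        hits[group][0] += int(gold_tag == pred_tag)
        hits[group][1] += 1
    return {g: (correct / total if total else 0.0) for g, (correct, total) in hits.items()}
```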

Fig. 2. Accuracy comparison of the DS and BL models for in- (IV) and out-of-vocabulary (OOV) tokens on the CMC and the Web subset. The rows give, for each group, the number of tokens, the accuracy of the DS model and the accuracy gain of the DS model over the BL model.

4 Conclusions

In this paper, we presented work on part-of-speech tagging of German social media and web texts using a fine-grained tagset. Our tagger is based on a simple trigram Hidden Markov Model, which we extend with a distributional approach to estimating emission probabilities of out-of-vocabulary tokens. While technically very simple, our tagger is very effective and outperforms, or comes very close to, state-of-the-art systems even in the baseline configuration without distributional smoothing. Using distributional smoothing improves accuracy on out-of-vocabulary tokens by up to 5.8%. Overall, we improve the state of the art by 0.4%, to 90.9% accuracy.