In-Memory Distributed Training of Linear-Chain Conditional Random Fields with an Application to Fine-Grained Named Entity Recognition
Recognizing fine-grained named entities, i.e., street and city instead of just the coarse type location, has been shown to increase task performance in several contexts. Fine-grained types, however, amplify the problem of data sparsity during training, which is why larger amounts of training data are needed. In this contribution we address scalability issues caused by the larger training sets. We distribute and parallelize feature extraction and parameter estimation in linear-chain conditional random fields, which are a popular choice for sequence labeling tasks such as named entity recognition (NER) and part of speech (POS) tagging. To this end, we employ the parallel stream processing framework Apache Flink which supports in-memory distributed iterations. Due to this feature, contrary to prior approaches, our system becomes iteration-aware during gradient descent. We experimentally demonstrate the scalability of our approach and also validate the parameters learned during distributed training in a fine-grained NER task.
Fine-grained named entity recognition and typing has recently attracted much interest, as NLP applications increasingly require domain- and topic-specific entity recognition beyond standard, coarse types such as persons, organizations and locations (Ling and Weld 2012; Del Corro et al. 2015; Abhishek et al. 2017). In NLP tasks such as relation extraction or question answering, using fine-grained types for entities can significantly increase task performance (Ling and Weld 2012; Koch et al. 2014; Dong et al. 2015). At the same time, freely-available, large-scale knowledge bases, such as Freebase (Bollacker et al. 2008), DBpedia (Auer et al. 2007) and Microsoft’s Concept Graph (Wang et al. 2015) provide rich entity type taxonomies for labeling entities. However, training models for fine-grained NER requires large amounts of training data in order to overcome data sparsity issues (e.g., for low-frequency categories or features), as well as labeling noise, e.g., as introduced by training datasets created with distant supervision (Plank et al. 2014; Abhishek et al. 2017). Furthermore, the diversity of entity type taxonomies and application scenarios often requires the frequent adaptation or re-training of models. The speed and efficiency with which we can (re-)train models thus becomes a major criterion for selecting learning algorithms, if we want to fully make use of these larger datasets and richer type taxonomies.
Linear-chain CRFs (Lafferty et al. 2001) are a very popular approach to solve sequence labeling tasks such as NER (Strauss et al. 2016). Parameter estimation in CRFs is typically performed in a supervised manner. Training, however, is time-consuming with larger datasets and many features or labels. For instance, it took more than three days to train a part-of-speech tagging model (45 labels, around 500k parameters) with less than 1 million training tokens on a 2.4 GHz Intel Xeon machine, Sutton and McCallum (2011) report. This is due to the fact that during training, linear-chain CRFs require to perform inference for each training sequence at each iteration.
Fortunately, linear-chain CRFs hold potential for parallelization. During gradient descent optimization it is possible to compute local gradients on subsets of the training data which then need to be accumulated into a global gradient. Li et al. (2015) recently demonstrated this approach by parallelizing model training within the MapReduce framework (Dean and Ghemawat 2008). The authors distributed subsets of the training data among the mappers of their cluster, which computed local gradients in a map phase. The local gradients were then accumulated into a global gradient in a subsequent reduce step. The map and reduce steps can be repeated until convergence, using the global gradient to update the model parameters at each iteration step. For large data sets, NER experiments showed that their approach improves performance in terms of run times. However, for each learning step, their system invokes a new Hadoop job, which is very time-consuming due to JVM startup times and disk IO for re-reading the training data. As the authors themselves point out, in-memory strategies would be much more efficient.
In this paper, we employ a very similar parallelization approach as Li et al., but implement the training within an efficient, iteration-aware distributed processing framework. The framework we choose allows us to efficiently store model parameters and other pre-computed data in memory, in order to keep the de/serialization overhead across iterations to a minimum (Alexandrov et al. 2014; Ewen et al. 2013).
a proof-of-concept implementation of a distributed, iteration-aware linear-chain CRF training (Sect. 3),
In what follows, we first define linear-chain CRFs more formally and explain in detail how parameter estimation can be parallelized. We then discuss the details of our implementation, followed by several experimental evaluations.
2 Parallelization of Conditional Random Fields
We used the global gradient to update the current model parameters at each iteration. The information flow is depicted in Fig. 1. As can be seen, before the first iteration, we also distributed feature extraction among the mappers.
Our system marries two powerful tools, the probabilistic modeling library FACTORIE1 (McCallum et al. 2009) and the parallel processing engine Apache Flink2 (Alexandrov et al. 2014). It inherits features and functions from both tools.
The authors of FACTORIE convincingly promote it as a tool which preserves the ‘traditional, declarative, statistical semantics of factor graphs while allowing imperative definitions of the model structure and operation.’ Furthermore, Passos et al. (2013) compared FACTORIE’s performance with established libraries such as scikit-learn, MALLET and CRFSuite and found that it is competitive in terms of accuracy and efficiency.
We distributed the model we implemented in FACTORIE with the help of Apache Flink. Flink provides primitives for massively parallel iterations and when compiling a distributed program which contains iterations, it analyses the data flow, identifies iteration-invariant parts and caches them to prevent unnecessary recomputations, Ewen et al. (2013) explain. Thus, contrary to prior approaches, due to Flink, our distributed system becomes ‘iteration-aware’.
FACTORIE already supported local thread-level parallelism as well as distributed hyper-parameter optimization. Nonetheless, we had to overcome several obstacles when we ported the library into the cluster. For instance, in FACTORIE, object hash identities are used to map gradient tensors onto corresponding weight tensors during training. These identities get lost when an object is serialized in one JVM and deserialized in another JVM. To preserve identities throughout de/serialization among the virtual machines within the cluster, we cached relevant object hashes. We thus ended up using a slightly modified library.
4 Scalability Experiments
We tested our system on a NER task with seven types (including the default type). We compared our distributed parallel system with a local sequential counterpart in which we removed all Flink directives. In both versions our model consisted of a label-label factor and an observation-label factor3. During training, we used a likelihood objective, a belief propagation inference method which was tailored to linear chains and a constant step-size optimizer; all of which FACTORIE’s modular design allows to plug in easily.
the level of parallelism,
the amount of training instances, and
the number of parameters, K.
Points (2)–(3) were varied for both the local version and the distributed version. When we tested the local version we kept the number of participating computational nodes constant at one. In particular, no local thread parallelism was allowed, which is why point one does not apply to the local version.
All distributed experiments were conducted on an Apache Hadoop YARN cluster consisting of four computers (+1 master node). The local experiments were carried out using the master node. Two of the computers were running Intel Xeon CPUs E5-2630L v3 @ 1.80 GHz with 8 cores, 16 threads and 20 MB cache (as was the master node), while the other two computers were running Intel Xeon CPUs E5-2630 v3 @ 2.40 GHz again with 8 cores, 16 threads and 20 MB cache.
Each yarn task manager was assigned 8 GB of memory, of which a fraction of 30% was reserved for Flink’s internal memory management. We used the Xmx option on the master node (with a total of 32 GB RAM). The nodes in the cluster thus had slightly less RAM available for the actual task than the master node. However, as a general purpose computer, the master node was also carrying out other tasks. We observed a maximal fluctuation of 1.7% (470 s vs. 478 s) for the same task carried out on different days. Loading the data from local files and balancing it between the mappers in the cluster was considered part of the training, as was feature extraction.
5 Scalability Evaluation
We first performed several sanity checks. For example, we made sure that multiple physical machines were involved during parameter estimation and that disjoint chunks of the training data reached the different machines. We also checked that the gradients computed by the mappers differed and that they were accumulated correctly during the reduce phase.
The most convincing fact that led us to believe that we correctly distributed feature extraction and parameter estimation was that after we trained the local version and the distributed version using the same training set – with just a few parameters and for just a few iterations – extremely similar parameters were in place. Consequently, the two models predicted identical labels on the same test set containing 5k tokens. The parameters diverge the more features are used and the more training steps are taken. We suspect that this is due to floating point imprecisions that result in different gradients at some point.
Figure 3 shows to what extent we were able to counterbalance an increase in training size with an increase in parallelism. The results suggest that our model was indeed able to dampen the effect of increasing amounts of training examples. The average rate of change in execution times was higher when we kept the number of nodes constant. As we doubled the level of parallelism along with the training size, the rate of change reduced significantly. We also compared the distributed implementation with the local implementation in Fig. 3. As can be seen, the average rate of change is higher for the local version than for the distributed version with an increasing level of parallelism. However, it is still much lower when compared to the distributed runs with a fixed level of parallelism.
6 Accuracy Experiments
The sections above address the scalability of our approach. In this section we report on experiments which demonstrate that our distributed linear-chain CRF learns meaningful parameters. We tested our model in a NER task.
The task was to recognize six fine-grained geospatial concepts in German texts, namely streets (‘Berliner Straße’), cities (‘Berlin’), public transportation hubs (‘Berliner Straße’), routes (‘U6’), distances (‘5 km’) and the super-type location (‘Germany’). The task involves the typical challenges of NER, such as disambiguation. Furthermore, the training sets (which were annotated by trained linguists) contained user-generated content, which is why noise was also an issue.
Table 1 characterizes the two datasets and explains what we refer to as noise. The RSS dataset consists of a sample of traffic reports crawled from more than 100 RSS feeds that provide traffic and transportation information about road blocks, construction sites, traffic jams, or rail replacement services. Feed sources include federal and state police, radio stations, Deutsche Bahn, and air travel sources. Traffic reports are typically very brief, may be semi-structured (e.g., location, cause and length of a traffic jam), and often contain telegraph-style sentences or phrases. The Twitter dataset consists of a sample of German-language tweets that were retrieved via the Twitter search API using a list of approximately 150 domain-relevant users/channels and 300 search terms. Channels include, e.g., airline companies, traffic information sources, and railway companies. Search terms comprise event-related keywords such as “traffic jam” or “roadworks”, but also major highway names, railway route identifiers, and airport codes. Both datasets therefore consist of documents which contain traffic- and mobility-related information that refer to the fine-grained location types defined previously.
Besides the well-established features in NER (e.g., word shape, affixes) our application (‘Locator’) also considered task specific features and took measures towards text normalization. In the end, a larger number of parameters (100k–150k) was in place than during the scalability experiments.
We again used the FACTORIE components listed in Sect. 4, such as the BP method for chains and a constant step size optimizer. FACTORIE provides more sophisticated optimizers such as LBFGS or Adagrad. In our current system, however, only the model parameters survive a Flink-iteration step but methods like LBFGS and Adagrad need further information about past update steps.
Datasets. Size and noise. We refer to noise as the percentage of tokens that the Enchant v 1.6.0 Myspell de_DE dictionary did not recognize.
| || |
To put the performance of our model into perspective, we also conducted a ten-fold cross-validation using the Stanford NER (v. 3.6.0) in its standard configuration4. The Stanford NER used the same tokenizer as our system.
7 Accuracy Evaluation
The results of our accuracy experiments are summarized in Table 2. The F-scores achieved on the Twitter dataset and the scores achieved on the RSS dataset reveal similar trends for both systems: In both cases, the RSS-score is higher than the Twitter-score.
Our distributed model slightly outperforms the Stanford NER on the Twitter dataset but is beaten on the RSS dataset. Since the Twitter dataset is noisier than the RSS dataset, we suspect that the task-specific features and text normalization methods of our system have a greater impact in this case.
Results of 10-fold NER experiments. Classification performance was evaluated on token level so that multiple-token spans resulted in multiple true positives or false negatives, for instance. To compensate class imbalances, for each fold, we weighted the fine-grained scores (i.e., precision (P), recall (R) and F1-score (F1) of the entity ‘street’) by the support of the entity in the test set and averaged over all fine-grained scores. The listed scores are averages over the ten fold scores.
| || |
| || |
8 Discussion and Conclusion
We distributed and parallelized feature extraction and parameter estimation in linear-chain CRFs. The sequence labeling experiments we conducted suggest that our system learns meaningful parameters and is able to counterbalance growing amounts of training data with an increase in the level of parallelism (see Table 2 and Figs. 2 and 3). We reached speedups greater than one and F-scores comparable to the ones produced by a state-of-the-art approach.
To achieve this, we combined the parallel processing engine Apache Flink with the probabilistic modeling library FACTORIE. Our proof-of-concept implementation now inherits functions and features from both tools.
Contrary to prior approaches, for instance, it is iteration-aware during distributed gradient descent. In addition, our system also benefits from FACTORIE’s modular design and rich pool of functions. With little programming effort it is possible to plug in alternative choices for an optimizer or an inference method.
There is, however, room for improvement. The choice for an optimizer, for instance, is restricted by the fact that currently, only the model parameters survive an iteration. But some optimization procedures that FACTORIE provides, like LBFGS, require additional information about past updates. Enhancing the system to provide this feature remains future work.
Furthermore, the increase in runtime in Fig. 4 seems disproportionate. Working with sparse vectors to reduce the amount of data that needs to be cached will most likely reduce runtime. There might also be a serialization bottleneck. Registering customized serializers for FACTORIE’s types with Flink may thus also improve performance. Fortunately, the number of features is typically fixed at some point in most settings. At this point the amount of available training data and the number of mappers in the cluster determine from when on our approach pays off.
This research was partially supported by the German Federal Ministry of Economics and Energy (BMWi) through the projects SD4M (01MD15007B) and SDW (01MD15010A), and by the German Federal Ministry of Education and Research (BMBF) through the project BBDC (01IS14013E).
- Abhishek, A., Anand, A., Awekar, A.: Fine-grained entity type classification by jointly learning representations and label embeddings. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 797–807. Association for Computational Linguistics, Valencia, April 2017. http://www.aclweb.org/anthology/E17-1075
- Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, pp. 1247–1250. ACM, New York (2008). https://doi.org/10.1145/1376616.1376746. http://18.104.22.168/w/images/9/98/SC17.pdf. ISBN 978-1-60558-102-6
- Del Corro, L., Abujabal, A., Gemulla, R., Weikum, G.: Finet: context-aware fine-grained named entity typing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 868–878. Association for Computational Linguistics, Lisbon, September 2015. http://aclweb.org/anthology/D15-1103
- Dong, L., Wei, F., Sun, H., Zhou, M., Xu, K.: A hybrid neural model for type classification of entity mentions. In: Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI 2015, pp. 1243–1249. AAAI Press, Buenos Aires (2015). http://dl.acm.org/citation.cfm?id=2832415.2832422. ISBN 978-1-57735-738-4
- Ewen, S., Schelter, S., Tzoumas, K., Warneke, D., Markl, V.: Iterative parallel data processing with stratosphere: an inside look. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1053–1056. ACM (2013)Google Scholar
- Koch, M., Gilmer, J., Soderland, S., Weld, D.S.: Type-aware distantly supervised relation extraction with linked arguments. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1891–1901. Association for Computational Linguistics, Doha, October 2014. http://www.aclweb.org/anthology/D14-1203
- Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the ICML, vol. 1, pp. 282–289 (2001)Google Scholar
- Ling, X., Weld, D.: Fine-grained entity recognition. In: Proceedings of AAAI 2012 (2012). http://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5152
- McCallum, A., Schultz, K., Singh, S.: Factorie: probabilistic programming via imperatively defined factor graphs. In: Advances in Neural Information Processing Systems, pp. 1249–1257 (2009)Google Scholar
- Passos, A., Vilnis, L., McCallum, A.: Optimization and learning in FACTORIE. In: NIPS Workshop on Optimization for Machine Learning (OPT) (2013)Google Scholar
- Plank, B., Hovy, D., McDonald, R.T., Søgaard, A.: Adapting taggers to Twitter with not-so-distant supervision. In: COLING, pp. 1783–1792 (2014)Google Scholar
- Strauss, B., Toma, B.E., Ritter, A., de Marneffe, M.-C., Xu, W.: Results of the WNUT16 named entity recognition shared task. In: Proceedings of the 2nd Workshop on Noisy User-Generated Text, pp. 138–144 (2016)Google Scholar
- Wang, Z., Wang, H., Wen, J.-R., Xiao, Y.: An inference approach to basic level of categorization. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM 2015, pp. 653–662. ACM, New York (2015). https://doi.org/10.1145/2806416.2806533. ISBN 978-1-4503-3794-6
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.