NER in Tweets Using Bagging and a Small Crowdsourced Dataset

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8686)


Named entity recognition (NER) systems for Twitter are very sensitive to cross-sample variation, and the performance of off-the-shelf systems varies from reasonable (F1: 60–70%) to completely useless (F1: 40–50%) across available Twitter datasets. This paper introduces a semi-supervised wrapper method for robust learning of sequential problems with many negative examples, such as NER, and shows that using a simple conditional random fields (CRF) model and a small crowdsourced dataset [4] leads to good NER performance across datasets.
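The abstract does not spell out the wrapper method, so the following is background only: a minimal sketch of plain bagging [1] applied to token-level tagging, where several taggers are trained on bootstrap resamples of the labeled sentences and their predictions are combined by per-token majority vote. It is not the paper's algorithm. The base learner is a toy most-frequent-tag lexicon standing in for a CRF, the semi-supervised component is omitted, and all function names (`train_lexicon_tagger`, `bagged_tagger`) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_lexicon_tagger(sentences):
    """Toy base learner (an assumed stand-in for a CRF).

    Learns the most frequent tag for each lowercased token and tags
    unseen tokens as "O" (outside any entity).
    """
    counts = defaultdict(Counter)
    for tokens, tags in sentences:
        for tok, tag in zip(tokens, tags):
            counts[tok.lower()][tag] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    return lambda tokens: [lexicon.get(t.lower(), "O") for t in tokens]


def bagged_tagger(sentences, n_models=10, seed=0):
    """Train n_models taggers on bootstrap resamples and vote per token."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(sentences) sentences with replacement.
        sample = [rng.choice(sentences) for _ in sentences]
        models.append(train_lexicon_tagger(sample))

    def predict(tokens):
        votes = [model(tokens) for model in models]
        # zip(*votes) groups the models' tags for each token position.
        return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]

    return predict


# Tiny illustrative training set (made up for this sketch).
train = [
    (["Obama", "visits", "Paris"], ["PER", "O", "LOC"]),
    (["Paris", "is", "nice"], ["LOC", "O", "O"]),
    (["I", "love", "Paris"], ["O", "O", "LOC"]),
]
predict = bagged_tagger(train, n_models=5, seed=1)
print(predict(["Paris", "rocks"]))
```

Voting per token is the simplest way to combine sequence predictions; it ignores label-sequence consistency, which a real system would likely need to handle.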


Keywords: Twitter, semi-supervised learning, bagging, crowdsourcing, named entity recognition, unlabeled data




  1. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
  2. Collins, M.: Discriminative training methods for Hidden Markov Models. In: EMNLP (2002)
  3. Eisenstein, J.: What to do about bad language on the internet. In: NAACL (2013)
  4. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010)
  5. Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL (2005)
  6. Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., Hovy, E.: Learning whom to trust with MACE. In: NAACL (2013)
  7. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML (2001)
  8. Liu, X., Zhang, S., Wei, F., Zhou, M.: Recognizing named entities in tweets. In: ACL (2011)
  9. Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: NAACL (2013)
  10. Piskorski, J., Ehrmann, M.: Named entity recognition in targeted Twitter streams in Polish. In: ACL Workshop on Balto-Slavic NLP (2013)
  11. Poibeau, T., Kosseim, L.: Proper name extraction from non-journalistic texts. In: CLIN (2000)
  12. Ritter, A., Clark, S., Etzioni, M., Etzioni, O.: Named entity recognition in tweets: an experimental study. In: EMNLP (2011)
  13. Rodrigues, F., Pereira, F., Ribeiro, B.: Sequence labeling with multiple annotators. Machine Learning, 1–17 (2013)
  14. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: HLT-NAACL (2003)
  15. Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL, Columbus, Ohio, pp. 665–673 (2008)
  16. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL (2010)
  17. Wang, C.-K., Hsu, B.-J., Chang, M.-W., Kiciman, E.: Simple and knowledge-intensive generative model for named entity recognition. Technical report, Microsoft Research (2013)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Center for Language Technology, University of Copenhagen, Denmark
