Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

  • David Graus
  • Manos Tsagkias
  • Lars Buitinck
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


The manual curation of knowledge bases is a bottleneck in fast-paced domains where new concepts constantly emerge. Identifying nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to identify, in the setting of social streams, entities that will become concepts in a knowledge base. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show that our method significantly outperforms a lexical-matching baseline by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and the textual quality of input documents.
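The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PseudoLabeledDoc` class, the `sample_pseudo_ground_truth` function, the score ranges, and both thresholds are hypothetical, chosen only to show how entity confidence scores and a textual-quality score might jointly filter automatically labeled documents before they are used as training data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoLabeledDoc:
    """A document annotated automatically rather than by hand (hypothetical structure)."""
    text: str
    entity_confidences: List[float]  # assumed: one confidence score per linked entity mention
    quality: float                   # assumed: textual-quality score in [0, 1]


def sample_pseudo_ground_truth(docs, min_confidence=0.5, min_quality=0.5):
    """Keep documents whose entity annotations and text both clear a threshold.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    selected = []
    for doc in docs:
        if not doc.entity_confidences:
            continue  # no linked entities, so nothing to train on
        if min(doc.entity_confidences) < min_confidence:
            continue  # at least one annotation is too uncertain
        if doc.quality < min_quality:
            continue  # text is too noisy
        selected.append(doc)
    return selected


docs = [
    PseudoLabeledDoc("Alpha releases new handset", [0.9, 0.8], 0.9),
    PseudoLabeledDoc("omg lol alpha!!!", [0.9], 0.2),
    PseudoLabeledDoc("Beta signs deal with Gamma", [0.9, 0.3], 0.8),
    PseudoLabeledDoc("nothing here", [], 0.9),
]
training_set = sample_pseudo_ground_truth(docs)
```

In this toy input only the first document survives both filters. A real pipeline would obtain the confidence scores from an entity linker and the quality score from document-level features.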


Knowledge Base, Ground Truth, Entity Recognition, Textual Quality, Anchor Text



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • David Graus (1)
  • Manos Tsagkias (1)
  • Lars Buitinck (1)
  • Maarten de Rijke (1)
  1. ISLA, University of Amsterdam, Amsterdam, The Netherlands
