TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

  • Sepideh Mesbah
  • Christoph Lofi
  • Manuel Valle Torre
  • Alessandro Bozzon
  • Geert-Jan Houben
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11136)


Named Entity Recognition and Typing (NER/NET) is a challenging task, especially for long-tail entities such as those found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare and often relevant only in specific knowledge domains, yet they are important for retrieval and exploration. State-of-the-art NER approaches employ supervised machine learning models trained on expensive type-labeled data that human annotators produce laboriously. A common workaround is to generate labeled training data from knowledge bases (KBs); however, this is not suitable for long-tail entity types, which are, by definition, scarcely represented in KBs. This paper presents an iterative approach for training NER and NET classifiers on scientific publications that relies on minimal human input, namely a small seed set of instances of the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.
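
The iterative, seed-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trained NER/NET classifier is replaced here by a simple context-pattern matcher, the semantic expansion step is omitted, and the filter is a naive capitalization heuristic. The example sentences and the function names (`learn_contexts`, `extract_candidates`, `keep`, `bootstrap`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy corpus; a real run would use sentences from publications.
SENTENCES = [
    "We train on the WebKB dataset.",
    "We evaluate on the Reuters corpus.",
    "Accuracy drops on the second run.",
    "We also tested with Reuters data.",
    "We tested with Cora too.",
]


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens (hyphens allowed)."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)


def learn_contexts(sentences, known):
    """Collect the two lowercased tokens preceding each known entity.

    Assumption: this stands in for training a classifier on sentences
    that mention the current set of known entities.
    """
    contexts = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(2, len(tokens)):
            if tokens[i] in known:
                contexts.add((tokens[i - 2].lower(), tokens[i - 1].lower()))
    return contexts


def extract_candidates(sentences, contexts):
    """Return every token that follows one of the learned contexts."""
    candidates = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(2, len(tokens)):
            if (tokens[i - 2].lower(), tokens[i - 1].lower()) in contexts:
                candidates.add(tokens[i])
    return candidates


def keep(candidate):
    """Naive filter (an assumption): drop all-lowercase common words."""
    return not candidate.islower()


def bootstrap(sentences, seeds, max_iter=5):
    """Grow the entity set from the seeds until no new entity is found."""
    known = set(seeds)
    for _ in range(max_iter):
        contexts = learn_contexts(sentences, known)
        found = {c for c in extract_candidates(sentences, contexts) if keep(c)}
        new = found - known
        if not new:
            break
        known |= new
    return known


print(sorted(bootstrap(SENTENCES, {"WebKB"})))
# → ['Cora', 'Reuters', 'WebKB']
```

In this toy run, the single seed “WebKB” yields the context “on the”, which surfaces “Reuters” in the first iteration; the new context “tested with” learned from “Reuters” then surfaces “Cora” in the second iteration, while the lowercase candidate “second” is filtered out.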



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Sepideh Mesbah (1)
  • Christoph Lofi (1)
  • Manuel Valle Torre (1)
  • Alessandro Bozzon (1)
  • Geert-Jan Houben (1)

  1. Delft University of Technology, Delft, The Netherlands
