Active Learning with Adaptive Density Weighted Sampling for Information Extraction from Scientific Papers

  • Roman Suvorov
  • Artem Shelmanov
  • Ivan Smirnov
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)


The paper addresses the task of information extraction from scientific literature with machine learning methods. In particular, it considers the extraction of definitions and results from scientific publications in Russian. Annotating scientific texts to create a training dataset is a very labor-intensive and expensive process. To tackle this problem, we propose methods and tools based on active learning. We describe and evaluate a novel adaptive density-weighted sampling (ADWeS) meta-strategy for active learning. The experiments demonstrate that active learning can be a very efficient technique for scientific text mining, and that the proposed meta-strategy can be beneficial when annotating corpora with a strongly skewed class distribution. We also investigate informative task-independent features for information extraction from scientific texts and present an openly available corpus annotation tool, which is equipped with ADWeS and compatible with well-known sampling strategies.
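The abstract does not state how ADWeS adapts its density weighting, so the following is not the authors' method. It is only a minimal sketch of the standard, non-adaptive density-weighted uncertainty sampling that such meta-strategies build on: each unlabeled example is scored by its uncertainty multiplied by its average similarity to the rest of the pool. The function names, the least-confidence uncertainty measure, the cosine similarity, and the exponent `beta` are all assumptions made for illustration.

```python
import numpy as np

def uncertainty(probs):
    """Least-confidence uncertainty: 1 minus the top class probability per row."""
    return 1.0 - probs.max(axis=1)

def density(X):
    """Mean cosine similarity of each unlabeled example to the whole pool.

    Negative means are clipped to zero so that a fractional exponent is safe.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit.T                    # pairwise cosine similarities
    return np.clip(sims.mean(axis=1), 0.0, None)

def density_weighted_scores(probs, X, beta=1.0):
    """Base uncertainty weighted by density**beta (beta is a hypothetical knob)."""
    return uncertainty(probs) * density(X) ** beta

def select_batch(probs, X, batch_size=1, beta=1.0):
    """Indices of the highest-scoring unlabeled examples, best first."""
    scores = density_weighted_scores(probs, X, beta)
    return np.argsort(-scores)[:batch_size]

# Toy pool: examples 0 and 1 lie close together, example 2 is an outlier.
X_pool = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
# All three are equally uncertain, so density alone decides the query.
probs_pool = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
chosen = select_batch(probs_pool, X_pool, batch_size=1)
```

With `beta=0` the density term is switched off and the score reduces to plain uncertainty sampling, which is why this kind of weighting can wrap any base sampling strategy.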


Information extraction; Deep linguistic analysis; Active machine learning; Scientific texts analysis



The project is supported by the Russian Foundation for Basic Research, project number: 16-29-07210 “ofi_m”.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia
