A system review on bootstrapping information extraction

Multimedia Tools and Applications

Abstract

Focusing on the supervision problems caused by high-cost and low-quality labeling in information extraction, we provide a detailed overview of the approaches proposed to solve the sub-tasks of bootstrapping information extraction. We summarize the current principal approaches and describe the specific issues addressed in recent research. To offer inspiration and a reference for similar studies in terms of mainstream data sources, evaluation specifications, and applications, we summarize the relevant datasets, evaluation metrics, and systematic applications of bootstrapping information extraction. In addition, we reflect on the remaining problems of bootstrapping information extraction and highlight some directions for future work.

Data availability statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Acknowledgements

This research is supported by the Central Leading Local Project “Fujian Mental Health Human-Computer Interaction Technology Research Center” (authorization number 2020L3024).

Author information

Corresponding author

Correspondence to Hui Fang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Ge Xu, Yunfei Long, Yin Guan, Xiaoyan Yang, and Zhou Chen contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fang, H., Xu, G., Long, Y. et al. A system review on bootstrapping information extraction. Multimed Tools Appl 83, 38329–38353 (2024). https://doi.org/10.1007/s11042-023-17005-1

  • DOI: https://doi.org/10.1007/s11042-023-17005-1