Abstract
Focusing on the supervision problems caused by costly and low-quality labeling in information extraction, we provide a detailed overview of the approaches proposed to solve the sub-tasks of bootstrapping information extraction. We summarize the current principal approaches and describe the specific issues addressed in recent research. To give similar studies a reference for mainstream data sources, evaluation specifications, and applications, we summarize the relevant datasets, evaluation metrics, and systematic applications of bootstrapping information extraction. In addition, we reflect on the remaining problems of bootstrapping information extraction and highlight some directions for future work.
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This research is supported by the Central Leading Local Project “Fujian Mental Health Human-Computer Interaction Technology Research Center”, under authorization number 2020L3024.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Ge Xu, Yunfei Long, Yin Guan, Xiaoyan Yang, and Zhou Chen contributed equally to this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fang, H., Xu, G., Long, Y. et al. A system review on bootstrapping information extraction. Multimed Tools Appl 83, 38329–38353 (2024). https://doi.org/10.1007/s11042-023-17005-1
DOI: https://doi.org/10.1007/s11042-023-17005-1