Abstract—
This article discusses the importance of automatic schema knowledge extraction for natural language processing. It gives a broad overview of methods and approaches for describing and extracting schemas and considers theoretical approaches to formalizing schemas. Problems that require knowledge of the schema structure of texts are listed. Popular approaches to automatic schema knowledge extraction are presented, along with quality-control methods for these approaches. Finally, a list of open corpora suitable for schema extraction is provided; some of these were created specifically for this purpose, while others are universal.
Funding
The study was supported by the Russian Foundation for Basic Research, project no. 17-29-07033.
Author information
Authors and Affiliations
Corresponding author
Additional information
Translated by A. Ovchinnikova
Cite this article
Suvorova, M.I., Kobozeva, M.V., Sokolova, E.G. et al. Extracting Schema Knowledge from Text Documents: Part I. Problem Formulation and Method Overview. Sci. Tech. Inf. Proc. 48, 517–523 (2021). https://doi.org/10.3103/S0147688221060125