
Extracting Schema Knowledge from Text Documents: Part I. Problem Formulation and Method Overview

Published in: Scientific and Technical Information Processing

Abstract

This article discusses the importance of automatic schema knowledge extraction for natural language processing. It gives a broad overview of methods and approaches for describing and extracting schemas and considers theoretical approaches to formalizing schemas. Problems that require knowledge of the schema structure of texts are listed. Popular approaches to automatic schema knowledge extraction are presented, along with quality control methods for these approaches. Finally, a list of open corpora that can be used for schema extraction is provided; some of these corpora were created specifically for this purpose, while others are universal.


Notes

  1. http://cs.rochester.edu/nlp/rocstories/.

  2. https://www.usna.edu/Users/cs/nchamber/.

  3. http://www.coli.uni-saarland.de/projects/smile/.

  4. https://catalog.ldc.upenn.edu/LDC2008T19.

  5. https://catalog.ldc.upenn.edu/LDC2012T21.

  6. https://www.usna.edu/Users/cs/nchamber/data/chains/.

  7. http://www.timeml.org/timebank/timebank.html.



Funding

The study was supported by the Russian Foundation for Basic Research, project no. 17-29-07033.

Author information

Correspondence to M. I. Suvorova.

Additional information

Translated by A. Ovchinnikova


Cite this article

Suvorova, M.I., Kobozeva, M.V., Sokolova, E.G. et al. Extracting Schema Knowledge from Text Documents: Part I. Problem Formulation and Method Overview. Sci. Tech. Inf. Proc. 48, 517–523 (2021). https://doi.org/10.3103/S0147688221060125
