Open information extraction as an intermediate semantic structure for Persian text summarization

International Journal on Digital Libraries

Abstract

Semantic applications typically exploit structures such as dependency parse trees, phrase chunking, semantic role labeling, or open information extraction (Open IE). In this paper, we introduce a novel application of Open IE as an intermediate layer for text summarization. Text summarization is an important method for providing relevant information in large digital libraries. Open IE refers to the process of extracting machine-understandable structural propositions from text. We use these propositions as building blocks to shorten sentences and generate a summary of the text. The proposed system offers a new form of summarization that can break the structure of a sentence and extract its most significant sub-sentential elements. Other advantages include the ability to identify and eliminate less important parts of a sentence (such as adverbs, adjectives, appositions, or dependent clauses) as well as duplicated pieces of sentences, which in turn frees space for more sentences in the summary and improves its coverage and coherence. The proposed system is localized for the Persian language; however, it can be adapted to other languages. Experiments performed on the standard data set “Pasokh” with a standard comparison tool showed promising results for the proposed approach. We used summaries produced by the system in a real-world application in the virtual library of Shahid Beheshti University and received positive feedback from users.
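To make the idea of building a summary from propositions more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each Open IE proposition is a simple (argument, relation, argument) triple with an importance score already attached; the `Proposition` class, the Jaccard word-overlap test for duplicates, the greedy selection, and the `max_words` and `dup_threshold` parameters are all illustrative assumptions that do not come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """Hypothetical Open IE tuple: (argument 1, relation, argument 2)."""
    arg1: str
    relation: str
    arg2: str
    score: float  # assumed importance score; how it is computed is not specified here

    def text(self) -> str:
        return f"{self.arg1} {self.relation} {self.arg2}"

    def words(self) -> frozenset:
        return frozenset(self.text().lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Word-overlap similarity, used here as an illustrative duplicate test."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def summarize(props, max_words=15, dup_threshold=0.6):
    """Greedily keep high-scoring, non-duplicate propositions within a word budget."""
    chosen = []
    used = 0
    for p in sorted(props, key=lambda q: q.score, reverse=True):
        length = len(p.text().split())
        if used + length > max_words:
            continue  # this proposition does not fit in the remaining budget
        if any(jaccard(p.words(), c.words()) >= dup_threshold for c in chosen):
            continue  # near-duplicate of something already in the summary
        chosen.append(p)
        used += length
    return [p.text() for p in chosen]


# Toy, invented input for illustration only.
props = [
    Proposition("the library", "offers", "automatic summaries", 0.9),
    Proposition("the library", "offers", "summaries", 0.7),
    Proposition("users", "gave", "positive feedback", 0.8),
]
print(summarize(props))
```

Because the selection works on propositions rather than whole sentences, dropping a near-duplicate or a low-scoring proposition leaves room in the budget for content from other sentences, which is the effect the abstract describes.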

Notes

  1. http://swesum.nada.kth.se/index-farsi.html.

  2. http://swesum.nada.kth.se/index-eng.html.

  3. http://textmining.noornet.net/FA/Summarization.

  4. http://www.matnak.com.

  5. http://ijaz.um.ac.ir/.

  6. https://github.com/sobhe/baaz.

  7. http://www.mehrnews.com.

  8. An n-gram is n consecutive words from a given text.
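The n-gram definition in note 8 can be illustrated with a short sketch. It assumes whitespace tokenization, which is a simplification chosen for this example; the function name `ngrams` is hypothetical and is not taken from the article or its evaluation tool.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()  # assumption: words are separated by whitespace
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Bigrams (n = 2) of a four-word phrase
print(ngrams("open information extraction system", 2))
```

A text with fewer than n words yields an empty list, since it contains no run of n consecutive words.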

Acknowledgements

We would like to thank the anonymous reviewers for their constructive comments, Asef Pourmasoumi for providing us with the data and benchmarking tool of the Pasokh corpus, Azadeh Zamanifar for sharing the code of their summarizer, and Seyedamin Monemian for his help with running the experiments.

Author information

Corresponding author

Correspondence to Mahmoud Rahat.

About this article

Cite this article

Rahat, M., Talebpour, A. Open information extraction as an intermediate semantic structure for Persian text summarization. Int J Digit Libr 19, 339–352 (2018). https://doi.org/10.1007/s00799-018-0244-z

  • DOI: https://doi.org/10.1007/s00799-018-0244-z
