Robust Single-Document Summarizations and a Semantic Measurement of Quality

  • Liqun Shao
  • Hao Zhang
  • Jie Wang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 976)

Abstract

The goal of this paper is to generate an effective summary for a given document under specific realtime requirements. We use the softplus function to enhance keyword rankings so that they favor important sentences, and on this basis we present a number of extractive summarization algorithms that use various keyword extraction and topic clustering methods. We show that our algorithms not only meet the realtime requirements but also yield the best ROUGE scores on DUC-02 among all previously known algorithms. We also evaluate our summarization methods on the SummBank dataset and other datasets to ensure that they are robust. Experiments show that summaries generated by our methods achieve ROUGE scores higher than or about the same as those of extractive summaries generated by human evaluators. Moreover, we define a semantic measure, based on word embeddings and Word Mover’s Distance, to evaluate the quality of summaries without human-generated benchmarks. We show that for our algorithms, the orderings of the ROUGE scores and the scores under the new measure are highly comparable, suggesting that this new measure may serve as a viable alternative for measuring the quality of a summary.
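As a rough illustration of the softplus idea mentioned in the abstract, the sketch below scores each sentence by summing softplus(x) = ln(1 + e^x) over the ranking scores of the keywords it contains, then orders sentences by that score. The abstract does not give the exact scoring formula, so the function names (`softplus`, `score_sentence`, `rank_sentences`), the whitespace tokenization, and the summation rule are all assumptions made for illustration, not the paper's algorithm.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def score_sentence(words, keyword_scores):
    """Sum softplus-transformed keyword scores over the words of a sentence.

    Assumption: words without a keyword score contribute nothing.
    """
    return sum(softplus(keyword_scores[w]) for w in words if w in keyword_scores)


def rank_sentences(sentences, keyword_scores):
    """Return sentences ordered by descending score; ties keep document order."""
    scored = [
        (score_sentence(s.lower().split(), keyword_scores), i)
        for i, s in enumerate(sentences)
    ]
    order = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [sentences[i] for _, i in order]


if __name__ == "__main__":
    # Hypothetical keyword scores, e.g. from any keyword extraction method.
    scores = {"keyword": 2.0, "ranking": 1.0}
    docs = ["the cat sat", "summary ranking keyword", "keyword here"]
    for sentence in rank_sentences(docs, scores):
        print(sentence)
```

An extractive summary under this sketch would simply take the top-ranked sentences until a length budget is reached; how the paper combines this with topic clustering is not described in the abstract and is left out here.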

Keywords

Single-document summarizations, Keyword ranking, Topic clustering, Word embedding, SoftPlus function, Semantic similarity, Summarization evaluation, Realtime

Notes

Acknowledgements

We thank Ming Jia, Jingwen Wang, Cheng Zhang, Wenjing Yang, and the other members of the Text Automation Lab at UMass Lowell for their support and fruitful discussions. We are grateful to Prof. Hong Yu for making the SummBank dataset available for this study.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Massachusetts, Lowell, USA
