Journal of Computer Science and Technology, Volume 25, Issue 4, pp 739–749

A New Approach for Multi-Document Update Summarization

Regular Paper

Abstract

Fast-changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach to multi-document update summarization. The best summary is defined as the one that has the minimum information distance to the entire document set. The best update summary is the one that has the minimum conditional information distance to a document cluster, given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) show that the summaries produced by our method correlate closely with human summaries and that the method outperforms other systems such as LexRank in many categories under the ROUGE evaluation criterion.
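
The two selection criteria in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: Kolmogorov complexity is replaced by zlib-compressed length, the conditional complexity K(x|y) is approximated as C(y + x) - C(y), and the search is a brute-force scan over sentence subsets. All function names are hypothetical, and the abstract does not say which approximation the authors actually use.

```python
import zlib
from itertools import combinations


def c(text: str) -> int:
    """Compressed length in bytes: an assumed stand-in for Kolmogorov complexity."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def cond(x: str, y: str) -> int:
    """Assumed approximation of K(x | y) as C(y + x) - C(y)."""
    return c(y + " " + x) - c(y)


def info_distance(x: str, y: str, given: str = "") -> int:
    """Approximate max{K(x | y), K(y | x)}; if `given` is set, condition both on it."""
    if given:
        return max(cond(x, given + " " + y), cond(y, given + " " + x))
    return max(cond(x, y), cond(y, x))


def best_summary(sentences, documents, k, prior=""):
    """Pick the k sentences whose concatenation is closest to the document set.

    With `prior` empty this mimics plain summarization; with `prior` set to the
    text already read, it mimics update summarization (conditional distance).
    Brute force, so only suitable for tiny illustrative inputs.
    """
    target = " ".join(documents)
    best = min(
        combinations(sentences, k),
        key=lambda subset: info_distance(" ".join(subset), target, prior),
    )
    return list(best)


if __name__ == "__main__":
    earlier = ["The storm reached the coast on Monday."]
    later = [
        "The storm reached the coast on Monday.",
        "Power was restored to most homes by Friday.",
    ]
    candidates = later
    print(best_summary(candidates, later, k=1))
    print(best_summary(candidates, later, k=1, prior=" ".join(earlier)))
```

Compression length is only a rough proxy on short texts, and zlib's 32 KB window limits how much of a large document set it can exploit, so this sketch shows the shape of the objective rather than a usable system.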

Keywords

data mining, text mining, Kolmogorov complexity, information distance

Copyright information

© Springer 2010

Authors and Affiliations

  • Chong Long (1)
  • Min-Lie Huang (1)
  • Xiao-Yan Zhu (1)
  • Ming Li (2)
  1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
  2. School of Computer Science, University of Waterloo, Waterloo, Canada
