
Discovery of interactive graphs for understanding and searching time-indexed corpora

  • Regular Paper
Knowledge and Information Systems

Abstract

Rich information spaces (like the Web or scientific publications) are full of “stories”: sets of statements that evolve over time, manifested as, for example, collections of news articles reporting events that relate to an evolving crime investigation, sets of news articles and blog posts accompanying the development of a political election campaign, or sequences of scientific papers on a topic. In this paper, we formulate the problem of discovering such stories as Evolutionary Theme Pattern Discovery, Summary and Exploration (ETP3). We propose a method and a visualisation tool for solving ETP3 by understanding, searching and interacting with such stories and their underlying documents. In contrast to existing approaches, our method concentrates on relational information and on local patterns rather than on the occurrence of individual concepts and global models. In addition, it relies on interactive graphs rather than natural language as the abstracted story representations. Furthermore, we present an evaluation framework. Two real-life case studies are used to illustrate and evaluate the method and tool.



Author information

Corresponding author

Correspondence to Bettina Berendt.


About this article

Cite this article

Subašić, I., Berendt, B. Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowl Inf Syst 23, 293–319 (2010). https://doi.org/10.1007/s10115-009-0227-x


  • DOI: https://doi.org/10.1007/s10115-009-0227-x
