Knowledge and Information Systems, Volume 23, Issue 3, pp 293–319

Discovery of interactive graphs for understanding and searching time-indexed corpora

Regular Paper

Abstract

Rich information spaces (like the Web or scientific publications) are full of “stories”: sets of statements that evolve over time, manifested as, for example, collections of news articles reporting events that relate to an evolving crime investigation, sets of news articles and blog posts accompanying the development of a political election campaign, or sequences of scientific papers on a topic. In this paper, we formulate the problem of discovering such stories as Evolutionary Theme Pattern Discovery, Summary and Exploration (ETP3). We propose a method and a visualisation tool for solving ETP3 by understanding, searching and interacting with such stories and their underlying documents. In contrast to existing approaches, our method concentrates on relational information and on local patterns rather than on the occurrence of individual concepts and global models. In addition, it relies on interactive graphs rather than natural language as the abstracted story representations. Furthermore, we present an evaluation framework. Two real-life case studies are used to illustrate and evaluate the method and tool.

Keywords

Text mining, Web mining, Graphical user interfaces

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  1. Department of Computer Science, Katholieke Universiteit Leuven, Leuven-Heverlee, Belgium
