Abstract
Rich information spaces (such as the Web or scientific publications) are full of “stories”: sets of statements that evolve over time. These are manifested as, for example, collections of news articles reporting events that relate to an evolving crime investigation, sets of news articles and blog posts accompanying the development of a political election campaign, or sequences of scientific papers on a topic. In this paper, we formulate the problem of discovering such stories as Evolutionary Theme Pattern Discovery, Summary and Exploration (ETP3). We propose a method and a visualisation tool for solving ETP3 by understanding, searching and interacting with such stories and their underlying documents. In contrast to existing approaches, our method concentrates on relational information and on local patterns rather than on the occurrence of individual concepts and on global models. In addition, it relies on interactive graphs rather than natural language as abstracted story representations. Furthermore, we present an evaluation framework. Two real-life case studies are used to illustrate and evaluate the method and tool.
Cite this article
Subašić, I., Berendt, B. Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowl Inf Syst 23, 293–319 (2010). https://doi.org/10.1007/s10115-009-0227-x
DOI: https://doi.org/10.1007/s10115-009-0227-x