Discovery of interactive graphs for understanding and searching time-indexed corpora
Abstract
Rich information spaces (like the Web or scientific publications) are full of “stories”: sets of statements that evolve over time, manifested as, for example, collections of news articles reporting events that relate to an evolving crime investigation, sets of news articles and blog posts accompanying the development of a political election campaign, or sequences of scientific papers on a topic. In this paper, we formulate the problem of discovering such stories as Evolutionary Theme Pattern Discovery, Summary and Exploration (ETP3). We propose a method and a visualisation tool for solving ETP3 by understanding, searching and interacting with such stories and their underlying documents. In contrast to existing approaches, our method concentrates on relational information and on local patterns rather than on the occurrence of individual concepts and global models. In addition, it relies on interactive graphs rather than natural language as the abstracted story representations. Furthermore, we present an evaluation framework. Two real-life case studies are used to illustrate and evaluate the method and tool.
Keywords
Text mining, Web mining, Graphical user interfaces