Representations for multi-document event clustering

Data Mining and Knowledge Discovery

Abstract

We study several techniques for representing, fusing, and comparing the content of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All methods used are completely unsupervised. We find that simple methods can still outperform current state-of-the-art techniques.
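As an illustration of what comparing and fusing such representations can look like, the following minimal sketch computes a cosine similarity separately over topical terms and over named entities in a plain term-count vector space, then combines the two scores linearly. It is not the method evaluated in the article: the function names, the dictionary layout of a document, the linear combination, and the `entity_weight` value are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def fused_similarity(doc_a, doc_b, entity_weight=0.5):
    """Hypothetical fusion: weighted average of two per-channel similarities.

    Each document is assumed to be a dict with two token lists,
    "topical" and "entities". The weight is an illustrative choice.
    """
    topical = cosine(Counter(doc_a["topical"]), Counter(doc_b["topical"]))
    entities = cosine(Counter(doc_a["entities"]), Counter(doc_b["entities"]))
    return (1 - entity_weight) * topical + entity_weight * entities


# Toy documents, invented for this example.
doc_a = {"topical": ["earthquake", "rescue", "damage"],
         "entities": ["Chile", "Santiago"]}
doc_b = {"topical": ["earthquake", "damage", "aid"],
         "entities": ["Chile"]}

score = fused_similarity(doc_a, doc_b)
```

Keeping the two term classes in separate vectors lets the weight express how much named entities should count relative to topical vocabulary when deciding whether two articles report the same event. The resulting pairwise scores could then be passed to any unsupervised clustering procedure.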

Author information

Corresponding author

Correspondence to Wim De Smet.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

De Smet, W., Moens, MF. Representations for multi-document event clustering. Data Min Knowl Disc 26, 533–558 (2013). https://doi.org/10.1007/s10618-012-0270-1

  • DOI: https://doi.org/10.1007/s10618-012-0270-1
