Representations for multi-document event clustering

De Smet, Wim; Moens, Marie-Francine

doi:10.1007/s10618-012-0270-1

Published: 20 June 2012

Volume 26, pages 533–558, (2013)

Data Mining and Knowledge Discovery

Wim De Smet¹ &
Marie-Francine Moens¹

Abstract

We study several techniques for representing, fusing and comparing content representations of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All methods used are completely unsupervised. We find that simple methods can still outperform the current state-of-the-art techniques.
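
As a rough illustration of the kind of comparison the abstract describes, the sketch below represents each document as two term-count vectors (topical terms and named entities), compares each pair with cosine similarity, and fuses the two scores with a weighted average. It is not the authors' method: the function names, the hand-supplied `entities` set, the `weight` parameter and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def split_terms(tokens, entities):
    """Split tokens into (topical terms, named entities) count vectors.

    `entities` is a hypothetical, hand-supplied set of entity strings;
    a real system would obtain it from a named entity recognizer.
    """
    topical = Counter(t for t in tokens if t not in entities)
    named = Counter(t for t in tokens if t in entities)
    return topical, named


def fused_similarity(doc_a, doc_b, entities, weight=0.5):
    """Compare each term class separately, then mix the two scores.

    `weight` (an assumed parameter) is the share given to topical terms.
    """
    top_a, ne_a = split_terms(doc_a, entities)
    top_b, ne_b = split_terms(doc_b, entities)
    return weight * cosine(top_a, top_b) + (1.0 - weight) * cosine(ne_a, ne_b)


# Toy data, invented for this example.
entities = {"brussels", "leuven"}
doc1 = "strike halts trains in brussels".split()
doc2 = "trains halted by strike in brussels".split()
doc3 = "university opens new lab in leuven".split()

sim_same = fused_similarity(doc1, doc2, entities)  # same event
sim_diff = fused_similarity(doc1, doc3, entities)  # different event
```

Under these assumptions the two reports of the same event score higher than the unrelated pair, because they share both topical terms and the named entity. Scoring the two term classes separately and then combining them is only one possible fusion strategy; merging both classes into a single vector before comparison is another.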

Author information

Authors and Affiliations

Department of Computer Science, K.U. Leuven, Leuven, Belgium
Wim De Smet & Marie-Francine Moens

Corresponding author

Correspondence to Wim De Smet.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

De Smet, W., Moens, MF. Representations for multi-document event clustering. Data Min Knowl Disc 26, 533–558 (2013). https://doi.org/10.1007/s10618-012-0270-1

Received: 27 October 2008
Accepted: 11 May 2012
Published: 20 June 2012
Issue Date: May 2013
DOI: https://doi.org/10.1007/s10618-012-0270-1
