A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA

Gialampoukidis, Ilias; Vrochidis, Stefanos; Kompatsiaris, Ioannis

doi:10.1007/978-3-319-41920-6_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9729)

Included in the following conference series:

International Conference on Machine Learning and Data Mining in Pattern Recognition

Abstract

Journalists and media monitoring companies need to cluster large collections of web news articles so that they can quickly reach their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. Estimating the correct number of topics is challenging because of “noise”, i.e. news articles that are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework in which news articles are assigned to topics by the well-established Latent Dirichlet Allocation (LDA), while the number of clusters is estimated by our novel method, called “DBSCAN-Martingale”, which extracts noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering that do not know the number of clusters k in advance, the DBSCAN-Martingale framework provides the correct number of clusters and the highest Normalized Mutual Information.
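
The evaluation criterion named in the abstract, Normalized Mutual Information (NMI), can be illustrated with a minimal sketch. The abstract does not say which normalisation the authors use; the version below assumes division by the arithmetic mean of the two label entropies, and the function names are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI, normalised by the arithmetic mean of the entropies (an assumption)."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both labelings put every item in a single cluster: treat as agreement.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]        # hypothetical ground-truth topics
    predicted = [0, 0, 1, 1, 1, 1]    # hypothetical clustering that merges two topics
    print(round(nmi(truth, predicted), 3))
```

NMI is invariant to relabelling of the clusters, so a clustering that matches the ground truth up to a permutation of cluster identifiers scores approximately 1, while an independent labeling scores approximately 0. In this sketch a noise label, if present, is treated as one more cluster; how the paper handles noise in the evaluation is not stated in the abstract.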

Author information

Authors and Affiliations

Information Technologies Institute, CERTH, Thessaloniki, Greece
Ilias Gialampoukidis, Stefanos Vrochidis & Ioannis Kompatsiaris

Corresponding author

Correspondence to Ilias Gialampoukidis.

Editor information

Editors and Affiliations

IBaI, Inst of Comp Vision and applied Comp Sci, Leipzig, Sachsen, Germany
Petra Perner

Cite this paper

Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science (LNAI), vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_13

DOI: https://doi.org/10.1007/978-3-319-41920-6_13
Published: 28 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer Science, Computer Science (R0)