A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA

  • Ilias Gialampoukidis
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9729)

Abstract

Journalists and media monitoring companies increasingly need to cluster news within large volumes of web articles, in order to ensure fast access to the topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. Estimating the correct number of topics is challenging due to the presence of “noise”, i.e. news articles that are not relevant to any topic. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, while the number of clusters is estimated by our novel method, called “DBSCAN-Martingale”, which extracts noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering that do not require the number of clusters k in advance, the DBSCAN-Martingale framework provides the correct number of clusters and the highest Normalized Mutual Information.
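To make the two-stage idea concrete, the sketch below is a minimal illustration of such a pipeline, not the authors' exact algorithm: it assumes the number of clusters k is estimated by running DBSCAN over a range of density levels and counting the clusters that first appear at each level, and then fits LDA with that k to assign each article to a topic. The helper estimate_k, the eps_levels grid, min_samples, and the toy documents are all hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def estimate_k(X, eps_levels, min_samples=3):
    """Illustrative estimate of k: count clusters that first appear as the
    density threshold (eps) is progressively relaxed."""
    covered = np.zeros(X.shape[0], dtype=bool)    # points already claimed by some cluster
    k = 0
    for eps in sorted(eps_levels):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        for c in set(labels) - {-1}:              # -1 marks DBSCAN noise
            members = labels == c
            if not covered[members].any():        # a cluster not seen at denser levels
                k += 1
                covered |= members
    return max(k, 1)

# Toy documents standing in for a corpus of news articles (hypothetical data).
docs = [
    "Stock markets rally after the central bank rate decision",
    "Cup final ends with a dramatic penalty shootout",
    "New satellite launched to monitor climate change",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
X = counts.toarray().astype(float)

# Stage 1: estimate the number of topics without fixing k in advance.
k = estimate_k(X, eps_levels=np.linspace(0.5, 5.0, 10))

# Stage 2: assign articles to the k estimated topics with LDA.
lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
topics = lda.transform(counts).argmax(axis=1)     # hard topic assignment per article
print(k, topics)
```

In this sketch DBSCAN only supplies the cluster count, so articles flagged as noise at every density level never force an extra topic, while the final article-to-topic assignment is left entirely to LDA, mirroring the division of labour described in the abstract.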

Keywords

Clustering news articles · Latent Dirichlet Allocation · DBSCAN-Martingale

References

  1. Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Mining Text Data, pp. 77–128. Springer, US (2012)
  2. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure. In: ACM SIGMOD Record, vol. 28(2), pp. 49–60. ACM (1999)
  3. Ball, G.H., Hall, D.J.: ISODATA, a novel method of data analysis and pattern classification. Stanford Research Institute (NTIS No. AD 699616) (1965)
  4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
  5. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013, Part II. LNCS, vol. 7819, pp. 160–172. Springer, Heidelberg (2013)
  6. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3(1), 1–27 (1974)
  7. Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A.: NbClust: an R package for determining the relevant number of clusters in a data set. Journal of Statistical Software 61(6), 1–36 (2014)
  8. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 2, 224–227 (1979)
  9. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, vol. 3. Wiley, New York (1973)
  10. Dunn, J.C.: Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4(1), 95–104 (1974)
  11. Doob, J.L.: Stochastic Processes, vol. 101. Wiley, New York (1953)
  12. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96(34), pp. 226–231 (1996)
  13. Frey, T., Van Groenewoud, H.: A cluster analysis of the D2 matrix of white spruce stands in Saskatchewan based on the maximum-minimum principle. The Journal of Ecology, 873–886 (1972)
  14. Halkidi, M., Vazirgiannis, M., Batistakis, Y.: Quality scheme assessment in the clustering process. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 265–276. Springer, Heidelberg (2000)
  15. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, New York (1975)
  16. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985)
  17. Hubert, L.J., Levin, J.R.: A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83(6), 1072 (1976)
  18. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, 1st edn. Wiley, New York (1990)
  19. Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, 23–34 (1988)
  20. Kulis, B., Jordan, M.I.: Revisiting k-means: new algorithms via Bayesian nonparametrics. arXiv preprint arXiv:1111.0352 (2012)
  21. Kumar, A., Daumé, H.: A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 393–400 (2011)
  22. McClain, J.O., Rao, V.R.: Clustisz: a program to test for the quality of clustering of a set of objects. Journal of Marketing Research 12(4), 456 (1975)
  23. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2), 159–179 (1985)
  24. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
  25. Qian, M., Zhai, C.: Unsupervised feature selection for multi-view clustering on text-image web news data. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pp. 1963–1966. ACM (2014)
  26. Ratkowsky, D.A., Lance, G.N.: A criterion for determining the number of groups in a classification. Australian Computer Journal 10(3), 115–117 (1978)
  27. Sander, J., Qin, X., Lu, Z., Niu, N., Kovarsky, A.: Automatic extraction of clusters from hierarchical clustering representations. In: Advances in Knowledge Discovery and Data Mining, pp. 75–87. Springer, Heidelberg (2003)
  28. Schneider, J., Vlachos, M.: Fast parameterless density-based clustering via random projections. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pp. 861–866. ACM (2013)
  29. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476) (2006)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ilias Gialampoukidis (1, corresponding author)
  • Stefanos Vrochidis (1)
  • Ioannis Kompatsiaris (1)

  1. Information Technologies Institute, CERTH, Thessaloniki, Greece
