A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2016)

Abstract

Journalists and media monitoring companies need to cluster news in large collections of web articles to get fast access to their topics or events of interest. Our aim is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. Estimating the correct number of topics is challenging because of “noise”, i.e., news articles that are unrelated to any of the topics. In this context, we introduce a novel density-based news clustering framework in which the well-established Latent Dirichlet Allocation (LDA) assigns news articles to topics, while the number of clusters is estimated by our novel method, the “DBSCAN-Martingale”, which separates noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering that operate without knowing the number of clusters k, the DBSCAN-Martingale framework provides the correct number of clusters and achieves the highest Normalized Mutual Information (NMI).
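The abstract reports Normalized Mutual Information as the evaluation measure but does not state which normalization is used. As a minimal sketch, assuming the arithmetic-mean normalization NMI = 2·I(U;V) / (H(U) + H(V)), the score between a ground-truth labelling and a predicted clustering could be computed as follows. The function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a flat list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n * n_ij / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("labellings must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both labellings put every item in one cluster: treat as identical.
        return 1.0
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Hypothetical toy labels: the score is invariant to renaming of clusters.
truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
print(nmi(truth, same_partition), nmi(truth, unrelated))
```

Because the score depends only on how items are grouped, a clustering that recovers the true partition under different cluster identifiers still scores 1, while a partition that is statistically independent of the ground truth scores 0.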

Corresponding author

Correspondence to Ilias Gialampoukidis.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_13

  • DOI: https://doi.org/10.1007/978-3-319-41920-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41919-0

  • Online ISBN: 978-3-319-41920-6

  • eBook Packages: Computer Science (R0)
