Real-Time News Grouping: Detecting the Same-Content News on Turkish News Stream

  • Conference paper
Progress in Intelligent Decision Science (IDS 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1301)

Abstract

The increasing number of news sources makes news analysis difficult and increases the need for automated systems. This paper presents a system that clusters news items with similar content in real time. The system uses Apache Solr to capture texts from the news source and Solr's MoreLikeThis (MLT) search component to extract the 5 most similar items from thousands of previously recorded news items. Each incoming item is added to the cluster of the most similar of the 5 items obtained by this pre-filtering. The main problem addressed in this study is therefore finding the recorded item that most closely resembles the incoming one. A two-step approach is proposed for this purpose. Most news published by different media sources is either the same text or an expanded or shortened version of it. For this reason, a 'citation rate' is first calculated for each news pair. If any pair exceeds the citation threshold, the incoming item joins the cluster of the pair with the highest citation rate. Otherwise, numerical representations (embeddings) of the texts at different levels are used to measure semantic similarity. The results show that the proposed two-step approach reduces the sensitivity of embeddings at different levels to text length, achieving an improvement of up to 7.6% over clustering with embeddings alone. With an F-score above 90% and real-time clustering, the proposed system has a structure suitable for real-life applications.
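The two-step decision described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the citation rate, so `citation_rate` below assumes it is the share of the shorter text's word trigrams that also occur in the other text. The function names, both threshold values, the use of cosine similarity, and the choice to return `None` (start a new cluster) when neither step finds a match are all hypothetical. The candidate list stands in for the 5 items that Solr's MLT component would return.

```python
from math import sqrt


def shingles(text, n=3):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def citation_rate(a, b, n=3):
    """ASSUMED definition: share of the shorter text's n-grams found in the other."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    small, large = (sa, sb) if len(sa) <= len(sb) else (sb, sa)
    return len(small & large) / len(small)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pick_cluster(new_text, new_vec, candidates,
                 citation_threshold=0.5, semantic_threshold=0.8):
    """Choose a cluster for an incoming item.

    candidates: list of (cluster_id, text, embedding) for the pre-filtered items.
    Returns a cluster id, or None when no candidate is close enough
    (the None branch is an assumption, not stated in the abstract).
    """
    if not candidates:
        return None
    # Step 1: textual reuse. Take the candidate with the highest citation rate.
    rate, cid = max(((citation_rate(new_text, t), c) for c, t, _ in candidates),
                    key=lambda pair: pair[0])
    if rate >= citation_threshold:
        return cid
    # Step 2: fall back to semantic similarity between embeddings.
    sim, cid = max(((cosine(new_vec, v), c) for c, _, v in candidates),
                   key=lambda pair: pair[0])
    return cid if sim >= semantic_threshold else None


candidates = [
    ("c1", "the central bank raised interest rates by two points on thursday", [1.0, 0.0]),
    ("c2", "local team wins the national cup final", [0.0, 1.0]),
]
longer_copy = ("the central bank raised interest rates by two points on thursday "
               "after a long meeting")
chosen = pick_cluster(longer_copy, [0.0, 1.0], candidates)
```

In this example the incoming text is an extended copy of the first candidate, so step 1 selects `c1` even though the toy embedding points toward `c2`. That mirrors the abstract's claim that checking textual reuse first makes the result less sensitive to how embeddings react to text length.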

Acknowledgment

We would like to thank Interpress Media Monitoring Agency, which provided the news data for the Turkish News Texts Corpus and the Grouped Turkish News Texts Test Set.

Author information

Corresponding author

Correspondence to Başak Buluz Kömeçoğlu.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kömeçoğlu, Y., Kömeçoğlu, B.B., Yılmaz, B. (2021). Real-Time News Grouping: Detecting the Same-Content News on Turkish News Stream. In: Allahviranloo, T., Salahshour, S., Arica, N. (eds) Progress in Intelligent Decision Science. IDS 2020. Advances in Intelligent Systems and Computing, vol 1301. Springer, Cham. https://doi.org/10.1007/978-3-030-66501-2_2
