Real-Time News Grouping: Detecting the Same-Content News on Turkish News Stream

Kömeçoğlu, Yavuz; Kömeçoğlu, Başak Buluz; Yılmaz, Burcu

doi:10.1007/978-3-030-66501-2_2

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1301)

Included in the following conference series:

International Online Conference on Intelligent Decision Science

Abstract

The growing number of news sources makes news analysis difficult and increases the need for automated systems. This paper presents a system that clusters news items with similar content in real time. The system uses the Apache Solr database to record texts from the news sources and its MoreLikeThis (MLT) search component to retrieve the 5 most similar items from thousands of previously recorded news items. A new item is then added to the cluster of the most similar of these 5 pre-filtered candidates. The main problem addressed in this study is therefore finding the item that most closely resembles the new one, and a 2-step approach is proposed for it. Most news published by different media sources is produced by reusing the same text, or by expanding or shortening it. For this reason, a ‘citation rate’ is first calculated between news pairs. If any pairs exceed the citation threshold, the pair with the highest citation rate is placed in the same cluster. Otherwise, numerical representations (embeddings) of the texts at different levels are used to determine semantic similarity. The results show that the proposed 2-stage approach reduces the sensitivity of embeddings at different levels to text length, achieving up to a 7.6% improvement over clustering with embeddings alone. With real-time clustering and an F-score above 90%, the proposed system is suitable for real-life applications.
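The two-stage decision described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the citation rate, so the sketch assumes it is the share of the shorter text's word trigrams that also occur in the other text. The threshold value of 0.5, the function names, and the bag-of-words cosine used as a stand-in for the embedding similarity are likewise assumptions. The candidates are assumed to be the handful of texts already returned by the pre-filtering step.

```python
from collections import Counter
from math import sqrt


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def citation_rate(a, b, n=3):
    """ASSUMED definition: share of the shorter text's n-grams found in the other text."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    shorter, longer = (sa, sb) if len(sa) <= len(sb) else (sb, sa)
    return len(shorter & longer) / len(shorter)


def bow_cosine(a, b):
    """Stand-in for an embedding similarity: cosine over raw word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def pick_most_similar(new_text, candidates, citation_threshold=0.5, similarity=bow_cosine):
    """Return (index, stage) of the candidate judged most similar to `new_text`.

    Stage 1 uses the (assumed) citation rate; stage 2 falls back to `similarity`
    only when no candidate exceeds the (hypothetical) citation threshold.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    rates = [citation_rate(new_text, c) for c in candidates]
    best = max(range(len(candidates)), key=rates.__getitem__)
    if rates[best] > citation_threshold:
        return best, "citation"
    sims = [similarity(new_text, c) for c in candidates]
    return max(range(len(candidates)), key=sims.__getitem__), "embedding"


candidates = [
    "the central bank raised interest rates by two points on thursday",
    "local team wins the national football cup final",
]
extended = candidates[0] + " after a long meeting"   # same text, lengthened
reworded = "football cup final won by the local team"  # same topic, new wording

print(pick_most_similar(extended, candidates))
print(pick_most_similar(reworded, candidates))
```

In this toy run the lengthened copy is matched in the citation stage, while the reworded item shares too few trigrams and falls through to the similarity stage. In a real system the `similarity` argument would be replaced by a function built on trained word-, sentence-, or document-level embeddings.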

Acknowledgment

We would like to thank the Interpress Media Monitoring Agency, which provided the news data for the Turkish News Texts Corpus and the Grouped Turkish News Texts Test Set.

Author information

Authors and Affiliations

Kodiks Bilisim, Istanbul, Turkey
Yavuz Kömeçoğlu
Gebze Technical University, Kocaeli, Turkey
Başak Buluz Kömeçoğlu & Burcu Yılmaz

Corresponding author

Correspondence to Başak Buluz Kömeçoğlu.

Editor information

Editors and Affiliations

Faculty of Engineering and Natural Sciences, Bahçeşehir University, Istanbul, Turkey
Tofigh Allahviranloo
Faculty of Engineering and Natural Sciences, Bahçeşehir University, Istanbul, Turkey
Soheil Salahshour
Faculty of Engineering and Natural Sciences, Bahçeşehir University, Istanbul, Turkey
Nafiz Arica

Cite this paper

Kömeçoğlu, Y., Kömeçoğlu, B.B., Yılmaz, B. (2021). Real-Time News Grouping: Detecting the Same-Content News on Turkish News Stream. In: Allahviranloo, T., Salahshour, S., Arica, N. (eds) Progress in Intelligent Decision Science. IDS 2020. Advances in Intelligent Systems and Computing, vol 1301. Springer, Cham. https://doi.org/10.1007/978-3-030-66501-2_2

DOI: https://doi.org/10.1007/978-3-030-66501-2_2
Published: 30 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66500-5
Online ISBN: 978-3-030-66501-2
eBook Packages: Intelligent Technologies and Robotics (R0)
