Automatic news-roundup generation using clustering, extraction, and presentation

  • Vincent Utomo
  • Jenq-Shiou Leu
Regular Paper


Along with the growth of the internet, the amount of published information has increased exponentially. This huge flow of information causes a problem called “information overload”, which makes it harder for internet users to find the key information they need. To address this, this paper proposes an application that helps users easily find trending news related to their query or interest. The challenges include how to determine the trending subtopics, how to extract only the content of each webpage, and how to present the data to the user. Accordingly, the study uses three core modules: clustering, extraction, and presentation. Several clustering methods are tested, including naïve, manual thresholding, and heuristic methods. The results show that hierarchical clustering using tf–idf word weighting, cosine similarity as the distance measure, and heuristic termination by elbow point analysis achieves the best result, at 50.84% accuracy (Acc) and 61.96% normalized mutual information (NMI). One challenge commonly faced by extraction algorithms is that their effectiveness tends to decline over time. This paper uses an extraction algorithm that relies on a previously known subject/keyword to assist the content extraction process. A second stage of noise removal is also introduced to further remove noise that remains within the content block. The evaluation shows a score improvement of 7.48%. Respondents rated the final application 4.18 out of 5 for helpfulness and 4.35 out of 5 for effectiveness, indicating that the proposed application can help users find information and mitigate the information overload problem.
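The best-performing clustering configuration named in the abstract (tf–idf weighting, cosine distance, hierarchical clustering stopped at an elbow point) can be illustrated with a minimal sketch. The abstract does not specify the tokenizer, the idf formula, the linkage criterion, or the exact elbow rule, so the whitespace tokenizer, the smoothed idf, average linkage, and the "largest jump in merge distance" rule below are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (term -> weight) per document.

    Assumption: whitespace tokenization and idf = ln(N / df) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(vectors):
    """Merge clusters bottom-up and record (merge distance, clusters) per step.

    Assumption: average linkage; the source does not name the linkage used.
    """
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[p][q] for p in clusters[i] for q in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((d, list(clusters)))
    return history


def cut_at_elbow(n_docs, history):
    """Keep the clustering that precedes the largest jump in merge distance.

    Assumption: this simple jump rule stands in for elbow point analysis.
    """
    if len(history) < 2:
        return [[i] for i in range(n_docs)]
    distances = [d for d, _ in history]
    jumps = [distances[k] - distances[k - 1] for k in range(1, len(distances))]
    rejected_merge = jumps.index(max(jumps)) + 1
    return sorted(history[rejected_merge - 1][1])


def cluster_documents(docs):
    """Cluster documents and return lists of document indices."""
    vectors = tfidf_vectors(docs)
    return cut_at_elbow(len(docs), agglomerate(vectors))


if __name__ == "__main__":
    headlines = [
        "election vote results parliament",
        "election vote parliament coalition",
        "earthquake rescue teams survivors",
        "earthquake rescue survivors aid",
    ]
    print(cluster_documents(headlines))
```

The sketch recomputes linkage distances at every step, which is acceptable for a few dozen search results but would need a more efficient update scheme for larger collections.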


Information overload · Search result clustering · Subtopic discovery · Information retrieval · User query · Second-stage noise removal


Supplementary material

530_2019_638_MOESM1_ESM.mp4 (2.8 mb)
Supplementary material 1 (MP4 2889 kb)



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
