Automatic Web News Extraction Based on DS Theory Considering Content Topics

  • Kaihang Zhang
  • Chuang Zhang
  • Xiaojun Chen
  • Jianlong Tan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10860)


In addition to the news content, most news web pages contain various kinds of noise, such as advertisements, recommendations, and navigation panels. This noise can hamper studies and applications that depend on accurately extracted news content as a pre-processing step. Existing methods of news content extraction rely mostly on non-content features, such as tag path, text layout, and DOM structure. However, because they do not consider the topics of the news content, these methods have difficulty recognizing noise whose external characteristics resemble those of the news content. In this paper, we propose a method that uses Dempster-Shafer (DS) theory to combine non-content features with a topic feature and thereby increase recognition accuracy. We use maximal compatibility blocks to generate topics from text nodes and then obtain the feature values of the topics. Each feature is converted into evidence for DS theory, which supports the fusion of uncertain information. Experimental results on English and Chinese web pages show that incorporating the topic feature through DS theory clearly improves extraction performance.
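The fusion step mentioned in the abstract can be illustrated with Dempster's rule of combination, the standard way to merge two bodies of evidence in DS theory. The sketch below is not the authors' implementation: the function name `combine`, the two-hypothesis frame (`content` vs. `noise`), and all mass values are assumptions chosen for illustration, and the abstract does not say how feature values are actually converted into masses.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass,
    and the masses in each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the common hypotheses.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint hypotheses contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("evidence is totally conflicting; rule is undefined")
    # Renormalize by 1 - K so the fused masses sum to 1 again.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


# Hypothetical frame of discernment for a single text node.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left on the whole frame expresses uncertainty

# Made-up evidence from a non-content (layout) feature and a topic feature.
layout_evidence = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
topic_evidence = {CONTENT: 0.7, NOISE: 0.2, EITHER: 0.1}

fused = combine(layout_evidence, topic_evidence)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

With these illustrative numbers, the two sources agree more than they conflict, so the fused mass on `content` ends up higher than either source assigns alone. More than two features can be fused by applying `combine` repeatedly, since Dempster's rule is associative and commutative.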


Content extraction, Dempster-Shafer theory, Maximal compatibility blocks, Information fusion



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Kaihang Zhang (1, 2)
  • Chuang Zhang (1)
  • Xiaojun Chen (1)
  • Jianlong Tan (1)
  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
