Improving the Categorization of Web Sites by Analysis of Html-Tags Statistics to Block Inappropriate Content

  • Dmitry Novozhilov
  • Igor Kotenko
  • Andrey Chechulin
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 616)

Abstract

The paper addresses the problem of improving the quality of web site categorization using data mining methods. This problem is important for automated parental control systems, whose purpose is to protect users from unwanted or inappropriate information. The novelty of the proposed approach lies in using HTML tag statistics of web pages to improve the categorization of sites that are similar in textual content but differ in their structural features. The paper describes the architecture of the categorization system, its operating algorithm, the experimental results, and an assessment of classification quality.
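
To illustrate the kind of structural feature the abstract refers to, the following is a minimal sketch that turns a page into per-tag relative frequencies. It assumes relative tag frequency is the statistic of interest; the abstract does not specify the paper's actual feature set or classifier, and the names `TagCounter` and `tag_statistics` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        self.counts[tag] += 1


def tag_statistics(html):
    """Return the relative frequency of each tag in the page.

    Assumption: relative frequency is only one possible statistic;
    the paper's own features may differ.
    """
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in parser.counts.items()}


if __name__ == "__main__":
    page = "<html><body><p>a</p><p>b</p><img src='x'><a href='#'>l</a></body></html>"
    for tag, share in sorted(tag_statistics(page).items()):
        print(f"{tag}: {share:.3f}")
```

Two pages with near-identical text (for example, an image gallery and a news article on the same topic) would typically yield different tag distributions, and a vector like this could be passed to any standard classifier alongside textual features.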

Notes

Acknowledgment

This research is being supported by The Ministry of Education and Science of The Russian Federation (contract # 14.604.21.0147, unique contract identifier RFMEFI60414X0147).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Dmitry Novozhilov (1)
  • Igor Kotenko (1)
  • Andrey Chechulin (1)
  1. Laboratory of Computer Security Problems, St. Petersburg Institute for Informatics and Automation (SPIIRAS), St. Petersburg, Russia
