2-Way Text Classification for Harmful Web Documents

  • Youngsoo Kim
  • Taekyong Nam
  • Dongho Won
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3981)


The openness of the Web allows any user to access almost any type of information. However, some information, such as adult content, is not appropriate for all users, notably children. Even for adults, some content found on abnormal pornographic sites can harm mental health. In this paper, we propose an efficient 2-way text filter for blocking harmful web documents and also present a new criterion for clear classification. The filter first screens out 0-grade web texts, which contain no harmful words, by pattern matching against harmful-word dictionaries. It then classifies the remaining texts as 1-grade, 2-grade, or 3-grade using a machine learning algorithm.
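The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `HARMFUL_WORDS` dictionary, the tokenizer, and in particular `toy_grade_classifier`, which is a simple density rule standing in for the learned classifier. The abstract does not specify the dictionaries, the features, or the learning algorithm, so this is not the authors' implementation.

```python
import re
from typing import Callable, List, Set

# Hypothetical dictionary: the paper's harmful-word dictionaries are not given here.
HARMFUL_WORDS: Set[str] = {"badword", "worseword", "worstword"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def toy_grade_classifier(tokens: List[str]) -> int:
    """Placeholder for the learned classifier (an assumption, not the paper's model).

    Assigns grade 1, 2, or 3 from the share of tokens found in the dictionary.
    The thresholds are arbitrary and only make the sketch runnable.
    """
    hits = sum(tok in HARMFUL_WORDS for tok in tokens)
    density = hits / len(tokens)
    if density < 0.1:
        return 1
    if density < 0.3:
        return 2
    return 3


def two_way_filter(
    text: str,
    classifier: Callable[[List[str]], int] = toy_grade_classifier,
    dictionary: Set[str] = HARMFUL_WORDS,
) -> int:
    """Return a grade from 0 to 3 for a web text."""
    tokens = tokenize(text)
    # Way 1: dictionary pattern matching. Texts with no harmful word are 0-grade
    # and never reach the classifier.
    if not any(tok in dictionary for tok in tokens):
        return 0
    # Way 2: only texts containing at least one harmful word are graded 1 to 3.
    return classifier(tokens)
```

In this arrangement the cheap dictionary check handles texts with no harmful words, so the more expensive classifier runs only on the remainder. A trained model could be passed in through the `classifier` parameter in place of the placeholder.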


Feature Selection Algorithm, Pattern Match Algorithm, Meta Search Engine, High Term Frequency, User Dictionary
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Youngsoo Kim (1, 2)
  • Taekyong Nam (1)
  • Dongho Won (2)
  1. Network Security Group, Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea
  2. Information Security Group, School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Gyeonggi-do, Korea
