Web Document Classification by Keywords Using Random Forests

  • Myungsook Klassen
  • Nikhila Paturi
Part of the Communications in Computer and Information Science book series (CCIS, volume 88)


A web directory hierarchy is critical for serving users' search requests. Creating and maintaining such directories without the involvement of human experts requires accurate classification of web documents. In this paper, we explore web page classification using keywords from documents as attributes and random forests as the learning method. Our initial results are promising: random forests performed better than several other well-known learning methods. When the number of topics increased from five to seven, random forests still performed better than the other methods, even though absolute classification rates decreased.
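To make the idea of "keywords as attributes" concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary keyword-presence attributes, Gini-based splits, bootstrap sampling, and majority voting; the vocabulary, the toy documents, and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def keyword_vector(text, vocabulary):
    """Binary attributes: 1 if the keyword occurs in the document, else 0."""
    words = set(text.lower().split())
    return [1 if kw in words else 0 for kw in vocabulary]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one tree; each node considers only n_try randomly chosen keywords."""
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority(labels)  # leaf: a topic label
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        absent = [i for i, x in enumerate(rows) if x[f] == 0]
        present = [i for i, x in enumerate(rows) if x[f] == 1]
        if not absent or not present:
            continue  # this keyword does not split the node
        score = sum(len(idx) * gini([labels[i] for i in idx])
                    for idx in (absent, present)) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, absent, present)
    if best is None:
        return majority(labels)
    _, f, absent, present = best
    children = [build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                           n_try, rng, depth + 1, max_depth)
                for idx in (absent, present)]
    return (f, children[0], children[1])  # (keyword index, absent, present)

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the documents."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in sample],
                                 [labels[i] for i in sample], n_try, rng))
    return forest

def predict(forest, x):
    """Classify one keyword vector by majority vote over all trees."""
    votes = []
    for node in forest:
        while isinstance(node, tuple):
            node = node[2] if x[node[0]] == 1 else node[1]
        votes.append(node)
    return majority(votes)

# Hypothetical vocabulary and toy documents (not data from the paper).
vocabulary = ["goal", "match", "team", "atom", "energy", "physics"]
docs = [
    ("the team scored a late goal", "sports"),
    ("a close match for the home team", "sports"),
    ("the goal decided the match", "sports"),
    ("energy levels of the atom", "science"),
    ("physics of energy transfer", "science"),
    ("an atom model in physics", "science"),
]
rows = [keyword_vector(text, vocabulary) for text, _ in docs]
labels = [topic for _, topic in docs]
forest = train_forest(rows, labels)
prediction = predict(forest, keyword_vector("the team won the match", vocabulary))
```

In practice one would more likely rely on an established random forest implementation and a much larger keyword vocabulary; the sketch only shows how keyword attributes feed an ensemble of randomized trees.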


web document classification, random forests, data mining, keywords, topics, web directory





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Myungsook Klassen (1)
  • Nikhila Paturi (1)
  1. Computer Science Department, California Lutheran University, Thousand Oaks, USA
