Importance-Based Web Page Classification Using Cost-Sensitive SVM

  • Wei Liu
  • Gui-rong Xue
  • Yong Yu
  • Hua-jun Zeng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3739)


Web page classification faces great challenges because the Web is a huge and highly diverse repository of information. Web pages vary in both content and quality, as PageRank suggests. Typical machine learning algorithms train a classifier from positive and negative examples; however, they neglect the fact that each instance can carry a different weight, which may be predefined by the user. This paper presents an effective algorithm based on the Cost-Sensitive Support Vector Machine (CS-SVM) to improve classification accuracy. During CS-SVM training, different cost factors are attached to the training errors to generate an optimized hyperplane. Our experiments show that CS-SVM outperforms SVM on the standard ODP data set. Web pages with relatively high PageRank values contribute most to the classifier, and training on them can outperform training on a randomly sampled set.
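The idea of attaching a per-instance cost factor to each training error can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear soft-margin SVM whose objective is 0.5*||w||^2 + C * sum_i c_i * max(0, 1 - y_i (w.x_i + b)), minimised by plain batch subgradient descent. The function names, the learning rate, the epoch count, and the cost values c_i (hypothetical importance scores, for example normalised PageRank values) are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def train_cost_sensitive_svm(X, y, costs, C=1.0, epochs=500, lr=0.01):
    """Batch subgradient descent on a cost-weighted hinge-loss objective.

    X      list of feature vectors
    y      list of labels in {+1, -1}
    costs  per-instance cost factors (hypothetical importance scores)
    """
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5 * ||w||^2
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, costs):
            # Only instances inside the margin contribute a hinge subgradient,
            # and each contribution is scaled by that instance's cost factor.
            if yi * (dot(w, xi) + b) < 1.0:
                for j in range(d):
                    grad_w[j] -= C * ci * yi * xi[j]
                grad_b -= C * ci * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0.0 else -1


# Two instances at the same location with conflicting labels:
# the hyperplane sides with whichever instance carries the larger cost.
X_conflict = [[1.0], [1.0]]
y_conflict = [1, -1]
w_pos, b_pos = train_cost_sensitive_svm(X_conflict, y_conflict, costs=[5.0, 1.0])
w_neg, b_neg = train_cost_sensitive_svm(X_conflict, y_conflict, costs=[1.0, 5.0])
print(predict(w_pos, b_pos, [1.0]), predict(w_neg, b_neg, [1.0]))
```

With equal costs the sketch reduces to an ordinary soft-margin linear SVM; unequal costs make errors on high-cost instances more expensive, so the hyperplane shifts to classify those instances correctly first.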


Support Vector Machine; Little Square Support Vector Machine; Random Sampling Technique; Sequential Minimal Optimization; Soft Margin
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Wei Liu (1)
  • Gui-rong Xue (1)
  • Yong Yu (2)
  • Hua-jun Zeng (3)
  1. Shanghai Jiao Tong University, Min Hang, Shanghai, China
  2. Computer Science Department, Shanghai Jiao Tong University, Shanghai, China
  3. Microsoft Research Asia, Beijing, China
