Classification of XSLT-Generated Web Documents with Support Vector Machines

  • Atakan Kurt
  • Engin Tozal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3915)


XSLT is a transformation language mainly used for converting XML documents to HTML or other formats. Due to its simplicity and flexibility, XML has replaced traditional EDI file formats. Most e-business applications store data in XML, convert the XML into HTML using XSLT, and publish the HTML documents to the web. In this paper we argue that the use of XSLT presents an opportunity rather than a challenge for web document classification. We show that the advantages of both HTML and XML can be combined by classifying documents at the XSLT transformation stage, an approach we call XSLT classification, to attain higher classification rates using Support Vector Machines (SVM). The results are both expected and promising. We believe that XSLT classification can become preferable to HTML or XML classification wherever XSLT stylesheets are available.
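As a rough illustration of how structural information from XML might be carried into a text classifier, the sketch below builds bag-of-words features in which each token is prefixed with the tag of its enclosing element. This is only an assumption about what "combining structure and content" could look like, not the authors' method: the sample record, the element names, and the function `structure_tagged_features` are hypothetical, and the SVM training step itself is omitted.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample record; the element names are illustrative only.
SAMPLE_XML = """
<product>
  <name>Wireless mouse</name>
  <description>Compact wireless mouse with USB receiver</description>
</product>
""".strip()


def structure_tagged_features(xml_text):
    """Count tokens, each prefixed with the tag of its enclosing element.

    Only the direct text of each element is used; tail text that follows
    a child element is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        text = (element.text or "").lower()
        for token in re.findall(r"[a-z0-9]+", text):
            counts[f"{element.tag}:{token}"] += 1
    return counts


features = structure_tagged_features(SAMPLE_XML)
```

With features of this kind, the word "wireless" in a `name` element and the same word in a `description` element become distinct dimensions, which a plain HTML bag-of-words would merge. The resulting counts could then be vectorized and passed to any SVM implementation.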


Support Vector Machine; Document Type Definition; Wireless Application Protocol; High Classification Rate; Structural Model Classification
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Atakan Kurt (1)
  • Engin Tozal (1)
  1. Computer Eng. Dept., Fatih University, Turkey
