Classification of XSLT-Generated Web Documents with Support Vector Machines

Kurt, Atakan; Tozal, Engin

doi:10.1007/11730262_6

Atakan Kurt & Engin Tozal

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3915)

Abstract

XSLT is a transformation language mainly used for converting XML documents to HTML or other formats. Due to its simplicity and flexibility, XML has replaced traditional EDI file formats. Most e-business applications store data in XML, convert XML into HTML using XSLT, and publish the HTML documents to the web. In this paper we argue that the use of XSLT presents an opportunity rather than a challenge for web document classification. We show that it is possible to combine the advantages of both HTML and XML classification by classifying documents at the XSLT transformation stage, an approach named XSLT classification, to attain higher classification rates using Support Vector Machines (SVM). The results are both expected and promising. We believe that XSLT classification can become a preferred classification method over HTML or XML classification where XSLT stylesheets are available.
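
The abstract does not say how features are built at the transformation stage, so the following is only a minimal sketch of the general idea under stated assumptions. It pairs each word with both the XML element it came from and the HTML tag it is rendered into, which is the kind of combined information available while a stylesheet is being applied. The `TEMPLATE_RULES` table, the `xslt_stage_features` function, and the `source>target:word` feature naming are all hypothetical; a real system would run an actual XSLT processor and pass the resulting vectors to an SVM library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for an XSLT stylesheet: maps an XML element name to
# the HTML tag its text is rendered into. This is an assumption made for
# illustration, not the paper's method and not a real XSLT processor.
TEMPLATE_RULES = {"title": "h1", "summary": "p", "price": "span"}


def xslt_stage_features(xml_text):
    """Count words, each labeled with its XML source element and HTML target tag."""
    root = ET.fromstring(xml_text)
    features = Counter()
    for element in root.iter():
        html_tag = TEMPLATE_RULES.get(element.tag)
        text = (element.text or "").strip()
        if html_tag is None or not text:
            continue  # element is not rendered, so it contributes no features
        for word in text.lower().split():
            features[f"{element.tag}>{html_tag}:{word}"] += 1
    return features


doc = (
    "<product>"
    "<title>Red Bicycle</title>"
    "<summary>A fast red bicycle</summary>"
    "<sku>X1</sku>"
    "</product>"
)
features = xslt_stage_features(doc)
for name, count in sorted(features.items()):
    print(name, count)
```

In this sketch the word "red" yields two distinct features, one for the title rendered as a heading and one for the summary rendered as a paragraph, while the unrendered `sku` element is ignored. An HTML-only view would lose the source element names, and an XML-only view would not know which elements are actually published.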

Author information

Authors and Affiliations

Computer Eng. Dept., Fatih University, Istanbul, Turkey
Atakan Kurt & Engin Tozal

Editor information

Editors and Affiliations

Information Technology, Queensland University of Technology, Brisbane, Australia
Richi Nayak
Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki

Cite this paper

Kurt, A., Tozal, E. (2006). Classification of XSLT-Generated Web Documents with Support Vector Machines. In: Nayak, R., Zaki, M.J. (eds) Knowledge Discovery from XML Documents. KDXD 2006. Lecture Notes in Computer Science, vol 3915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11730262_6

DOI: https://doi.org/10.1007/11730262_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33180-3
Online ISBN: 978-3-540-33181-0
eBook Packages: Computer Science, Computer Science (R0)
