Abstract
This paper introduces an efficient classification approach for XML documents that combines the content and the structure of the documents to compute the similarity between categories and documents. It is based on the Support Vector Machine (SVM) algorithm and the Structured Link Vector Model (SLVM), which uses closed frequent subtrees as its structural units. A document tree pruning strategy is applied to improve the classification system, and link information between documents is taken into account to obtain better classification results. Experiments on the INEX XML mining data sets that combine these techniques show that our approach outperforms all competing approaches on XML classification.
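The idea of combining content with structure when computing similarity can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that the structural unit of a term is the tag path of its enclosing element (the paper uses mined closed frequent subtrees instead), and it uses plain cosine similarity rather than an SVM. The names `unit_term_counts` and `cosine` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def unit_term_counts(xml_text):
    """Count (structural unit, term) pairs for one XML document.

    Assumption: the structural unit is the element's tag path,
    a stand-in for the closed frequent subtrees used in the paper.
    """
    counts = Counter()

    def walk(node, path):
        unit = path + "/" + node.tag
        # Only the element's leading text (node.text) is attributed to its unit.
        for term in (node.text or "").lower().split():
            counts[(unit, term)] += 1
        for child in node:
            walk(child, unit)

    walk(ET.fromstring(xml_text), "")
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# doc_c contains exactly the same words as doc_a, but in different elements.
doc_a = "<article><title>xml mining</title><body>tree mining</body></article>"
doc_c = "<article><title>tree mining</title><body>xml mining</body></article>"

vec_a = unit_term_counts(doc_a)
vec_c = unit_term_counts(doc_c)
print(cosine(vec_a, vec_a))
print(cosine(vec_a, vec_c))
```

Under a pure bag-of-words model the two documents would be identical; once each term is tied to its structural unit, their similarity drops below 1, which is the kind of signal a structure-aware classifier can exploit.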
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, S., Hong, Y., Yang, J. (2012). XML Document Classification Using Closed Frequent Subtree. In: Bao, Z., et al. Web-Age Information Management. WAIM 2012. Lecture Notes in Computer Science, vol 7419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33050-6_34
DOI: https://doi.org/10.1007/978-3-642-33050-6_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33049-0
Online ISBN: 978-3-642-33050-6
eBook Packages: Computer Science, Computer Science (R0)