XML Document Classification Using Closed Frequent Subtree

Wang, Songlin; Hong, Yihong; Yang, Jianwu

doi:10.1007/978-3-642-33050-6_34


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7419)

Abstract

This paper introduces an efficient classification approach for XML documents that combines the content and the structure of the documents to compute the similarity between categories and documents. The approach is based on the Support Vector Machine (SVM) algorithm and the Structured Link Vector Model (SLVM), with closed frequent subtrees used as the structural units. A document tree pruning strategy is applied to improve the classification system, and the link information between documents is taken into account to obtain better classification results. In experiments on the INEX XML mining data sets that combine these techniques, our approach outperformed all competing approaches on XML classification.
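To make the notion of a "closed frequent" structural unit concrete, the following is a minimal sketch and not the authors' implementation. It assumes a strong simplification: downward tag paths (parent-child chains) stand in for general subtrees, and support is counted per document. A path is kept as closed when it is frequent and no longer path containing it has the same support. The function names `tag_paths`, `is_subpath`, and `closed_frequent_paths`, as well as the sample documents, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of downward tag paths (parent-child chains) in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, open_paths):
        # Extend every chain that ends at the parent, and start a new chain here.
        current = [p + (node.tag,) for p in open_paths] + [(node.tag,)]
        paths.update(current)
        for child in node:
            walk(child, current)

    walk(root, [])
    return paths


def is_subpath(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def closed_frequent_paths(docs, min_support):
    """Map each closed frequent path to its document support."""
    support = Counter()
    for doc in docs:
        support.update(tag_paths(doc))  # each path counted once per document

    frequent = {p: s for p, s in support.items() if s >= min_support}

    closed = {}
    for p, s in frequent.items():
        # A path is not closed if a strictly longer path with equal support contains it.
        absorbed = any(
            len(q) > len(p) and frequent[q] == s and is_subpath(p, q)
            for q in frequent
        )
        if not absorbed:
            closed[p] = s
    return closed


# Hypothetical toy collection, for illustration only.
docs = [
    "<article><body><sec><p>x</p></sec></body></article>",
    "<article><body><sec><p>y</p></sec></body></article>",
    "<article><body><p>z</p></body></article>",
]
closed = closed_frequent_paths(docs, min_support=2)
```

On this toy collection, ten paths are frequent at `min_support=2`, but only three are closed: `("article", "body")` and `("p",)` with support 3, and `("article", "body", "sec", "p")` with support 2. In a pipeline like the one the abstract describes, such closed units would presumably serve as the structural dimension of the document representation before an SVM is trained; that step is not shown here.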



Author information

Authors and Affiliations

Institute of Computer Sci. & Tech., Peking University, Beijing, 100871, China
Songlin Wang, Yihong Hong & Jianwu Yang


Editor information

Editors and Affiliations

School of Computing, National University of Singapore, Singapore
Zhifeng Bao
College of Computer Science and Technology, Zhejiang University, 38 ZheDa Road, 310027, Hangzhou, China
Yunjun Gao
Northeastern University, Shenyang, China
Yu Gu
Heilongjiang University, 150080, Harbin, China
Longjiang Guo
Department of Computer Science, Georgia State University, 34 Peachtree Street, Suite 1413, 30303, Atlanta, GA, USA
Yingshu Li
Renmin University of China, Beijing, China
Jiaheng Lu
School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
Zujie Ren
School of Software, Tsinghua University, Beijing, China
Chaokun Wang
School of Information, Renmin University of China, 100872, Beijing, China
Xiao Zhang


About this paper

Cite this paper

Wang, S., Hong, Y., Yang, J. (2012). XML Document Classification Using Closed Frequent Subtree. In: Bao, Z., et al. Web-Age Information Management. WAIM 2012. Lecture Notes in Computer Science, vol 7419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33050-6_34


DOI: https://doi.org/10.1007/978-3-642-33050-6_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33049-0
Online ISBN: 978-3-642-33050-6
eBook Packages: Computer Science, Computer Science (R0)
