
XML Document Classification Using Closed Frequent Subtree

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7419)

Abstract

This paper introduces an efficient classification approach for XML documents that combines the content and the structure of the documents to compute the similarity between categories and documents. The approach is based on the Support Vector Machine (SVM) algorithm and the Structured Link Vector Model (SLVM), which uses closed frequent subtrees as its structural units. A document tree pruning strategy is applied to improve the classification system, and the link information between documents is taken into account to obtain better classification results. Experiments on the INEX XML mining data sets that combine these techniques show that the approach outperforms the competing approaches to XML classification.
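The notion of a closed frequent structural unit can be illustrated with a minimal sketch. It is not the method of the paper: as a simplifying assumption it uses root-to-node tag paths in place of general subtrees, and the names `root_paths`, `closed_frequent_paths`, `structure_vector`, and `DOCS` are hypothetical. A path is treated as frequent when it occurs in at least `min_support` documents, and as closed when no longer path extending it has the same support.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Toy corpus (assumption, for illustration only).
DOCS = [
    "<article><title>a</title><body><sec>x</sec></body></article>",
    "<article><title>b</title><body><sec>y</sec></body></article>",
    "<article><title>c</title></article>",
]


def root_paths(xml_text):
    """Return the set of root-to-node tag paths found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def closed_frequent_paths(docs, min_support):
    """Return frequent paths that have no extension with equal support."""
    support = Counter()
    for doc in docs:
        support.update(root_paths(doc))  # each path counted once per document
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = []
    for path, count in frequent.items():
        # Support never grows when a path is extended, so checking
        # one-step extensions is enough to detect any equal-support extension.
        has_equal_extension = any(
            len(q) == len(path) + 1 and q[:-1] == path and c == count
            for q, c in frequent.items()
        )
        if not has_equal_extension:
            closed.append(path)
    return sorted(closed)


def structure_vector(doc, units):
    """Binary vector: 1 where the document contains the structural unit."""
    present = root_paths(doc)
    return [1 if unit in present else 0 for unit in units]


units = closed_frequent_paths(DOCS, min_support=2)
print(units)  # [('article', 'body', 'sec'), ('article', 'title')]
print([structure_vector(d, units) for d in DOCS])  # [[1, 1], [1, 1], [0, 1]]
```

Keeping only closed units drops patterns such as `('article',)` and `('article', 'body')`, whose occurrences are fully accounted for by a longer pattern with the same support. In a fuller pipeline, vectors of this kind would be combined with term weights per structural unit and passed to an SVM.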




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, S., Hong, Y., Yang, J. (2012). XML Document Classification Using Closed Frequent Subtree. In: Bao, Z., et al. Web-Age Information Management. WAIM 2012. Lecture Notes in Computer Science, vol 7419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33050-6_34


  • DOI: https://doi.org/10.1007/978-3-642-33050-6_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33049-0

  • Online ISBN: 978-3-642-33050-6

  • eBook Packages: Computer Science, Computer Science (R0)
