Clustering XML Documents Using Frequent Edge-Sets

  • Conference paper
International Conference on Applications and Techniques in Cyber Security and Intelligence (ATCI 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 580)

Abstract

Clustering XML documents is a useful technique for knowledge discovery in XML databases. However, the process is often time-consuming because of the semi-structured nature of the documents. In this paper, we present an efficient algorithm, Frequent Edge-based XML Clustering (FEXC), which clusters XML documents using frequent edge sets. First, we represent each XML document as an edge set and discover the frequent edge sets with a traditional frequent pattern mining approach. Second, for each frequent edge set, we find all the documents containing it and compute a measure called entropy overlap, which indicates how much those documents overlap with the documents containing the other frequent edge sets. Clustering is then performed using the entropy overlap measure. Third, we perform a merging process that removes redundant clusters, thereby reducing the number of clusters. Experimental results show that the proposed method outperforms the traditional distance-based XML clustering algorithm in efficiency without compromising clustering quality.
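
The first two steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that an edge is a (parent tag, child tag) pair, and it replaces the unspecified frequent pattern miner with brute-force enumeration up to a small set size. The function names `edge_set`, `frequent_edge_sets`, and `cover`, and the parameters `min_support` and `max_size`, are hypothetical. The entropy overlap measure and the merging step are left out because the abstract does not define them.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def edge_set(xml_text):
    """Return the set of (parent_tag, child_tag) edges of one XML document.

    Assumption: an edge is a parent/child tag pair; the abstract does not
    give the exact definition.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return frozenset(edges)


def frequent_edge_sets(docs_edges, min_support, max_size=2):
    """Count every edge subset of size <= max_size and keep those that
    occur in at least min_support documents (brute force, not Apriori)."""
    counts = Counter()
    for edges in docs_edges:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(edges), k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def cover(docs_edges, edge_subset):
    """Indices of the documents whose edge set contains edge_subset."""
    return {i for i, edges in enumerate(docs_edges) if edge_subset <= edges}


# Three toy documents; the first two share the edge ("a", "b").
docs = [edge_set(x) for x in (
    "<a><b/><c/></a>",
    "<a><b><d/></b></a>",
    "<x><y/></x>",
)]
frequent = frequent_edge_sets(docs, min_support=2)
for subset in frequent:
    print(sorted(subset), cover(docs, subset))
```

Each frequent edge set together with its cover is a candidate cluster; a selection criterion such as the entropy overlap described above would then decide which candidates to keep.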

Corresponding author

Correspondence to Zhiyuan Jin.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Jin, Z., Wang, L., Chang, Y. (2018). Clustering XML Documents Using Frequent Edge-Sets. In: Abawajy, J., Choo, KK., Islam, R. (eds) International Conference on Applications and Techniques in Cyber Security and Intelligence. ATCI 2017. Advances in Intelligent Systems and Computing, vol 580. Springer, Cham. https://doi.org/10.1007/978-3-319-67071-3_50

  • DOI: https://doi.org/10.1007/978-3-319-67071-3_50

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67070-6

  • Online ISBN: 978-3-319-67071-3

  • eBook Packages: Engineering, Engineering (R0)
