Clustering XML Documents Using Frequent Edge-Sets

Jin, Zhiyuan; Wang, Le; Chang, Yanfen

doi:10.1007/978-3-319-67071-3_50

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 580)

Included in the following conference series:

International Conference on Applications and Techniques in Cyber Security and Intelligence

Abstract

Clustering of XML documents is a useful technique for knowledge discovery in XML databases. However, clustering XML documents is time-consuming because of their semi-structured nature. In this paper, we present an efficient clustering algorithm called Frequent Edge-based XML Clustering (FEXC), which clusters XML documents using frequent edge sets. First, we represent XML documents as edge sets and discover the frequent edge sets for each document with a traditional frequent pattern mining approach. Second, for each frequent edge set, we find all the documents containing it and compute a measure called entropy overlap, which indicates how much those documents overlap with the documents containing the other frequent edge sets. Clustering is then performed using the entropy overlap measure. Third, we perform a merging process that removes redundant clusters, thereby reducing the number of clusters. Experimental results show that the proposed method outperforms the traditional distance-based XML clustering algorithm in efficiency without compromising clustering quality.
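The first two steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' FEXC implementation, and several choices are assumptions made only for illustration: an edge is taken to be a (parent tag, child tag) pair, frequent edge sets are found by brute-force enumeration rather than a real frequent pattern miner, and the function names (edge_set, frequent_edge_sets, entropy_overlap) are hypothetical. The abstract does not define entropy overlap, so the formula below borrows the form used in frequent term-based text clustering (the sum over a cover's documents of -(1/f) ln(1/f), where f is the number of frequent edge sets a document supports); the paper's actual definition may differ.

```python
import math
import xml.etree.ElementTree as ET
from itertools import combinations


def edge_set(xml_text):
    """Return the set of (parent_tag, child_tag) edges of one XML document."""
    root = ET.fromstring(xml_text)
    edges = set()
    for parent in root.iter():      # every element, including the root
        for child in parent:        # direct children only
            edges.add((parent.tag, child.tag))
    return frozenset(edges)


def frequent_edge_sets(doc_edges, min_support, max_size=2):
    """Map each frequent edge set to its cover (indices of documents containing it).

    Brute-force enumeration up to max_size edges; a stand-in for a
    traditional frequent pattern miner, suitable for tiny inputs only.
    """
    all_edges = sorted(set().union(*doc_edges))
    covers = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(all_edges, size):
            cand = frozenset(candidate)
            cover = frozenset(
                i for i, edges in enumerate(doc_edges) if cand <= edges
            )
            if len(cover) >= min_support:
                covers[cand] = cover
    return covers


def entropy_overlap(cover, all_covers):
    """Assumed definition: sum of -(1/f) * ln(1/f) over the documents in cover,
    where f is the number of covers that contain the document."""
    total = 0.0
    for doc in cover:
        f = sum(1 for c in all_covers if doc in c)
        total += -(1.0 / f) * math.log(1.0 / f)
    return total


# Three toy documents: two book records and one article record.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<article><title/><journal/></article>",
]
doc_edges = [edge_set(d) for d in docs]
covers = frequent_edge_sets(doc_edges, min_support=2)
for edges, cover in covers.items():
    score = entropy_overlap(cover, list(covers.values()))
    print(sorted(edges), sorted(cover), round(score, 3))
```

In this toy collection only the two book documents share edges, so every frequent edge set has the same cover. A greedy clustering step in this spirit would repeatedly pick the cover with the lowest overlap score as the next cluster, and a later merge step would collapse redundant clusters such as these identical covers.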

Author information

Authors and Affiliations

College of Computer Science, Dahongying University, Ningbo, 315175, China
Zhiyuan Jin, Le Wang & Yanfen Chang

Corresponding author

Correspondence to Zhiyuan Jin.

Editor information

Editors and Affiliations

Faculty of Science, Engineering and Built Environment, Deakin University, Geelong, Victoria, Australia
Jemal Abawajy
Department of Information Systems and Cyber Security, The University of Texas at San Antonio, San Antonio, Texas, USA
Kim-Kwang Raymond Choo
School of Computing and Mathematics, Charles Sturt University, Albury, New South Wales, Australia
Rafiqul Islam

Cite this paper

Jin, Z., Wang, L., Chang, Y. (2018). Clustering XML Documents Using Frequent Edge-Sets. In: Abawajy, J., Choo, KK., Islam, R. (eds) International Conference on Applications and Techniques in Cyber Security and Intelligence. ATCI 2017. Advances in Intelligent Systems and Computing, vol 580. Springer, Cham. https://doi.org/10.1007/978-3-319-67071-3_50

DOI: https://doi.org/10.1007/978-3-319-67071-3_50
Published: 21 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67070-6
Online ISBN: 978-3-319-67071-3
eBook Packages: Engineering, Engineering (R0)
