Structure-Oriented Techniques for XML Document Partitioning

Costa, Gianni; Ortale, Riccardo

doi:10.1007/978-3-319-14194-7_9

Part of the book series: Studies in Computational Intelligence (SCI, volume 586)

Abstract

Clustering XML documents on the basis of only one type of structural component may produce clusters with some degree of inner structural inhomogeneity, due either to uncaptured differences in the overall logical structures of the available XML documents or to an inappropriate choice of the targeted structural component. To overcome these limitations, two approaches to clustering XML documents by multiple heterogeneous structures are proposed. One approach looks at the simultaneous occurrences of such structures across the individual XML documents. The other approach instead combines multiple clusterings of the XML documents, each performed separately with respect to one individual type of structure in isolation. A comparative evaluation over both real and synthetic XML data showed that the effectiveness of the devised approaches is at least on a par with, and at times superior to, that of state-of-the-art competitors. The empirical evidence also reveals that the proposed approaches outperform these competitors in terms of time efficiency.
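The second approach, combining clusterings obtained separately for each type of structure, can be illustrated with a generic consensus scheme. The sketch below is not the authors' algorithm, which the abstract does not detail. It assumes a simple co-association rule: two documents end up in the same final cluster when they share a cluster in more than a given fraction of the individual clusterings. The function names, the threshold, and the toy label lists are all hypothetical.

```python
from itertools import combinations


def co_association(clusterings):
    """Map each document pair to the fraction of clusterings that group it together."""
    n = len(clusterings[0])
    counts = {}
    for labels in clusterings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}


def consensus(clusterings, threshold=0.5):
    """Merge documents whose co-association exceeds the (assumed) threshold."""
    n = len(clusterings[0])
    parent = list(range(n))  # union-find forest over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), fraction in co_association(clusterings).items():
        if fraction > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for doc in range(n):
        groups.setdefault(find(doc), []).append(doc)
    return sorted(groups.values())


# Hypothetical cluster labels for five documents, one list per structure type.
by_paths = [0, 0, 1, 1, 1]
by_tags = [0, 0, 0, 1, 1]
by_edges = [0, 0, 1, 1, 2]

print(consensus([by_paths, by_tags, by_edges]))
```

With these toy labels, documents 0 and 1 are grouped together by all three clusterings, while documents 2, 3 and 4 are linked through pairs that agree in two out of three, so the consensus yields two clusters. A stricter threshold would split the second group.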

Author information

Authors and Affiliations

ICAR-CNR, Via P. Bucci 41C, 87036, Rende, CS, Italy
Gianni Costa & Riccardo Ortale

Corresponding author

Correspondence to Riccardo Ortale.

Editor information

Editors and Affiliations

Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
Mincho Hadjiski
KEDRI – Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand
Nikola Kasabov
Research & Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA
Dimitar Filev
University of Library Studies and Information Technologies, Sofia, Bulgaria
Vladimir Jotsov

About this chapter

Cite this chapter

Costa, G., Ortale, R. (2016). Structure-Oriented Techniques for XML Document Partitioning. In: Hadjiski, M., Kasabov, N., Filev, D., Jotsov, V. (eds) Novel Applications of Intelligent Systems. Studies in Computational Intelligence, vol 586. Springer, Cham. https://doi.org/10.1007/978-3-319-14194-7_9

DOI: https://doi.org/10.1007/978-3-319-14194-7_9
Published: 28 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14193-0
Online ISBN: 978-3-319-14194-7
eBook Packages: Engineering, Engineering (R0)
