Abstract
Focusing on only one type of structural component when clustering XML documents may produce clusters with some degree of internal structural inhomogeneity, due either to uncaught differences in the overall logical structures of the available XML documents or to an inappropriate choice of the targeted structural component. To overcome these limitations, two approaches to clustering XML documents by multiple heterogeneous structures are proposed. The first approach looks at the simultaneous occurrences of such structures across the individual XML documents. The second instead combines multiple clusterings of the XML documents, each performed separately with respect to one individual type of structure in isolation. A comparative evaluation over both real and synthetic XML data shows that the effectiveness of the devised approaches is at least on a par with, and even superior to, that of state-of-the-art competitors. The empirical evidence also reveals that the proposed approaches outperform such competitors in terms of time efficiency.
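To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the chapter's algorithms. It assumes two illustrative structure types (root-to-leaf tag paths and parent-child tag pairs), Jaccard similarity, and a simple threshold-based single-link grouping; all function names, the toy documents, and the thresholds are hypothetical. The first variant clusters on the co-occurring structures of both types at once, and the second clusters on each type separately and then keeps only the pairings on which the separate clusterings agree.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def root_to_leaf_paths(root):
    """Set of tag paths from the root to each leaf element (assumed structure type 1)."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def parent_child_edges(root):
    """Set of (parent tag, child tag) pairs (assumed structure type 2)."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def link_clusters(n, linked):
    """Single-link grouping of n items; linked(i, j) says whether i and j are joined."""
    labels = list(range(n))
    for i, j in combinations(range(n), 2):
        if linked(i, j):
            old, new = labels[j], labels[i]
            labels = [new if label == old else label for label in labels]
    return labels


def cluster_by_features(features, threshold):
    """Join documents whose feature sets have Jaccard similarity >= threshold."""
    return link_clusters(
        len(features), lambda i, j: jaccard(features[i], features[j]) >= threshold
    )


# Hypothetical toy corpus: two book-like and two article-like documents.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<article><title/><journal/></article>",
    "<article><title/><journal/><year/></article>",
]
roots = [ET.fromstring(d) for d in docs]
paths = [root_to_leaf_paths(r) for r in roots]
edges = [parent_child_edges(r) for r in roots]

# Variant 1: cluster on the structures of both types occurring together in each document.
combined = [
    {("path", p) for p in ps} | {("edge", e) for e in es}
    for ps, es in zip(paths, edges)
]
co_occurrence_labels = cluster_by_features(combined, threshold=0.5)

# Variant 2: cluster per structure type, then join only pairs on which
# every separate clustering agrees (a simple co-association consensus).
labelings = [cluster_by_features(paths, 0.5), cluster_by_features(edges, 0.5)]
consensus_labels = link_clusters(
    len(docs), lambda i, j: all(lab[i] == lab[j] for lab in labelings)
)

print(co_occurrence_labels, consensus_labels)
```

On this toy corpus both variants separate the book-like documents from the article-like ones. The unanimity rule in the second variant is a deliberately strict choice for a sketch; a majority vote over more than two clusterings would be a natural relaxation.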
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Costa, G., Ortale, R. (2016). Structure-Oriented Techniques for XML Document Partitioning. In: Hadjiski, M., Kasabov, N., Filev, D., Jotsov, V. (eds) Novel Applications of Intelligent Systems. Studies in Computational Intelligence, vol 586. Springer, Cham. https://doi.org/10.1007/978-3-319-14194-7_9
DOI: https://doi.org/10.1007/978-3-319-14194-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14193-0
Online ISBN: 978-3-319-14194-7
eBook Packages: Engineering, Engineering (R0)