Abstract
This paper reports on the INRIA group's approach to XML mining in the INEX XML Mining track 2005. We use a flexible representation of XML documents that can take into account either the structure alone or both the structure and the content. Our approach represents each XML document by a set of its sub-paths, selected according to criteria such as length, starting at the root, or ending at a leaf. By treating those sub-paths as words, we can apply standard methods for vocabulary reduction and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to large structure-and-content collections.
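As an illustration of the sub-path representation described above, the following is a minimal sketch, not the authors' implementation. It assumes that a sub-path is a contiguous sequence of element tags along a root-to-node path and that text content is ignored (the structure-only case). The function name `sub_paths` and the parameters `min_len`, `max_len`, `from_root` and `to_leaf` are hypothetical names chosen to mirror the criteria listed in the abstract.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def sub_paths(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Count tag sub-paths of an XML document (structure only).

    Hypothetical criteria, modelled on those named in the abstract:
      min_len / max_len : bounds on the number of tags in a sub-path
      from_root         : keep only sub-paths that start at the root
      to_leaf           : keep only sub-paths that end at a leaf element
    """
    counts = Counter()

    def visit(element, ancestors):
        path = ancestors + (element.tag,)
        children = list(element)
        # Emit the sub-paths that END at this node, one per allowed length.
        if not to_leaf or not children:
            for length in range(min_len, min(max_len, len(path)) + 1):
                if from_root and length != len(path):
                    continue  # this suffix does not reach back to the root
                counts["/".join(path[-length:])] += 1
        for child in children:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts


if __name__ == "__main__":
    doc = "<article><title>t</title><sec><p>a</p><p>b</p></sec></article>"
    for name, freq in sorted(sub_paths(doc, min_len=2, max_len=2).items()):
        print(name, freq)
```

Each key of the returned counter can then play the role of a "word", so a collection of documents becomes a bag-of-words matrix to which vocabulary reduction and a clustering method such as k-means can be applied.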
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vercoustre, AM., Fegas, M., Gul, S., Lechevallier, Y. (2006). A Flexible Structured-Based Representation for XML Document Mining. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_34
DOI: https://doi.org/10.1007/978-3-540-34963-1_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34962-4
Online ISBN: 978-3-540-34963-1
eBook Packages: Computer Science (R0)