A Flexible Structured-Based Representation for XML Document Mining

Vercoustre, Anne-Marie; Fegas, Mounir; Gul, Saba; Lechevallier, Yves

doi:10.1007/978-3-540-34963-1_34

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3977)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper reports on the INRIA group’s approach to XML mining in the INEX 2005 XML Mining track. We use a flexible representation of XML documents that can take into account either the structure only or both the structure and the content. Our approach represents each XML document by a set of its sub-paths, selected according to criteria such as length, whether they start at the root, and whether they end at a leaf. By treating those sub-paths as words, we can apply standard methods for vocabulary reduction and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to large structure-and-content collections.
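
As an illustration of the sub-path representation described in the abstract, the following minimal sketch extracts the tag sub-paths of a document and turns a small collection into binary vectors over the resulting vocabulary. It is not the authors' implementation: the function names, the parameters `min_len`, `max_len`, `from_root` and `to_leaf`, and the binary encoding are assumptions made for this example, and text content is ignored (structure only).

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf tag path of the document as a tuple of tags."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:  # a leaf element closes one complete path
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def document_sub_paths(xml_text, min_len=1, max_len=None,
                       from_root=False, to_leaf=False):
    """Collect the contiguous sub-paths that satisfy the selection criteria.

    The criteria are hypothetical stand-ins for "length, root beginning,
    leaf ending"; each sub-path is rendered as a '/'-joined string ("word").
    """
    selected = set()
    for path in root_to_leaf_paths(xml_text):
        n = len(path)
        longest = n if max_len is None else max_len
        for start in range(n):
            if from_root and start != 0:
                break  # only sub-paths that begin at the root are wanted
            for end in range(start + min_len, min(start + longest, n) + 1):
                if to_leaf and end != n:
                    continue  # only sub-paths that end at a leaf are wanted
                selected.add("/".join(path[start:end]))
    return selected


def binary_vectors(docs, **criteria):
    """Map each document to a 0/1 vector over the sorted sub-path vocabulary."""
    bags = [document_sub_paths(doc, **criteria) for doc in docs]
    vocabulary = sorted(set().union(*bags))
    vectors = [[1 if term in bag else 0 for term in vocabulary] for bag in bags]
    return vocabulary, vectors
```

Vectors of this kind could then be passed to any standard clustering routine such as k-means. Restricting `min_len` and `max_len` is one simple way to keep the vocabulary small before applying further vocabulary reduction.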

Author information

Authors and Affiliations

INRIA, Rocquencourt, France
Anne-Marie Vercoustre, Mounir Fegas, Saba Gul & Yves Lechevallier

Editor information

Editors and Affiliations

University of Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Queen Mary, University of London, London, UK
Mounia Lalmas
University Duisburg-Essen, Germany
Saadia Malik
Microsoft Research Cambridge, United Kingdom
Gabriella Kazai

About this paper

Cite this paper

Vercoustre, AM., Fegas, M., Gul, S., Lechevallier, Y. (2006). A Flexible Structured-Based Representation for XML Document Mining. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_34

DOI: https://doi.org/10.1007/978-3-540-34963-1_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34962-4
Online ISBN: 978-3-540-34963-1
eBook Packages: Computer Science (R0)
