Minimization of the Disagreements in Clustering Aggregation

  • Safia Nait Bahloul
  • Baroudi Rouba
  • Youssef Amghar
Part of the Communications in Computer and Information Science book series (CCIS, volume 15)


Several experiments have shown that the choice of which parts of the documents are selected affects the result of the classification and, consequently, the number of queries that the resulting clusters can answer. Aggregation offers a very natural method of data classification: it takes the m classifications produced by the m attributes and tries to produce a classification, called "optimal", that is as close as possible to the m classifications. The optimization consists of minimizing the number of pairs of objects (u, v) such that one classification C places them in the same cluster whereas another classification C’ places them in different clusters. This number corresponds to the concept of disagreements. We propose an approach that exploits the various elements of an XML document participating in various views to give different classifications. These classifications are then aggregated into a single classification that minimizes the number of disagreements. Our approach is divided into two steps. The first step applies the K-means algorithm to the collection of XML documents, considering a different element of the document each time. The second step aggregates the classifications obtained previously to produce the one that minimizes the number of disagreements.
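To make the notion of disagreements concrete, the following is a minimal sketch in Python. It assumes each classification is given as a list of cluster labels, one per object, and the function names (disagreements, total_disagreements, best_input_clustering) are hypothetical names chosen for illustration. The aggregation shown simply picks, among the input classifications, the one with the fewest total disagreements against the inputs; this is a simple baseline chosen for the example, not necessarily the aggregation procedure used by the authors.

```python
from itertools import combinations


def disagreements(c1, c2):
    """Count pairs (u, v) placed together by one clustering and apart by the other.

    c1 and c2 are lists of cluster labels over the same objects (assumed format).
    """
    if len(c1) != len(c2):
        raise ValueError("both clusterings must label the same objects")
    count = 0
    for u, v in combinations(range(len(c1)), 2):
        together_in_c1 = c1[u] == c1[v]
        together_in_c2 = c2[u] == c2[v]
        if together_in_c1 != together_in_c2:
            count += 1
    return count


def total_disagreements(candidate, clusterings):
    """Sum of the disagreements between a candidate and every input clustering."""
    return sum(disagreements(candidate, c) for c in clusterings)


def best_input_clustering(clusterings):
    """Baseline aggregation (an assumption of this sketch): return the input
    clustering whose total disagreement with the inputs is smallest."""
    return min(clusterings, key=lambda c: total_disagreements(c, clusterings))


# Three hypothetical clusterings of five documents, e.g. one per XML element.
by_element_a = [0, 0, 1, 1, 1]
by_element_b = [0, 0, 0, 1, 1]
by_element_c = [0, 1, 1, 1, 1]
clusterings = [by_element_a, by_element_b, by_element_c]

best = best_input_clustering(clusterings)
```

In this toy input, by_element_a disagrees with each of the other two clusterings on 4 pairs, while the other two disagree with each other on 6 pairs, so by_element_a is selected. Because only co-membership of pairs is compared, the count does not depend on how the cluster labels themselves are named.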


XML, classification, aggregation, disagreements





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Safia Nait Bahloul (1)
  • Baroudi Rouba (1)
  • Youssef Amghar (2)
  1. Computer Department, Faculty of Science, Es-Sénia, Oran University, Algeria
  2. INSA de Lyon, LIRIS UMR 5205 CNRS, Villeurbanne, France
