Block Clustering for Web Pages Categorization

  • Malika Charrad
  • Yves Lechevallier
  • Mohamed ben Ahmed
  • Gilbert Saporta
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5788)


With the growth of web-based applications and the increasing popularity of the World Wide Web (WWW), the Web has become the largest source of information available in the world, which makes extracting relevant information increasingly difficult. Moreover, the content of web sites changes constantly, leading to continual changes in Web users’ behaviour. There is therefore significant interest in analysing web content data to better serve users. Our proposed approach, which is based on automatic textual analysis of a web site independently of its usage, attempts to define groups of documents dealing with the same topic. Both document clustering and word clustering are well-studied problems. However, most existing algorithms cluster documents and words separately rather than simultaneously. In this paper, we propose applying a block clustering algorithm to categorize the pages of a web site according to their content. We report the results of our recent tests of the CROKI2 algorithm on a tourist web site.
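
To make the idea of clustering documents and words simultaneously concrete, the following is a minimal sketch of a generic alternating block clustering scheme on a small page-by-word count matrix. It is not the CROKI2 algorithm evaluated in the paper: the squared-error criterion, the function names block_means and block_cluster, the toy counts and the starting partitions are all assumptions chosen for brevity.

```python
def block_means(matrix, rows, cols, k, l):
    """Mean of the entries in each (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            sums[r][c] += matrix[i][j]
            counts[r][c] += 1
    # An empty block gets mean 0.0 (an arbitrary choice for this sketch).
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def block_cluster(matrix, rows, cols, k, l, max_iter=20):
    """Alternately reassign rows and columns to the closest block profile.

    rows / cols are the starting cluster labels; k / l are the numbers of
    row and column clusters. Returns the final (rows, cols) labels.
    """
    rows, cols = list(rows), list(cols)
    n, m = len(matrix), len(matrix[0])
    for _ in range(max_iter):
        # Row step: column partition fixed, move each row to its best cluster.
        means = block_means(matrix, rows, cols, k, l)
        new_rows = [
            min(range(k), key=lambda a: sum(
                (matrix[i][j] - means[a][cols[j]]) ** 2 for j in range(m)))
            for i in range(n)
        ]
        # Column step: row partition fixed, move each column likewise.
        means = block_means(matrix, new_rows, cols, k, l)
        new_cols = [
            min(range(l), key=lambda b: sum(
                (matrix[i][j] - means[new_rows[i]][b]) ** 2 for i in range(n)))
            for j in range(m)
        ]
        if new_rows == rows and new_cols == cols:
            break  # neither partition changed: stop
        rows, cols = new_rows, new_cols
    return rows, cols


# Hypothetical toy data: 4 pages x 4 words (say hotel, room, beach, sea).
counts = [
    [4, 3, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 4, 3],
    [0, 1, 3, 4],
]
page_groups, word_groups = block_cluster(counts, [0, 1, 1, 1], [0, 1, 1, 1], 2, 2)
print(page_groups, word_groups)
```

On this toy matrix the scheme groups the first two pages with the first two words and the last two pages with the last two words. Like other alternating schemes of this kind, the result depends on the starting partitions, so a different start may settle on a poorer grouping.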


Keywords: web content mining, text mining, block clustering





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Malika Charrad (1, 2, 3)
  • Yves Lechevallier (1, 2, 3)
  • Mohamed ben Ahmed (1, 2, 3)
  • Gilbert Saporta (1, 2, 3)
  1. National School of Computer Sciences, Manouba, Tunisia
  2. INRIA-Rocquencourt, Le Chesnay Cedex, France
  3. Conservatoire National des Arts et Métiers, Paris, France