Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering

  • Alberto Pérez García-Plaza
  • Víctor Fresno
  • Raquel Martínez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7182)

Abstract

Document representation is an essential step in web page clustering. Web pages are usually written in HTML, whose markup offers useful information for selecting the most important features to represent them. In this paper we investigate the use of nonlinear combinations of criteria, by means of a fuzzy system, to find those important features. We start from a term weighting function called Fuzzy Combination of Criteria (fcc), which relies on term frequency, document title, emphasis and term positions in the text. Next, we analyze its drawbacks and explore the possibility of adding contextual information extracted from the anchor texts of inlinks, and we propose an alternative way of combining criteria based on our experimental results. Finally, we apply a statistical significance test to compare the original representation with our proposal.
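As a rough illustration of what a nonlinear, fuzzy combination of the four criteria named in the abstract could look like, the following minimal Python sketch scores a single term. It is not the fcc function itself: the membership functions (`high`, `low`), the four rules, their output levels and the function name `fuzzy_weight` are all hypothetical choices made for this example, and every input is assumed to be already normalized to the range [0, 1].

```python
# Illustrative sketch only: the rule base and membership functions below are
# assumptions made for this example, not the ones defined for fcc.

def high(x: float) -> float:
    """Degree to which a normalized criterion value is 'high' (clipped to [0, 1])."""
    return min(1.0, max(0.0, x))


def low(x: float) -> float:
    """Degree to which a normalized criterion value is 'low'."""
    return 1.0 - high(x)


def fuzzy_weight(frequency: float, title: float,
                 emphasis: float, position: float) -> float:
    """Combine four criteria into one term weight in [0, 1].

    Each rule is a pair (firing strength, output level). AND is modelled
    with min, and the result is the strength-weighted average of the
    output levels of all rules.
    """
    rules = [
        # IF title is high AND emphasis is high THEN importance is very high
        (min(high(title), high(emphasis)), 1.0),
        # IF frequency is high AND position is preferential THEN importance is high
        (min(high(frequency), high(position)), 0.75),
        # IF frequency is high THEN importance is medium
        (high(frequency), 0.5),
        # IF frequency, title and emphasis are all low THEN importance is low
        (min(low(frequency), low(title), low(emphasis)), 0.0),
    ]
    total_strength = sum(strength for strength, _ in rules)
    if total_strength == 0.0:
        # No rule fires for this input, so fall back to the lowest weight.
        return 0.0
    return sum(strength * level for strength, level in rules) / total_strength


if __name__ == "__main__":
    # A rare term that appears in the title and is emphasized.
    print(fuzzy_weight(frequency=0.2, title=1.0, emphasis=1.0, position=0.0))
    # The same term without title or emphasis support.
    print(fuzzy_weight(frequency=0.2, title=0.0, emphasis=0.0, position=0.0))
```

The point of the sketch is that the score is not a fixed linear sum: the same term frequency yields a very different weight depending on which other criteria are satisfied at the same time.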

Keywords

web page representation; fuzzy logic; clustering



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alberto Pérez García-Plaza (1)
  • Víctor Fresno (1)
  • Raquel Martínez (1)
  1. NLP & IR Group, UNED, Madrid, Spain
