Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7182)

Abstract

Document representation is an essential step in web page clustering. Web pages are usually written in HTML, whose markup offers useful information for selecting the most important features to represent them. In this paper we investigate the use of nonlinear combinations of criteria, by means of a fuzzy system, to find those features. We start from a term weighting function called Fuzzy Combination of Criteria (fcc), which relies on term frequency, document title, emphasis, and term positions in the text. Next, we analyze its drawbacks and explore the possibility of adding contextual information extracted from the anchor texts of inlinks, proposing an alternative way of combining criteria based on our experimental results. Finally, we apply a statistical significance test to compare the original representation with our proposal.
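
To make the idea of a nonlinear, fuzzy combination of criteria concrete, the following is a minimal illustrative sketch in Python. It is not the fcc rule base from the paper, which this page does not reproduce. The function name `fuzzy_weight`, the linear "low"/"high" membership functions, the product conjunction, and the consequent values in `CONSEQUENTS` are all assumptions chosen for illustration, and each input is assumed to be already normalized to [0, 1].

```python
from itertools import product

# Hypothetical rule consequents: the importance assigned to a term when
# exactly n of the four criteria are "high". These values are illustrative
# assumptions, not taken from the paper.
CONSEQUENTS = {0: 0.0, 1: 0.2, 2: 0.5, 3: 0.8, 4: 1.0}


def low(x: float) -> float:
    """Membership degree of x in the fuzzy set 'low' (assumed linear)."""
    return 1.0 - x


def high(x: float) -> float:
    """Membership degree of x in the fuzzy set 'high' (assumed linear)."""
    return x


def fuzzy_weight(frequency: float, title: float,
                 emphasis: float, position: float) -> float:
    """Combine four criteria, each in [0, 1], into one term weight.

    One rule is evaluated per low/high labelling of the four inputs.
    The firing strength of a rule is the product of its memberships,
    and the output is the strength-weighted average of the consequents.
    """
    inputs = (frequency, title, emphasis, position)
    if any(not 0.0 <= x <= 1.0 for x in inputs):
        raise ValueError("all criteria must be normalized to [0, 1]")

    numerator = 0.0
    denominator = 0.0
    for labels in product((0, 1), repeat=len(inputs)):
        strength = 1.0
        for x, is_high in zip(inputs, labels):
            strength *= high(x) if is_high else low(x)
        numerator += strength * CONSEQUENTS[sum(labels)]
        denominator += strength
    return numerator / denominator


if __name__ == "__main__":
    # A term that is frequent and appears in the title, but is neither
    # emphasized nor well positioned.
    print(fuzzy_weight(frequency=0.9, title=1.0, emphasis=0.0, position=0.1))
```

Because the consequents are not proportional to the number of satisfied criteria, the resulting weight is not a simple linear sum of the inputs. A contextual criterion such as anchor-text evidence could be added as a fifth input under the same scheme, again as an assumption rather than the method the paper proposes.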

The authors thank the Spanish research projects MA2VICMR (S2009/TIC-1542) and Holopedia: the automatic encyclopedia of people and organizations (TIN2010-21128-C02) for their financial support of this research.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pérez García-Plaza, A., Fresno, V., Martínez, R. (2012). Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_14

  • DOI: https://doi.org/10.1007/978-3-642-28601-8_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28600-1

  • Online ISBN: 978-3-642-28601-8

  • eBook Packages: Computer Science, Computer Science (R0)
