Web Page Classification: A Soft Computing Approach

  • Angela Ribeiro
  • Víctor Fresno
  • María C. Garcia-Alegre
  • Domingo Guinea
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2663)


The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth experienced by the Net has generated a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming an urgent issue. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. Each page is represented as a set of doublets, each of which associates a weight with one of the most representative words of the web document so as to characterize that word's relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated by a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated with two significantly different classes of web pages.
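To make the doublet idea concrete, the following is a minimal sketch of how a (word, weight) representation might be derived from HTML markup. It is not the weighting scheme of the paper, which the abstract does not specify: the `TAG_EMPHASIS` factors, the `WeightedWordParser` class, the `doublets` function and the normalization to the range (0, 1] are all assumptions made purely for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical emphasis factors per HTML tag; illustrative values,
# not taken from the paper.
TAG_EMPHASIS = {"title": 3.0, "h1": 2.0, "b": 1.5, "strong": 1.5}


class WeightedWordParser(HTMLParser):
    """Accumulates an emphasis-weighted score for every word in a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> accumulated score

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # A word counts 1.0 in plain text and more inside emphasized tags.
        factor = max(
            (TAG_EMPHASIS.get(t, 1.0) for t in self.open_tags), default=1.0
        )
        for word in re.findall(r"[a-z]+", data.lower()):
            self.scores[word] += factor


def doublets(html, top_n=5):
    """Return up to top_n (word, weight) pairs, weights scaled to (0, 1]."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return []
    best = max(parser.scores.values())
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score / best) for word, score in ranked[:top_n]]


if __name__ == "__main__":
    page = (
        "<html><head><title>Fuzzy rules</title></head>"
        "<body><p>fuzzy <b>classifier</b> rules rules</p></body></html>"
    )
    for word, weight in doublets(page):
        print(word, round(weight, 2))
```

In a full system along the lines the abstract describes, weights of this kind would feed the fuzzy-rule-based classifier; the rule base and the genetic search over rule sets are outside the scope of this sketch.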





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Angela Ribeiro (1)
  • Víctor Fresno (2)
  • María C. Garcia-Alegre (1)
  • Domingo Guinea (1)
  1. Industrial Automation Institute, Spanish Council for Scientific Research, Arganda del Rey, Madrid, Spain
  2. Escuela Superior de Ciencia y Tecnología, Universidad Rey Juan Carlos, Spain
