Web Page Classification: A Soft Computing Approach
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth of the Net has produced a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming urgent. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. A doublet representation is used that associates a weight with each of the most representative words of a web document so as to characterize that word's relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated by a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated on two significantly different classes of web pages.
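To make the doublet idea concrete, the following is a minimal sketch of how (word, weight) pairs might be derived from HTML markup. It is not the paper's fuzzy system: the `TAG_WEIGHTS` table, the `WordWeigher` class, the `doublets` function, and the rule of boosting a word by the strongest enclosing tag are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag boosts (an assumption, not values from the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "strong": 1.5, "em": 1.5}


class WordWeigher(HTMLParser):
    """Accumulates a score per word, boosted by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Use the strongest boost among the currently open tags (default 1.0).
        boost = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.scores[word] += boost


def doublets(html, top_k=5):
    """Return up to top_k (word, weight) pairs with weights scaled to (0, 1]."""
    parser = WordWeigher()
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return []
    top = max(parser.scores.values())
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score / top) for word, score in ranked[:top_k]]
```

Calling `doublets(page_html, top_k=10)` on a page's source would give a short list of weighted words that a rule-based classifier could consume. A real system would likely also remove stop words and apply stemming, which this sketch omits.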