Abstract
Web page classification is one of the essential techniques for Web mining. This paper presents a framework for Web page classification built on a hybrid architecture of a neural network PCA (principal component analysis) and an SOFM (self-organizing feature map). To perform the classification, a Web page is first represented by a vector of features, weighted according to term frequency and the importance of each sentence in the page. Because the number of features is large, PCA is used to select the relevant features. Finally, the output of PCA is sent to the SOFM for classification. For comparison with the proposed framework, two conventional classifiers, k-NN and Naïve Bayes, are used in our experiments. The new method significantly improves classification on both data sets compared with the two conventional methods.
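The abstract does not give the weighting formula, the PCA network, or the SOFM settings, so the following is only a minimal sketch of the three stages it names, under stated assumptions. Sentence weights are supplied by the caller, PCA is computed by power iteration rather than by a neural network, the map is one-dimensional, and each map unit is labelled by majority vote over the training pages it wins. All function names, parameters, and the toy pages are hypothetical.

```python
import math
from collections import Counter


def weighted_term_vector(sentences, weights, vocabulary):
    """Count terms, scaling each occurrence by its sentence's weight."""
    totals = Counter()
    for sentence, weight in zip(sentences, weights):
        for term in sentence.lower().split():
            totals[term] += weight
    return [float(totals[term]) for term in vocabulary]


def _normalize(v):
    """Return v scaled to unit length, or None if v is (almost) zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 1e-12 else None


def pca_project(rows, n_components, iterations=100):
    """Project rows onto their leading principal components.

    Assumption: power iteration with deflation on the covariance matrix.
    Fine for a toy example, too slow for a realistic vocabulary.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for k in range(n_components):
        v = _normalize([1.0 + 0.1 * (i + k) for i in range(d)])  # fixed start
        for _ in range(iterations):
            w = _normalize([sum(cov[i][j] * v[j] for j in range(d))
                            for i in range(d)])
            if w is None:  # no variance left in any direction
                break
            v = w
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        # Deflate: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
        components.append(v)
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


def best_unit(units, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(units)),
               key=lambda u: sum((a - b) ** 2 for a, b in zip(units[u], x)))


def train_som(data, n_units, epochs=50):
    """Train a one-dimensional SOM and return its unit weight vectors."""
    # Deterministic start (a simplification): copy evenly spaced rows.
    step = max(n_units - 1, 1)
    units = [list(data[(u * (len(data) - 1)) // step]) for u in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        rate = 0.5 * decay                       # shrinking learning rate
        radius = max(n_units / 2.0 * decay, 0.1)  # shrinking neighbourhood
        for x in data:
            winner = best_unit(units, x)
            for u, unit in enumerate(units):
                pull = rate * math.exp(-((u - winner) ** 2) / (2 * radius ** 2))
                for j in range(len(x)):
                    unit[j] += pull * (x[j] - unit[j])
    return units


def label_units(units, data, labels):
    """Give each unit the majority label of the training rows it wins."""
    votes = [Counter() for _ in units]
    for x, label in zip(data, labels):
        votes[best_unit(units, x)][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


if __name__ == "__main__":
    vocabulary = ["goal", "team", "match", "stock", "market", "price"]
    pages = [  # (sentences, sentence weights, label); invented toy data
        (["Team wins match", "late goal"], [2.0, 1.0], "sport"),
        (["Goal decides match", "team celebrates"], [2.0, 1.0], "sport"),
        (["Stock market falls", "price drops"], [2.0, 1.0], "finance"),
        (["Market price rises", "stock gains"], [2.0, 1.0], "finance"),
    ]
    vectors = [weighted_term_vector(s, w, vocabulary) for s, w, _ in pages]
    labels = [label for _, _, label in pages]
    reduced = pca_project(vectors, n_components=2)
    units = train_som(reduced, n_units=2)
    unit_labels = label_units(units, reduced, labels)
    for point, label in zip(reduced, labels):
        print(label, "->", unit_labels[best_unit(units, point)])
```

A new page would be classified by building its weighted vector, projecting it with the same components, and reading the label of its best-matching unit. A production version would more plausibly use a numerical library for the PCA step and a two-dimensional map.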
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Y., Cao, Y., Zhu, Q., Zhu, Z. (2005). A Novel Framework for Web Page Classification Using Two-Stage Neural Network. In: Li, X., Wang, S., Dong, Z.Y. (eds) Advanced Data Mining and Applications. ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527503_60
DOI: https://doi.org/10.1007/11527503_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27894-8
Online ISBN: 978-3-540-31877-4
eBook Packages: Computer Science, Computer Science (R0)