Neural Computing & Applications, Volume 13, Issue 3, pp 229–236

A generalised regression algorithm for Web page categorisation

  • Ioannis Anagnostopoulos
  • Christos Anagnostopoulos
  • George Kouzas
  • Dimitrios D. Vergados
Original Article

Abstract

This paper proposes an information system that classifies Web pages according to a taxonomy that is mainly used by seven search engines/directories. The proposed classifier is a four-layer generalised regression neural network (GRNN) that segments information using information filtering techniques applied to content descriptor vectors. Eight categories of Web pages were used to evaluate the robustness of the method, and no restrictions were imposed other than that the content be in English. The system can serve as an assistive and consultative tool for classification, as well as for estimating the population of Web pages at any given point in time.
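As a rough illustration of how a GRNN can act as a classifier, the sketch below implements the standard kernel-weighted form of a general regression network with one-hot class targets. It is a minimal sketch under stated assumptions, not the authors' system: the function names (`grnn_predict`, `classify`), the `width` value and the toy two-dimensional vectors are all hypothetical, and the paper's actual content descriptor vectors are not reproduced here.

```python
import numpy as np

def grnn_predict(x, samples, targets, width=0.5):
    """Standard GRNN estimate for a single query vector x (illustrative)."""
    samples = np.asarray(samples, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Squared Euclidean distance between x and every stored sample.
    d2 = np.sum((samples - np.asarray(x, dtype=float)) ** 2, axis=1)
    # Pattern layer: one Gaussian kernel per stored sample. Subtracting the
    # minimum distance leaves the final ratio unchanged but avoids underflow.
    weights = np.exp(-(d2 - d2.min()) / (2.0 * width ** 2))
    # Summation and output layers: kernel-weighted average of the targets.
    return weights @ targets / weights.sum()

def classify(x, samples, labels, n_classes, width=0.5):
    """Pick the class with the largest GRNN score (assumed decision rule)."""
    one_hot = np.eye(n_classes)[np.asarray(labels)]
    scores = grnn_predict(x, samples, one_hot, width)
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    # Hypothetical toy descriptors: two clusters standing in for two categories.
    samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    labels = [0, 0, 1, 1]
    print(classify([0.2, 0.3], samples, labels, n_classes=2))
```

Because the targets are one-hot, the scores sum to one and can be read as soft class memberships; a GRNN of this kind needs no iterative training, only the stored samples and a choice of width.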

Keywords

Neural network, Web page classification, GRNN

List of symbols

\(tf_k\)

Normalised frequency of term k

\(idf_k\)

Inverse document frequency of term k

hf

Tag hierarchical rating

\(\bar x\)

Mean value

σ

Variance (distributions of normalised and inverse document frequencies over the terms’ rank order)

f(x,z)

The joint probability density function (pdf) of the vector random variable x and the scalar random variable z

\(D_i\)

The Euclidean distance between the vector random variable x and the sample points \(x_i\)

\( \bar \sigma \)

A width parameter, which satisfies the asymptotic behaviour as the number of Parzen windows becomes large

β

The ‘beta’ coefficient for all the local approximators in the middle layer of the proposed neural network classifier
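
For orientation, several of the symbols above fit together in the textbook form of the general regression estimator (Specht, 1991). The expression below is that standard form, not necessarily the exact network used by the authors. It assumes that \(\bar \sigma\) plays the role of the kernel width, and it introduces \(z_i\) (not in the list above) for the target value stored with sample \(x_i\) and \(n\) for the number of samples.

\[
D_i^2 = (x - x_i)^{T}(x - x_i), \qquad
\hat z(x) = \frac{\sum_{i=1}^{n} z_i \exp\left(-D_i^2 / 2\bar\sigma^2\right)}{\sum_{i=1}^{n} \exp\left(-D_i^2 / 2\bar\sigma^2\right)}
\]

Under these assumptions \(\hat z(x)\) is the conditional mean of z given x obtained when f(x,z) is replaced by its Parzen window estimate.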


Copyright information

© Springer-Verlag London Limited 2004

Authors and Affiliations

  • Ioannis Anagnostopoulos (1)
  • Christos Anagnostopoulos (2)
  • George Kouzas (1)
  • Dimitrios D. Vergados (3)

  1. School of Electrical and Computer Engineering, Department of Communication, Electronics and Information Systems, National Technical University of Athens, Athens, Greece
  2. Department of Cultural Technology and Communications, University of the Aegean, Mytilini-Lesvos, Greece
  3. Department of Information and Communication Systems Engineering, University of the Aegean, Karlovassi-Samos, Greece
