Applying Semantic Links for Classifying Web Pages

  • Ben Choi
  • Qing Guo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2718)

Abstract

Automatic hypertext classification is an essential technique for organizing vast amounts of Internet Web pages or HTML documents. One of the problems in classifying Web pages is that they are usually short and contain insufficient text to clearly identify their category. Text classification mechanisms that analyze only the contents of the document itself are relatively ineffective in classifying short Web pages. This paper proposes a new hypertext classification mechanism that addresses the problem by analyzing not only the Web page itself but also the linked Web pages referred to by the URLs contained within the page. The URLs are treated as semantic links. The hypothesis is that the linked Web pages contain related information that helps identify the category of the Web page. Experimental results show that the proposed approach could increase the accuracy by 35% over the approach of analyzing only the Web page itself.
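
The idea of treating URLs as semantic links can be illustrated with a minimal sketch. The abstract does not specify the classifier, the term weighting, or how linked text is combined with the page's own text, so everything below is an assumption made for illustration: the bag-of-words vectors, the cosine similarity against per-category term profiles, the hypothetical `link_weight` parameter, and the toy data. It is not the authors' implementation.

```python
from collections import Counter
import math
import re


def term_vector(text):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_vector(page_text, linked_texts, link_weight=0.5):
    """Page terms plus down-weighted terms from its linked pages.

    link_weight is a hypothetical parameter, not taken from the paper.
    """
    vec = Counter(term_vector(page_text))
    for text in linked_texts:
        for term, count in term_vector(text).items():
            vec[term] += link_weight * count
    return vec


def classify(vec, category_profiles):
    """Return the most similar category, or None if nothing matches."""
    best = max(category_profiles,
               key=lambda c: cosine(vec, category_profiles[c]))
    return best if cosine(vec, category_profiles[best]) > 0 else None


# Toy category profiles and a short page with little usable text.
profiles = {
    "sports": term_vector("football match team score league player goal"),
    "finance": term_vector("stock market bank interest loan investor fund"),
}
page = "Welcome to our home page. Click the links below."
linked = [
    "League table and match score for every team",
    "Player transfers and goal statistics",
]

own_only = classify(term_vector(page), profiles)
with_links = classify(combined_vector(page, linked), profiles)
print(own_only)    # None: the page text alone shares no terms with any profile
print(with_links)  # sports: the linked pages supply the missing evidence
```

In this toy case the short page cannot be classified from its own text, while the text reached through its links is enough to pick a category, which is the situation the hypothesis describes.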

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Ben Choi
    • 1
  • Qing Guo
    • 1
  1. Computer Science, College of Engineering and Science, Louisiana Tech University, Ruston, USA
