An Automatic Approach to Classify Web Documents Using a Domain Ontology

  • Mu-Hee Song
  • Soo-Yeon Lim
  • Seong-Bae Park
  • Dong-Jin Kang
  • Sang-Jo Lee
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)

Abstract

This paper proposes an automated method for document classification using an ontology, which expresses the terminology information and vocabulary contained in Web documents as a hierarchical structure. Ontology-based document classification involves determining the document features that represent the Web documents most accurately and then, after analyzing their contents, classifying the documents into the most appropriate of at least two predefined categories based on those features. In this paper, Web documents are classified in real time, not with experimental data or a learning process, but by calculating the similarity between the terminology information extracted from Web texts and the ontology categories. Because the meanings and relationships unique to each document are determined, this yields more accurate document classification.
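The similarity-based classification described above can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the ontology contents, so everything below is an assumption made for illustration: the toy `ONTOLOGY` dictionary, its term weights, the regex-based `extract_terms` helper, and the choice of cosine similarity are hypothetical and are not taken from the paper. The sketch also treats categories as flat term sets and ignores the hierarchical structure of a real ontology.

```python
import math
import re
from collections import Counter

# Hypothetical toy ontology: category -> weighted terms (illustrative only,
# not the ontology used in the paper).
ONTOLOGY = {
    "economy": {"market": 1.0, "stock": 1.0, "bank": 0.8, "trade": 0.8},
    "sports": {"team": 1.0, "match": 1.0, "goal": 0.8, "league": 0.8},
}


def extract_terms(text):
    """Lowercase the text and count alphabetic tokens (a crude term extractor)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(doc_terms, category_terms):
    """Cosine similarity between document term counts and category term weights."""
    dot = sum(freq * category_terms.get(term, 0.0)
              for term, freq in doc_terms.items())
    doc_norm = math.sqrt(sum(f * f for f in doc_terms.values()))
    cat_norm = math.sqrt(sum(w * w for w in category_terms.values()))
    if doc_norm == 0 or cat_norm == 0:
        return 0.0
    return dot / (doc_norm * cat_norm)


def classify(text, ontology=ONTOLOGY):
    """Return (best_category, scores); best_category is None if no term matches."""
    terms = extract_terms(text)
    scores = {cat: cosine(terms, cat_terms)
              for cat, cat_terms in ontology.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores
```

Calling `classify("The stock market rallied as bank shares rose")` scores the text against every category and returns the highest-scoring one together with all scores. No training step is involved, which mirrors the abstract's point that classification relies on similarity to ontology categories rather than on a learning process.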

Keywords

Document classification; Ontology; Web Page classification

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Mu-Hee Song (1)
  • Soo-Yeon Lim (1)
  • Seong-Bae Park (1)
  • Dong-Jin Kang (1)
  • Sang-Jo Lee (1)
  1. Dept. of Computer Engineering, Information Technology Services, Kyungpook National University, Daegu, Korea
