Automatic Genre Detection of Web Documents

  • Chul Su Lim
  • Kong Joo Lee
  • Gil Chang Kim
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)


A genre, or style, offers a view of a document that is distinct from its subject or topic, and it can also serve as a criterion for classifying documents. Several studies have addressed genre detection for textual documents, but only a few of them have dealt with web documents. In this paper we suggest sets of features for detecting the genres of web documents. Web documents differ from plain textual documents in that they carry URLs and HTML tags. We introduce features specific to web documents, which are extracted from URLs and HTML tags. Experimental results enable us to evaluate the characteristics and performance of these features.
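To make the idea of URL- and HTML-derived features concrete, the following is a minimal sketch in Python. The abstract does not list the paper's actual feature sets, so the features shown here (URL path depth, file extension, presence of a query string, and relative HTML tag frequencies) and the helper names `url_features`, `html_features`, and `TagCounter` are assumptions chosen for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.counts[tag] += 1


def url_features(url):
    """Hypothetical URL features; not the feature set used in the paper."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1] if segments else ""
    extension = last.rsplit(".", 1)[1].lower() if "." in last else ""
    return {
        "url_depth": len(segments),        # number of path segments
        "url_extension": extension,        # e.g. "html", "" if none
        "url_has_query": bool(parsed.query),
    }


def html_features(html):
    """Hypothetical HTML features: relative frequency of each start tag."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}
    return {f"tag_{tag}": n / total for tag, n in parser.counts.items()}
```

A feature vector for one page could then be formed by merging the two dictionaries, for example `{**url_features(url), **html_features(html)}`, before passing it to whichever classifier is being evaluated.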





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Chul Su Lim (1)
  • Kong Joo Lee (2)
  • Gil Chang Kim (1)
  1. Division of Computer Science, Department of EECS, KAIST, Taejon
  2. School of Computer & Information Technology, KyungIn Women’s College, Incheon
