KI 2004: Advances in Artificial Intelligence, pp. 256-269
Genre Classification of Web Pages
Abstract
Genre classification means discriminating between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents.
While most existing investigations of automated genre classification are based on corpora of news articles, the idea is applied here to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services closer to a user’s information need. This objective raises two questions:
- 1. What are useful genres when searching the WWW?
- 2. Can these genres be reliably identified?
This paper presents results from a user study on the usefulness of Web genres as well as results from the construction of a genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is paid to the classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are based on word frequency classes and that can be computed with minimal computational effort. They allow us to construct compact feature sets with few elements, with which a satisfactory genre diversification is achieved. About 70% of the Web documents are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published so far.
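The abstract does not spell out how the word frequency class features are computed. As an illustration only, the sketch below assumes one common definition, in which a word's class is floor(log2(f_max / f_word)) with f_max the count of the most frequent word in a reference corpus. The function names, the summary features, and the toy counts are hypothetical and are not taken from the paper.

```python
import math


def word_frequency_classes(reference_counts):
    """Map each word to floor(log2(f_max / f_word)).

    Assumed definition: the most frequent word gets class 0, and
    rarer words get higher classes.
    """
    f_max = max(reference_counts.values())
    return {
        word: math.floor(math.log2(f_max / count))
        for word, count in reference_counts.items()
    }


def frequency_class_features(tokens, classes):
    """Summarise a tokenised page by the classes of its known words.

    The three summary values are illustrative choices, not the
    feature set used by the authors.
    """
    known = [classes[t] for t in tokens if t in classes]
    if not known:
        return {
            "mean_class": 0.0,
            "max_class": 0,
            "unknown_share": 1.0 if tokens else 0.0,
        }
    return {
        "mean_class": sum(known) / len(known),
        "max_class": max(known),
        "unknown_share": 1 - len(known) / len(tokens),
    }


# Hypothetical reference counts; a real system would use a large corpus.
reference_counts = {"the": 1024, "of": 512, "genre": 4, "discriminant": 1}
classes = word_frequency_classes(reference_counts)
features = frequency_class_features(
    ["the", "genre", "of", "the", "discriminant"], classes
)
```

Under this assumed definition, each token costs one dictionary lookup, which is consistent with the abstract's remark that such features need little computational effort.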
Keywords
Genre Classification, Machine Learning, User Study, Information Need, Information Retrieval, WWW