Cross-Testing a Genre Classification Model for the Web

  • Marina Santini
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)


The main aim of the experiments described in this chapter is to investigate ways of assessing the robustness and stability of an Automatic Genre Identification (AGI) model for the web. More specifically, a series of comparisons using four genre collections is illustrated and analysed. I call this comparative approach cross-testing.


Keywords: Genre classification; Genre corpora; Multi-labelled classification; Web genre modelling; Cross-testing; Genre evaluation



Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. KYH, Stockholm, Sweden
