In the Garden and in the Jungle

Comparing Genres in the BNC and Internet
  • Serge Sharoff
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)


The jungle metaphor is quite common in genre studies. This paper presents a set of genre categories for comparing web-derived corpora with their traditional counterparts (such as the BNC), and a set of methods for automatically assessing the genre composition of newly collected web corpora.


Comparability of corpora; Assessing corpus composition; SVM for genre classification



I’m grateful to Silvia Bernardini, Adam Kilgarriff, Katja Markert and Marina Santini for useful discussions. The usual disclaimers apply. The tools for genre classification described in this chapter and the results of classifications of the Internet corpora are available from



Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Centre for Translation Studies, University of Leeds, Leeds, UK
