Abstract
The jungle metaphor is quite common in genre studies. This paper presents a set of genre categories to compare web-derived corpora to their traditional counterparts (such as the BNC), and a set of methods for automatic assessment of the genre composition of newly collected webcorpora.
Notes
- 1. Throughout this chapter I refer to BNC texts using their ids from the BNC Index, which is available from http://clix.to/davidlee00
- 2. The quote refers to the purposes Michael Halliday intended for his “Introduction to Functional Grammar” [11].
- 3. This example assumes that the function of narration is actively used in the respective societies for approximately the same purposes, but for modern corpora this can be taken for granted.
- 4. A similar pattern is evident in the accuracy drop from about 90% in the “crisp” 7-webgenre corpus to 66% in a fuzzy KI-04 corpus in experiments described in [22].
- 5. The BNC has been retagged with TreeTagger, the same tool used for tagging I-EN, so there was no difference in the tagset and tagging between the two corpora (this could have caused variations in accuracy otherwise).
- 6.
References
Allen, P., J.A. Bateman, and J.L. Delin. 1999. Genre and layout in multimodal documents: Towards an empirical account. In Proceedings of the AAAI Fall Symposium on Using Layout for the Generation, Understanding, or Retrieval of Documents, eds. R. Power and D. Scott, 27–34. Cape Cod, MA: American Association for Artificial Intelligence. URL http://www.fb10.uni-bremen.de/anglistik/langpro/projects/gem/downloads/allen-bateman-delin.PDF
Baroni, M., and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004. Lisbon.
Baroni, M., and A. Kilgarriff. 2006. Large linguistically-processed Web corpora for multiple languages. In Companion Volume to Proceedings of the European Association of Computational Linguistics, 87–90. Trento.
Baroni, M., F. Chantree, A. Kilgarriff, and S. Sharoff. 2008. Cleaneval: A competition for cleaning web pages. In Proceedings of the 6th Language Resources and Evaluation Conference, LREC 2008. Marrakech. URL http://corpus.leeds.ac.uk/serge/publications/lrec2008-cleaneval.pdf
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., and J. Kurjian. 2006. Towards a taxonomy of web registers and text types: A multidimensional analysis. In Corpus linguistics and the web, eds. M. Hundt, N. Nesselhauf, and C. Biewer, 109–131. Amsterdam: Rodopi.
Braslavski, P. 2004. Document style recognition using shallow statistical analysis. In ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP, 1–9. Nancy.
Crossley, S.A., and M. Louwerse. 2007. Multi-dimensional register classification using bigrams. International Journal of Corpus Linguistics 12(4):453–478.
EAGLES. 1996. Preliminary recommendations on text typology. Technical Report EAG-TCWG-TTYP/P, Expert Advisory Group on Language Engineering Standards document. URL http://www.ilc.cnr.it/EAGLES96/texttyp/texttyp.html
Ferraresi, A. 2007. Building a very large corpus of English obtained by web crawling: ukWaC. Master’s thesis, University of Bologna.
Halliday, M.A.K. 1985. An introduction to functional grammar. London: Edward Arnold.
Jakobson, R. 1960. Linguistics and poetics. In Style in Language, ed. T.A. Sebeok, 350–377. Cambridge, MA: MIT Press.
Joho, H., and M. Sanderson. 2004. The SPIRIT collection: An overview of a large web collection. SIGIR Forum 38(2):57–61. doi: http://doi.acm.org/10.1145/1041394.1041395
Kessler, B., G. Nunberg, and H. Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th ACL/8th EACL, 32–38. Madrid.
Kilgarriff, A. 2001. The web as corpus. In Proceedings of Corpus Linguistics 2001. Lancaster. URL http://www.itri.bton.ac.uk/techreports/ITRI-01-14.abs.html
Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3):37–72. URL http://llt.msu.edu/vol5num3/pdf/lee.pdf
Macdonald, C., and I. Ounis. 2006. The TREC blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, Department of Computing Science, University of Glasgow. URL http://ir.dcs.gla.ac.uk/terrier/publications/macdonald06creating.pdf
Martin, J.R. 1984. Language, register and genre. In Children Writing: Reader (ECT language studies: Children writing), ed. F. Christie, 21–30. Geelong, VIC: Deakin University Press.
Mehler, A., and R. Gleim. 2006. The net for the graphs – towards webgenre representation for corpus linguistic studies. In WaCky! Working papers on the Web as Corpus, eds. M. Baroni and S. Bernardini. Bologna: Gedit.
Meyer zu Eissen, S., and B. Stein. 2004. Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence. Ulm.
Rehm, G., M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin. 2008. Towards a reference corpus of web genres for the evaluation of genre identification systems. In Proceedings of the 6th Language Resources and Evaluation Conference, LREC 2008. Marrakech.
Santini, M. 2007. Automatic identification of genre in web pages. PhD thesis, University of Brighton.
Sharoff, S. 2005. Methods and tools for development of the Russian reference corpus. In Corpus linguistics around the world, eds. D. Archer, A. Wilson, and P. Rayson, 167–180. Amsterdam: Rodopi.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In WaCky! Working papers on the Web as Corpus, eds. M. Baroni and S. Bernardini. Bologna: Gedit. URL http://wackybook.sslmit.unibo.it
Sharoff, S. 2007. Classifying web corpora into domain and genre using automatic feature identification. In Proceedings of Web as Corpus Workshop. Louvain-la-Neuve.
Sinclair, J. 2003. Corpora for lexicography. In A practical guide to lexicography, ed. P. van Sterkenburg, 167–178. Amsterdam: Benjamins.
Vidulin, V., M. Luštrek, and M. Gams. 2007. Using genres to improve search engines. In Proceedings of Towards Genre-Enabled Search Engines: The Impact of NLP, RANLP. URL http://dis.ijs.si/MitjaL/documents/Vidulin-Using_Genres_to_Improve_Search_Engines-RANLP-07-TGESE.pdf
Witten, I.H., and E. Frank. 2005. Data Mining: Practical machine learning tools and techniques. San Francisco, CA: Morgan Kaufmann.
Xiao, Z., and A. McEnery. 2005. Three genres in modern American English. Journal of English Linguistics 33(1):62–82.
Acknowledgements
I’m grateful to Silvia Bernardini, Adam Kilgarriff, Katja Markert and Marina Santini for useful discussions. The usual disclaimers apply. The tools for genre classification described in this chapter and the results of classifications of the Internet corpora are available from http://corpus.leeds.ac.uk/serge/webgenres/
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Sharoff, S. (2010). In the Garden and in the Jungle. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_7
DOI: https://doi.org/10.1007/978-90-481-9178-9_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9177-2
Online ISBN: 978-90-481-9178-9
eBook Packages: Computer Science, Computer Science (R0)