Riding the Rough Waves of Genre on the Web

Concepts and Research Questions
  • Marina Santini
  • Alexander Mehler
  • Serge Sharoff
Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)

Abstract

This chapter outlines the state of the art of empirical and computational webgenre research. First, it highlights why the concept of genre is profitable for a range of disciplines. At the same time, it lists a number of recent interpretations that can inform and influence present and future genre research. Last but not least, it breaks down a series of open issues that relate to the modelling of the concept of webgenre in empirical and computational studies.

Keywords

Genre Web Automatic Classification Web genre Web documents 

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Marina Santini (1)
  • Alexander Mehler (2)
  • Serge Sharoff (3)

  1. KYH, Stockholm, Sweden
  2. Computer Science and Mathematics, Goethe-Universität Frankfurt am Main, Frankfurt am Main, Germany
  3. Centre for Translation Studies, University of Leeds, Leeds, UK
