Hypertext Types and Markup Languages

The Relationship Between HTML and Web Genres

  • Chapter
Linguistic Modeling of Information and Markup Languages

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

Text technological applications such as automatic summarisation or information extraction systems often process web documents. An important aspect of web documents that most systems ignore is the document type, or genre, and the relationship that individual genres, and genres in general, have to the Hypertext Markup Language (HTML). This chapter introduces the concept of hypertext types and highlights some of their most relevant aspects and applications. Hypertext types are very similar to traditional text types: hypertexts can be grouped into categories that share certain linguistic, textual, or pragmatic features such as communicative function, hypertextual structure, or content. Processing web documents based on their genres has several advantages; the most prominent application is the automatic identification of web genres for improved information retrieval approaches.

Author information

Corresponding author

Correspondence to Georg Rehm.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Rehm, G. (2010). Hypertext Types and Markup Languages. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_8

  • DOI: https://doi.org/10.1007/978-90-481-3331-4_8

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-3330-7

  • Online ISBN: 978-90-481-3331-4

  • eBook Packages: Computer Science, Computer Science (R0)
