Abstract
Text technological applications such as automatic summarisation or information extraction systems often process web documents. An important aspect of web documents that most systems ignore is the document type, or genre, and the relationship that a particular genre, and genres in general, have to the Hypertext Markup Language (HTML). This chapter introduces the concept of hypertext types and highlights some of their most relevant aspects and applications. Hypertext types are very similar to traditional text types: hypertexts can be grouped into categories that share certain linguistic, textual, or pragmatic features such as communicative function, hypertextual structure, or content. Processing web documents based on their genres offers several advantages, the most prominent of which is the automatic identification of web genres for improved information retrieval approaches.
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Rehm, G. (2010). Hypertext Types and Markup Languages. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_8
DOI: https://doi.org/10.1007/978-90-481-3331-4_8
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science (R0)