Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues

Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)

Abstract

People who search the World Wide Web often have a multi-faceted understanding of their information need: they know what they are searching for, and they know what form or type the desired documents should have. The former aspect relates to the content of a desired document (its topic), the latter to the presentation of that content and the intended target group. Owing to the different user groups and the technical means of the World Wide Web, several popular specializations of Web documents have emerged: a document may contain many links (e.g. a link collection), scientific text (e.g. a research article), almost no text but pictures (e.g. an advertisement page), or a short answer to a specific question (e.g. a message in a help forum). These examples suggest that it is of much help if the retrieval process is capable of addressing a user’s information need with regard to what is called here “genre” or “Web genre”.

This chapter contributes to Web genre analysis. It presents relevant use cases, discusses existing and new technology for the construction of Web genre retrieval models, and outlines implementation aspects for a genre-enabled Web search. Special focus is put on the generalization capability of Web genre retrieval models, for which we present new evaluation measures and, for the first time, a quantitative analysis.

Keywords

Web Genre; Use Cases; Text Classification; Evaluation Measures

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Benno Stein (1)
  • Sven Meyer zu Eissen (1)
  • Nedim Lipka (1)
  1. Faculty of Media/Media Systems, Bauhaus-Universität Weimar, Weimar, Germany
