
Classification of Web Sites at Super-genre Level

Genres on the Web

Part of the book series: Text, Speech and Language Technology (TLTB, volume 42)

Abstract

The World Wide Web has developed into a central source of information, an important marketplace, a prominent presentation platform, and a popular meeting place, to name only a few of its roles. Furthermore, the ever-growing number of users and content creators drives the rapid evolution and emergence of different Web sites. As a consequence, it is increasingly difficult to identify the Web sites that provide the information and services of interest.

Author information

Corresponding author

Correspondence to Christoph Lindemann.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Lindemann, C., Littig, L. (2010). Classification of Web Sites at Super-genre Level. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_10

  • DOI: https://doi.org/10.1007/978-90-481-9178-9_10

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-9177-2

  • Online ISBN: 978-90-481-9178-9

  • eBook Packages: Computer Science, Computer Science (R0)
