Classification of Web Sites at Super-genre Level

  • Christoph Lindemann
  • Lars Littig
Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)

Abstract

The World Wide Web has developed into a central source of information, an important marketplace, a highly visible presentation platform, and a frequented meeting place, to name only a few of its roles. Furthermore, the ever-growing number of users and content creators drives the rapid evolution of existing Web sites and the emergence of new ones. As a consequence, it is increasingly difficult to identify the Web sites that provide the information and services of interest.

Keywords

Web site classification · Web mining · Web measurement

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Department of Computer Science, University of Leipzig, Leipzig, Germany
