Classification of Web Sites at Super-genre Level

Lindemann, Christoph; Littig, Lars

doi:10.1007/978-90-481-9178-9_10

Part of the book series: Text, Speech and Language Technology (TLTB, volume 42)

Abstract

The World Wide Web has developed into a central source of information, an important marketplace, a highly visible presentation platform, and a frequented meeting place, to name only a few of its roles. Furthermore, the ever-growing number of users and content creators drives the rapid evolution of existing Web sites and the emergence of new ones. As a consequence, it is increasingly difficult to identify the Web sites that provide the information and services of interest.

Author information

Authors and Affiliations

Department of Computer Science, University of Leipzig, Leipzig, Germany
Christoph Lindemann & Lars Littig

Corresponding author

Correspondence to Christoph Lindemann.

Editor information

Editors and Affiliations

Text Technology/Applied Comp. Ling., Bielefeld University, Universitätsstrasse 25, Bielefeld, 33615, Germany
Alexander Mehler
LS2 9JT Leeds, United Kingdom
Serge Sharoff
Varvsgatan 25, Stockholm, 117 29, Sweden
Marina Santini

About this chapter

Cite this chapter

Lindemann, C., Littig, L. (2010). Classification of Web Sites at Super-genre Level. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_10

DOI: https://doi.org/10.1007/978-90-481-9178-9_10
Published: 16 August 2010
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9177-2
Online ISBN: 978-90-481-9178-9
eBook Packages: Computer Science, Computer Science (R0)