Exploiting Multiple Features with MEMMs for Focused Web Crawling

  • Hongyu Liu
  • Evangelos Milios
  • Larry Korba
Conference paper

DOI: 10.1007/978-3-540-69858-6_11

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5039)
Cite this paper as:
Liu H., Milios E., Korba L. (2008) Exploiting Multiple Features with MEMMs for Focused Web Crawling. In: Kapetanios E., Sugumaran V., Spiliopoulou M. (eds) Natural Language and Information Systems. NLDB 2008. Lecture Notes in Computer Science, vol 5039. Springer, Berlin, Heidelberg

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. Experimental results on Web data show that focused crawling with MEMMs is, in general, very competitive against Best-First crawling in terms of two metrics: Precision and Maximum Average Similarity.
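
As a rough illustration of the link-prioritisation step the abstract describes, the sketch below orders a crawl frontier by a weighted sum of two features taken from each link: topic-term overlap in the anchor text and in the URL. It is not the authors' MEMM. The weights, feature names, topic terms, example URLs, and the helpers `link_features`, `score`, and `Frontier` are all hypothetical, and a real MEMM would learn its feature weights from training crawls and model the sequence of pages leading to a target.

```python
import heapq
import re

# Hypothetical hand-set weights; a trained model would learn these.
WEIGHTS = {"anchor_overlap": 2.0, "url_overlap": 1.0}


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_features(anchor_text, url, topic_terms):
    """Count topic terms appearing in the anchor text and in the URL."""
    topic = set(topic_terms)
    return {
        "anchor_overlap": len(tokens(anchor_text) & topic),
        "url_overlap": len(tokens(url) & topic),
    }


def score(features):
    """Weighted sum of feature values (a stand-in for a learned model)."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


class Frontier:
    """Priority queue of unvisited URLs; highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, priority):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Illustrative links only; example.org pages are placeholders.
topic = ["focused", "web", "crawler"]
links = [
    ("Contact us", "http://example.org/contact"),
    ("Focused web crawler survey", "http://example.org/papers/crawler.html"),
    ("Web search", "http://example.org/search"),
]

frontier = Frontier()
for anchor, url in links:
    frontier.push(url, score(link_features(anchor, url, topic)))

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

With these placeholder inputs the survey link is popped first and the contact page last, which is the greedy ordering a Best-First crawler follows; a sequential model would additionally condition each decision on the pages visited so far.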

Keywords

Focused Crawling, Web Search, Feature Selection, MEMMs

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Hongyu Liu
    • 1
  • Evangelos Milios
    • 1
  • Larry Korba
    • 1
  1. National Research Council Institute for Information Technology, Canada; Faculty of Computer Science, Dalhousie University, Canada
