Exploiting Multiple Features with MEMMs for Focused Web Crawling

  • Hongyu Liu
  • Evangelos Milios
  • Larry Korba
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5039)

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. Experimental results on Web data show that focused crawling using MEMMs is, in general, very competitive against Best-First crawling in terms of two metrics: Precision and Maximum Average Similarity.
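
The abstract does not give the model's equations, so the following is only a minimal sketch of the general MEMM idea: a per-state maximum entropy (softmax) distribution over next states, conditioned on the previous state and on features of the current observation. The state set (link distance to a target), the feature names `anchor_on_topic` and `url_has_keyword`, and the weights are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

# Hypothetical states: link distance from a target page (0 means "target").
STATES = [0, 1, 2]


def features(prev_state, obs, next_state):
    """Feature functions keyed by (name, previous state, next state).

    `obs` is a dict of binary observations about a link; the two
    observation names used here are illustrative assumptions.
    """
    return {
        ("anchor_on_topic", prev_state, next_state): 1.0 if obs["anchor_on_topic"] else 0.0,
        ("url_has_keyword", prev_state, next_state): 1.0 if obs["url_has_keyword"] else 0.0,
    }


def transition_probs(weights, prev_state, obs):
    """Return P(next_state | prev_state, obs) as an exponential model.

    Each candidate next state gets a score equal to the weighted sum of
    its active features; scores are normalized with a softmax.
    """
    scores = []
    for s in STATES:
        f = features(prev_state, obs, s)
        scores.append(sum(weights.get(k, 0.0) * v for k, v in f.items()))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return {s: e / z for s, e in zip(STATES, exps)}


# Toy weights (made up): on-topic anchor text and URL keywords make a
# move from distance 1 to distance 0 more likely.
toy_weights = {
    ("anchor_on_topic", 1, 0): 2.0,
    ("url_has_keyword", 1, 0): 1.0,
}
link = {"anchor_on_topic": True, "url_has_keyword": True}
probs = transition_probs(toy_weights, 1, link)
```

Under these assumptions, a crawler could rank the links in its frontier by the probability assigned to the target state, here `probs[0]`. In a real system the weights would be learned from training sequences rather than set by hand.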

Keywords

Focused Crawling, Web Search, Feature Selection, MEMMs

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Hongyu Liu
    • 1
  • Evangelos Milios
    • 1
  • Larry Korba
    • 1
  1. National Research Council Institute for Information Technology, Canada; Faculty of Computer Science, Dalhousie University, Canada
