Knowledge and Information Systems, Volume 25, Issue 2, pp 303–326

xCrawl: a high-recall crawling method for Web mining

  • Kostyantyn Shchekotykhin
  • Dietmar Jannach
  • Gerhard Friedrich
Regular Paper

Abstract

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
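
The recall measure described above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the crawl log layout, and the example URLs are hypothetical, and it assumes the full set of relevant documents is known in advance, which in practice is only the case for a controlled test collection.

```python
from typing import Iterable, Tuple


def recall(found: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of all existing relevant documents that were found."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("recall is undefined when no relevant documents exist")
    return len(set(found) & relevant_set) / len(relevant_set)


def recall_within(
    events: Iterable[Tuple[float, str]],
    relevant: Iterable[str],
    time_budget: float,
) -> float:
    """Recall reached by the end of a time budget.

    `events` holds (seconds_since_start, url) pairs, one for each page the
    crawler identified as relevant (hypothetical log format).
    """
    found = [url for seconds, url in events if seconds <= time_budget]
    return recall(found, relevant)


# Hypothetical ground truth and crawl log, for illustration only.
relevant_docs = {"a.html", "b.html", "c.html", "d.html"}
crawl_log = [(5, "a.html"), (12, "x.html"), (20, "b.html"), (45, "c.html")]

print(recall_within(crawl_log, relevant_docs, 30))  # 0.5
print(recall_within(crawl_log, relevant_docs, 60))  # 0.75
```

Evaluating recall at several time budgets, as in the last two lines, mirrors the abstract's framing of recall achievable within a given period of time: a crawler that reaches the same final recall sooner is preferable.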

Keywords

  • Web mining
  • Information retrieval
  • Web crawling
  • Information extraction

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Kostyantyn Shchekotykhin (1)
  • Dietmar Jannach (2)
  • Gerhard Friedrich (1)

  1. University Klagenfurt, Klagenfurt, Austria
  2. Technische Universität Dortmund, Dortmund, Germany
