Focused Crawling: An Approach for URL Queue Optimization Using Link Score

  • Sunita Rawat
Chapter
Part of the Signals and Communication Technology book series (SCT)

Abstract

The rapid expansion of the World Wide Web poses exceptional scaling challenges for traditional crawlers and search engines. Web crawlers continuously crawl the Web to locate pages that have been added or removed. Because of the dynamic and growing nature of the Web, it is difficult to deal with irrelevant pages and to predict which links lead to high-quality pages. Since a crawler is just a computer program, it cannot by itself decide how relevant a Web page is. In this paper, a method of efficient focused crawling is implemented to enhance the quality of Web navigation. We compute the score of each unvisited URL from several factors, such as the relevancy of its anchor text and its description in the Google search engine, and we compute the similarity of that description to the given query or topic keywords. The relevancy score is calculated using the vector space model (VSM). The URL queue is optimized on the basis of duplicate links and content similarity.
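
As a rough illustration of the scoring and queueing ideas in the abstract, the following minimal Python sketch computes a VSM-style cosine similarity between topic keywords and a link's anchor text and description, then keeps unvisited URLs in a priority queue that skips duplicate links. It is not the chapter's implementation: the names `term_vector`, `link_score`, and `UrlQueue`, the equal weights, the use of raw term frequencies, and the crude URL normalization are all assumptions made for illustration, and the content-similarity check mentioned in the abstract is not covered.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on alphanumerics, and count raw term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_score(topic, anchor_text, description, w_anchor=0.5, w_desc=0.5):
    """Hypothetical link score: weighted sum of two similarities.

    The weights are illustrative assumptions, not values from the chapter.
    """
    query = term_vector(topic)
    return (w_anchor * cosine_similarity(query, term_vector(anchor_text))
            + w_desc * cosine_similarity(query, term_vector(description)))


class UrlQueue:
    """Max-priority queue of unvisited URLs that ignores duplicate links."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Crude normalization (an assumption): drop the fragment and any
        # trailing slash so trivially different spellings count as one link.
        key = url.split("#", 1)[0].rstrip("/")
        if key in self._seen:
            return False
        self._seen.add(key)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


if __name__ == "__main__":
    topic = "focused web crawler"
    queue = UrlQueue()
    queue.push("http://example.org/a",
               link_score(topic, "focused crawler tutorial",
                          "how a focused web crawler ranks links"))
    queue.push("http://example.org/b",
               link_score(topic, "cooking recipes", "pasta and sauces"))
    queue.push("http://example.org/a/", 1.0)  # duplicate link, ignored
    print(queue.pop())
```

Popping from the queue returns the highest-scoring unvisited URL first, which is the behavior a focused crawler needs when deciding which link to fetch next.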

Keywords

Focused crawler, Search engine, Weight table, Queue optimization

Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of Computer Engineering, RCPIT, Dhule, India
