A path-based approach for web page retrieval

World Wide Web

Abstract

The use of links to enhance page ranking has been widely studied. The underlying assumption is that links convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because most links within a website serve to organize information and convey no recommendation. This paper distinguishes these two kinds of links, those for recommendation and those for information organization, and describes a path-based method for web page ranking. We define the Hierarchical Navigation Path (HNP) as a new resource for improving web search. An HNP consists of the multi-step navigation information in visitors’ website browsing, and it provides indications of the content of the destination page. We first classify the links inside a website. The links used for page organization are then exploited to construct the HNPs for each page. Finally, we describe the PathRank algorithm for web page retrieval. The experiments show that our approach yields significant improvements over existing solutions.
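The idea of building navigation paths from organizational links can be illustrated with a minimal sketch. It is not the paper's PathRank algorithm, whose details are not given in the abstract. It assumes that links have already been classified, that a path is the sequence of anchor texts on the shortest route from the site root, and that a path is matched against a query by simple term overlap. The names `build_navigation_paths`, `path_match`, and the toy site are hypothetical.

```python
from collections import deque


def build_navigation_paths(root, org_links):
    """Return {page: [anchor, ...]} for pages reachable from root.

    org_links maps a page to a list of (anchor_text, child_page) pairs and is
    assumed to contain only links already classified as organizational.
    """
    paths = {root: []}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for anchor, child in org_links.get(page, []):
            if child not in paths:  # keep the first (shortest) path found
                paths[child] = paths[page] + [anchor]
                queue.append(child)
    return paths


def path_match(query_terms, path):
    """Fraction of query terms that occur in the anchors along a path."""
    if not query_terms:
        return 0.0
    path_terms = {word.lower() for anchor in path for word in anchor.split()}
    hits = sum(1 for term in query_terms if term.lower() in path_terms)
    return hits / len(query_terms)


# Hypothetical toy site; URLs and anchor texts are invented for illustration.
site = {
    "/": [("Products", "/products"), ("Support", "/support")],
    "/products": [("Laptops", "/products/laptops")],
    "/support": [("Laptops", "/products/laptops")],
}
paths = build_navigation_paths("/", site)
```

In this toy site, the path to `/products/laptops` is `["Products", "Laptops"]`, so a query containing "products" matches the page through its path even if the word never appears in the page body. A real ranking function would presumably combine such a path signal with a content score; how PathRank does this is described in the full article, not here.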

References

  1. Amento, B., Terveen, L., Hill, W.: Does authority mean quality? Predicting expert quality ratings of web documents. Proc. of SIGIR (2000)

  2. Asadi, S., Zhou, X., Yang, G.: Using local popularity of web resources for geo-ranking of search engine results. World Wide Web Internet Web Inf. Syst. 12(2), 149–170 (2009)

  3. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley (1999)

  4. Bao, S., Wu, X., Fei, B., Xue, G., Su, Z., Yu, Y.: Optimizing web search using social annotation. Proc. of WWW (2007)

  5. Borges, J., Levene, M.: Ranking pages by topology and popularity within web sites. World Wide Web Internet Web Inf. Syst. 9(3), 301–316 (2006)

  6. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)

  7. Broder, A., Kumar, R., Maghoul, F. et al.: Graph structure in the web. Proc. of WWW (2000)

  8. Cai, D., He, X., Wen, J.-R., Ma, W.-Y.: Block-level link analysis. Proc. of SIGIR, pp. 440–447 (2004)

  9. Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: VIPS: A vision based page segmentation algorithm. Microsoft Technical Report MSR-TR-2003-79 (2003)

  10. Chen, M., Hearst, M. et al.: Cha-Cha: A system for organizing intranet search results. Proc. of USENIX USITS (1999)

  11. Chen, Z., Liu, S., Liu, W., Pu, G., Ma, W.Y.: Building a web thesaurus from web link structure. Proc. of SIGIR03 (2003)

  12. Chen, J.L., Zhou, B.Y., Shi, J., Zhang, H.J., Wu, Q.F.: Function-based object model towards website adaptation. Proc. of WWW01 (2001)

  13. Chi, E.H. et al.: Using information scent to model user information needs and actions on the web. Proc. of SIGCHI (2001)

  14. Delicious: http://del.icio.us

  15. Dyreson, C.E.: A jumping spider: Restructuring the WWW graph to index concepts that span pages. Proc. of WWW. (1998)

  16. Eiron, N., McCurley, K.S.: Analysis of anchor text for web search. Proc. SIGIR 2003, 459–460 (2003)

  17. Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J.A., Williamson, D.P.: Searching the workplace web. Proc. of WWW03, pp. 366–375 (2003)

  18. Glover, E.J. et al.: Improving category specific web search by learning query modifications. Symposium on Applications and the Internet, pp. 23–32 (2001)

  19. Glover, E.J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D.M., Flake, G.W.: Using web structure for classifying and describing web pages. Proc. of WWW02, pp. 562–569 (2002)

  20. Hagen, P., Manning, H., Paul, Y.: Must search stink? The Forrester report, Forrester, June (2000)

  21. Han, S.K., Shin, D., Jung, J.-Y., Park, J.: Exploring the relationship between keywords and feed elements in blog post search. World Wide Web Internet Web Inf. Syst. 12(4), 381–398 (2009)

  22. Haveliwala, T.H.: Topic-sensitive PageRank. Proc. of WWW02, pp. 517–526 (2002)

  23. Hawking, D., Voorhees, E., Bailey, P., Craswell, N.: Overview of TREC-8 web track. Proc. of TREC-8, pp. 131–150 (1999)

  24. Hawking, D., Craswell, N.: Overview of the TREC 2001 Web Track, in TREC01 (2001)

  25. Hawking, D.: Overview of the TREC-9 Web Track, in TREC02 (2000)

  26. Henzinger, M.: Link analysis on the world wide web. Proc. of ACM Hypertext, pp. 1–3 (2005)

  27. Hu, Y., Xin, G., Song, R., Hu, G. et al.: Title extraction from bodies of HTML documents and its application to web page retrieval. Proc. of SIGIR05, pp. 250–257 (2005)

  28. Jeh, G., Widom, J.: Scaling personalized web search. Proc. of WWW03, pp. 271–279 (2003)

  29. Kleinberg, J.: Authoritative sources in a linked environment. J. ACM 46(5), 604–622 (1999)

  30. Li, L., Otsuka, S., Kitsuregawa, M.: Finding related search engine queries by web community based query enrichment. World Wide Web Internet Web Inf. Syst. 13(1–2), 121–142 (2009)

  31. Li, J.Q., Zhao, Y.: PathRank: Web page retrieval with navigation path. Proc. of ECIR09, pp. 350–361 (2009)

  32. Lin, S.-H., Ho, J.-M.: Discovering informative content blocks from web documents. Proc. of SIGKDD (2002)

  33. Matsuda, K., Fukushima, T.: Task-oriented world wide web retrieval by document type classification. Proc. of CIKM1999, pp. 109–113 (1999)

  34. Mizuuchi, Y., Tajima, K.: Finding context paths for Web pages. Proc. of ACM Hypertext, pp. 13–22 (1999)

  35. Najork, M., Wiener, J.: Breadth-first search crawling yields high-quality pages, Proc. of WWW2000. pp. 114–118 (2000)

  36. Nie, Z., Zhang, Y., Wen, J., Ma, W.-Y.: Object-level ranking: bringing order to Web objects. Proc. of WWW05. 567–574 (2005)

  37. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web, Technical Report, Stanford University (1998)

  38. Pandit, S., Olston, C.: Navigation-aided retrieval. Proc. of WWW2007, pp. 391–400 (2007)

  39. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gull, A., Lau, M.: Okapi at TREC. In: Text REtrieval Conference (1992)

  40. Shen, D., Sun, J.-T., Yang, Q., Chen, Z.: A comparison of implicit and explicit links for web page classification. Proc. of WWW. 643–650 (2006)

  41. Soboroff, I.: Do TREC web collections look like the web? SIGIR Forum 36, 23–31 (2002)

  42. Vaughan, L., Thelwall, M.: Scholarly use of the Web: what are the key inducers of links to journal web sites? J. Am. Soc. Inf. Sci. Technol. 54(1), 29–38 (2003)

  43. WT10G, http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html

  44. Xue, G., Zeng, H., Chen, Z., Ma, W. et al.: Implicit link analysis to small web search. Proc. of SIGIR03, pp. 56–63 (2003)

  45. Yu, W., Zhang, W., Lin, X., Zhang, Q., Le, J.: A space and time efficient algorithm for SimRank computation. World Wide Web Internet Web Inf. Syst. (2010)

  46. Zakos, J., Verma, B.: A novel context-based technique for web information retrieval. World Wide Web Internet Web Inf. Syst. 9(4), 485–503 (2006)

Author information

Corresponding author

Correspondence to Jian-Qiang Li.

Additional information

A preliminary version of this paper appeared in [31]. This version includes a more complete and formal description of the algorithms and experiments.

About this article

Cite this article

Li, JQ., Zhao, Y. & Garcia-Molina, H. A path-based approach for web page retrieval. World Wide Web 15, 257–283 (2012). https://doi.org/10.1007/s11280-011-0133-5

  • DOI: https://doi.org/10.1007/s11280-011-0133-5

Keywords

Navigation