An optimized crawling technique for maintaining fresh repositories

Multimedia Tools and Applications

Abstract

With the rapid increase in demand for digital information via the internet, it has become imperative for search engines to serve up-to-date information in response to user queries. A web crawler plays a vital role in maintaining the local cache of a search engine. Today, the biggest challenge for a web crawler is keeping the information in its local cache of n constantly changing web pages fresh. These web pages often have dynamic creation and update cycles. Moreover, the resources for downloading possible updates are limited. This problem is formulated as a non-deterministic optimization problem. Researchers have made many attempts to solve it in the past, but most existing techniques work well only for small values of n and are often intractable for a large corpus. The paper presents an optimal solution to this non-deterministic problem for large data. The experimental results show that the technique achieves promising results compared to existing crawlers.

Acknowledgements

I would like to express my sincere and deep gratitude to my Ph.D. supervisor, Dr. Ashutosh Dixit, Professor, Department of Computer Engineering, J. C. Bose University of Science & Technology, Faridabad, for his continuous guidance, constructive criticism and valuable advice. I would also like to express my gratitude to the chairperson of the Computer Engineering Department and the members of the IT cell, who helped me in obtaining the experimental results.

Author information

Corresponding author

Correspondence to Shilpa Sethi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sethi, S. An optimized crawling technique for maintaining fresh repositories. Multimed Tools Appl 80, 11049–11077 (2021). https://doi.org/10.1007/s11042-020-10250-8

  • DOI: https://doi.org/10.1007/s11042-020-10250-8
