Abstract
With the rapid increase in demand for digital information over the internet, it has become imperative for search engines to serve up-to-date information in response to user queries. A web crawler plays a vital role in maintaining the local cache of a search engine. Today, the biggest challenge for a web crawler is keeping its local cache of n constantly changing web pages fresh. These web pages often follow dynamic creation and update cycles. Moreover, the resources available for downloading possible updates are limited. This problem is formulated as a non-deterministic optimization problem. Researchers have made many attempts to solve it, but most existing techniques work well only for small values of n and often become intractable for a large corpus. This paper presents an optimal solution to this non-deterministic problem for large data. The experimental results show that the technique achieves promising results compared to an existing crawler.
Acknowledgements
I would like to express my sincere and deep gratitude to my Ph.D. supervisor, Dr. Ashutosh Dixit, Professor, Department of Computer Engineering, J. C. Bose University of Science & Technology, Faridabad, for his continuous guidance, constructive criticism, and valuable advice. I would also like to thank the chairperson of the Computer Engineering department and the members of the IT cell, who helped me obtain the experimental results.
Cite this article
Sethi, S. An optimized crawling technique for maintaining fresh repositories. Multimed Tools Appl 80, 11049–11077 (2021). https://doi.org/10.1007/s11042-020-10250-8
DOI: https://doi.org/10.1007/s11042-020-10250-8