An optimized crawling technique for maintaining fresh repositories

Sethi, Shilpa

doi:10.1007/s11042-020-10250-8

Published: 03 January 2021

Volume 80, pages 11049–11077, (2021)

Multimedia Tools and Applications

Shilpa Sethi ORCID: orcid.org/0000-0001-9202-4234¹

Abstract

With the rapid increase in demand for digital information via the internet, it has become imperative for search engines to serve up-to-date information in response to user queries. A web crawler plays a vital role in maintaining the local cache of a search engine. Today, the biggest challenge for a web crawler is how to keep fresh information in its local cache of n constantly changing web pages. These web pages often have dynamic creation and update cycles. Moreover, the resources for downloading possible updates are limited. This problem is formulated as a non-deterministic optimization problem. Researchers have made many attempts to solve it, but most existing techniques work well only for small values of n and are often intractable for a large corpus. The paper presents an optimal solution to this non-deterministic problem for large data. The experimental results show that the technique achieves promising results compared with existing crawlers.
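To make the problem statement concrete, the following is a minimal sketch of the freshness trade-off, not the technique proposed in the paper. It assumes a textbook model from the crawling literature in which each page changes as a Poisson process with a known rate and is re-downloaded at evenly spaced intervals. The function names `expected_freshness` and `average_freshness`, the example change rates, and the budget are all hypothetical.

```python
import math


def expected_freshness(change_rate: float, refresh_rate: float) -> float:
    """Time-averaged probability that the cached copy of a page is current.

    Assumption (not from the paper): the page changes as a Poisson process
    with `change_rate`, and the crawler re-downloads it at evenly spaced
    intervals, `refresh_rate` times per unit time.
    """
    if change_rate <= 0:
        return 1.0  # a page that never changes stays fresh
    if refresh_rate <= 0:
        return 0.0  # a changing page that is never refreshed goes stale
    ratio = change_rate / refresh_rate
    return (1.0 - math.exp(-ratio)) / ratio


def average_freshness(change_rates, refresh_rates):
    """Mean expected freshness over all pages in the local cache."""
    pairs = list(zip(change_rates, refresh_rates))
    return sum(expected_freshness(c, f) for c, f in pairs) / len(pairs)


# Hypothetical corpus: three pages with very different change rates,
# and a budget of three downloads per unit time shared among them.
change_rates = [0.1, 1.0, 10.0]
budget = 3.0

# Policy A: refresh every page equally often.
uniform = [budget / len(change_rates)] * len(change_rates)

# Policy B: refresh each page in proportion to how often it changes.
total_change = sum(change_rates)
proportional = [budget * c / total_change for c in change_rates]

print(f"uniform:      {average_freshness(change_rates, uniform):.3f}")
print(f"proportional: {average_freshness(change_rates, proportional):.3f}")
```

Under these assumptions the uniform schedule yields a higher average freshness than the proportional one, which illustrates why allocating a limited download budget is an optimization problem rather than a matter of always chasing the fastest-changing pages.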

Acknowledgements

I would like to express my sincere and deep gratitude to my Ph.D. supervisor, Dr. Ashutosh Dixit, Professor, Department of Computer Engineering, J. C. Bose University of Science & Technology, Faridabad, for his continuous guidance, constructive criticism and valuable advice. I would also like to express my gratitude to the chairperson of the Computer Engineering department and the members of the IT cell, who helped me in obtaining the experimental results.

Author information

Authors and Affiliations

Department of Computer Applications, J. C. Bose University of Science and Technology, Faridabad, Haryana, India
Shilpa Sethi

Corresponding author

Correspondence to Shilpa Sethi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sethi, S. An optimized crawling technique for maintaining fresh repositories. Multimed Tools Appl 80, 11049–11077 (2021). https://doi.org/10.1007/s11042-020-10250-8

Received: 05 March 2020
Revised: 18 October 2020
Accepted: 09 December 2020
Published: 03 January 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s11042-020-10250-8
