Efficient indexing structure to handle durable queries through web crawling

Cluster Computing

Abstract

This paper studies the efficient processing of durable top-k queries on historical time series databases. Durable top-k queries extend snapshot top-k queries over a time period. They play a key role in finding objects of durable quality and, by updating the query interval at every timestamp, in predicting the status of those objects over successive time intervals. Web crawling and indexing have become increasingly important, particularly for answering durable top-k queries efficiently over very large collections of web documents. Existing algorithms produce results that are of limited use to analysts. This paper focuses on crawling the web, indexing query terms under their respective categories, and updating rank changes at every time interval. Links are crawled using the modified depth-first search (MDFS) algorithm; each page is accessed and metadata such as its title, keywords, and description is extracted. To handle query indexing, novel indexing techniques are proposed to yield efficient results. The study is intended to reduce the workload of analysts who work with the large volumes of data produced by crawling and indexing.
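
To make the notion of a durable top-k query concrete, the following is a minimal brute-force sketch, not the indexing technique proposed in the paper. It assumes the common formulation in which an object is durable if it appears in the snapshot top-k at no fewer than a given fraction of the timestamps in the query interval. The function names, the `ratio` parameter, and the sample scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def snapshot_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring objects at a single timestamp."""
    # Ties are broken by object name so the result is deterministic.
    return sorted(scores, key=lambda obj: (-scores[obj], obj))[:k]


def durable_top_k(series: Dict[str, List[float]], k: int,
                  start: int, end: int, ratio: float) -> List[str]:
    """Objects in the snapshot top-k for at least `ratio` of [start, end].

    Assumed definition (not taken from the abstract): an object is durable
    if it is in the top-k at no fewer than ratio * (end - start + 1)
    timestamps of the inclusive interval.
    """
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    length = end - start + 1
    hits = {obj: 0 for obj in series}
    for t in range(start, end + 1):
        # Build the snapshot at timestamp t and count each top-k appearance.
        snapshot = {obj: values[t] for obj, values in series.items()}
        for obj in snapshot_top_k(snapshot, k):
            hits[obj] += 1
    return sorted(obj for obj, count in hits.items() if count / length >= ratio)


# Hypothetical rank scores for three crawled pages over five timestamps.
series = {
    "page_a": [9, 8, 9, 7, 9],
    "page_b": [8, 9, 3, 9, 2],
    "page_c": [1, 2, 8, 8, 8],
}

# Pages that stay in the top 2 for at least 80% of timestamps 0..4.
durable = durable_top_k(series, k=2, start=0, end=4, ratio=0.8)
```

This scan recomputes every snapshot and costs time proportional to the interval length times the number of objects, which is the kind of cost an index over rank changes is meant to avoid.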

Acknowledgments

Dr. Sugumaran’s research has been supported in part by a 2016 School of Business Administration Spring/Summer Research Fellowship from Oakland University.

Author information

Corresponding author

Correspondence to R. Suganya Devi.

About this article

Cite this article

Devi, R.S., Manjula, D. & Sugumaran, V. Efficient indexing structure to handle durable queries through web crawling. Cluster Comput 19, 1347–1358 (2016). https://doi.org/10.1007/s10586-016-0595-4

  • DOI: https://doi.org/10.1007/s10586-016-0595-4
