Abstract
This paper studies the efficient processing of durable top-k queries on historical time series databases. Durable top-k queries extend snapshot top-k queries over a time period. They play a key role in finding objects of durable quality and in predicting the status of these objects for successive time intervals by updating the query interval at every timestamp. Web crawling and indexing have become increasingly important, particularly for answering durable top-k queries efficiently over very large collections of web documents. Existing algorithms produce results that are of limited use to analysts. This paper focuses on web crawling, on indexing query terms under their respective categories, and on updating rank changes at every time interval. Links are crawled using the modified depth-first search (MDFS) algorithm; each linked page is accessed, and metadata such as the title, keywords, and description are extracted. To handle query indexing, novel indexing techniques are proposed to yield efficient results. The study is intended to help analysts working with large data sets obtained through crawling and indexing by reducing their workload.
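To make the query type concrete, the following is a minimal brute-force sketch of a durable top-k query. It is not the indexing technique proposed in the paper. It assumes the definition commonly used in the durable-query literature: an object qualifies if it appears in the snapshot top-k for at least a fraction `r` of the timestamps in the query interval. The function names, the threshold `r`, and the sample scores are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def snapshot_top_k(scores_at_t, k):
    """Return the ids of the k highest-scoring objects at a single timestamp."""
    # Ties are broken by object id so the result is deterministic.
    ranked = sorted(scores_at_t.items(), key=lambda kv: (-kv[1], kv[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(series, t_start, t_end, k, r):
    """Return objects in the snapshot top-k for at least a fraction r
    of the timestamps in [t_start, t_end] (assumed definition).

    series maps an object id to a list of scores, one per timestamp.
    """
    n_steps = t_end - t_start + 1
    hits = defaultdict(int)
    for t in range(t_start, t_end + 1):
        scores_at_t = {obj: values[t] for obj, values in series.items()}
        for obj in snapshot_top_k(scores_at_t, k):
            hits[obj] += 1
    return sorted(obj for obj, count in hits.items() if count >= r * n_steps)


# Hypothetical rank scores for three crawled pages over five timestamps.
series = {
    "page_a": [9, 8, 9, 7, 9],
    "page_b": [5, 9, 4, 8, 3],
    "page_c": [7, 6, 8, 6, 8],
}

strict = durable_top_k(series, 0, 4, k=2, r=0.8)
relaxed = durable_top_k(series, 0, 4, k=2, r=0.5)
print(strict, relaxed)
```

In this sample, `page_a` is in the top 2 at all five timestamps, `page_c` at three, and `page_b` at two, so raising `r` narrows the answer to the most consistently ranked page. Re-running the function with a shifted interval at each new timestamp corresponds to the interval update described above. This sketch rescans every timestamp, which is the cost an index is meant to avoid.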
Acknowledgments
Dr. Sugumaran’s research has been supported in part by a 2016 School of Business Administration Spring/Summer Research Fellowship from Oakland University.
Cite this article
Devi, R.S., Manjula, D. & Sugumaran, V. Efficient indexing structure to handle durable queries through web crawling. Cluster Comput 19, 1347–1358 (2016). https://doi.org/10.1007/s10586-016-0595-4
DOI: https://doi.org/10.1007/s10586-016-0595-4