Abstract
Numerous applications continuously generate massive amounts of data, and it has become critical to extract useful information while maintaining acceptable computing performance. The objective of this work is to design an indexing framework that minimizes indexing overhead and improves query execution and data search performance with optimal overall computing performance. We propose SmallClient, an indexing framework to speed up query execution. SmallClient has three modules: block creation, index creation, and query execution. The block creation module improves data retrieval performance with minimal data uploading overhead. The index creation module allows the maximum number of indexes on a dataset, increasing the index hit ratio while minimizing indexing overhead. Finally, the query execution module lets incoming queries use these indexes. The evaluation shows that SmallClient outperforms a Hadoop full scan, improving search performance by more than 90%. Meanwhile, the indexing overhead of SmallClient is reduced to approximately 50% and 80% for index size and indexing time, respectively.
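The three-module structure named in the abstract can be illustrated with a toy, in-memory sketch. This is not the SmallClient implementation, which the abstract does not detail. The function names (`create_blocks`, `create_index`, `query`), the fixed block size, and the value-to-block-id index layout are all assumptions chosen for illustration. The sketch only shows the general idea that an index lets a query read fewer blocks than a full scan.

```python
from collections import defaultdict


def create_blocks(records, block_size):
    """Hypothetical block creation: split records into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def create_index(blocks, field):
    """Hypothetical index creation: map each value of `field` to the block ids holding it."""
    index = defaultdict(set)
    for block_id, block in enumerate(blocks):
        for record in block:
            index[record[field]].add(block_id)
    return index


def query(blocks, indexes, field, value):
    """Hypothetical query execution: use an index on `field` if present, else full scan."""
    if field in indexes:
        block_ids = sorted(indexes[field].get(value, ()))
    else:
        block_ids = list(range(len(blocks)))  # no index: every block must be read
    hits = [r for b in block_ids for r in blocks[b] if r[field] == value]
    return hits, len(block_ids)  # matching records and number of blocks read


# Illustrative data: 12 records, 3 consecutive records per user.
records = [{"id": i, "user": f"u{i // 3}"} for i in range(12)]
blocks = create_blocks(records, block_size=4)
indexes = {"user": create_index(blocks, "user")}

indexed_hits, indexed_reads = query(blocks, indexes, "user", "u3")
scan_hits, scan_reads = query(blocks, indexes, "id", 5)
print(indexed_reads, scan_reads)
```

In this toy setup the indexed query on `user` reads a single block, while the query on the unindexed `id` field falls back to reading all three. Adding more indexes raises the chance that a query finds a usable index, at the cost of extra index size and build time, which is the trade-off the abstract describes.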
Cite this article
Siddiqa, A., Karim, A. & Chang, V. SmallClient for big data: an indexing framework towards fast data retrieval. Cluster Comput 20, 1193–1208 (2017). https://doi.org/10.1007/s10586-016-0712-4
DOI: https://doi.org/10.1007/s10586-016-0712-4