SmallClient for big data: an indexing framework towards fast data retrieval

Cluster Computing

Abstract

Numerous applications continuously generate massive amounts of data, and it has become critical to extract useful information while maintaining acceptable computing performance. The objective of this work is to design an indexing framework that minimizes indexing overhead and improves query execution and data search performance while balancing overall computing performance. We propose SmallClient, an indexing framework to speed up query execution. SmallClient has three modules: block creation, index creation, and query execution. The block creation module improves data retrieval performance while keeping data uploading overhead to a minimum. The index creation module allows the maximum number of indexes on a dataset, which increases the index hit ratio while minimizing indexing overhead. Finally, the query execution module lets incoming queries use these indexes. The evaluation shows that SmallClient improves search performance by more than 90% compared with a Hadoop full scan. Meanwhile, the indexing overhead of SmallClient is reduced to approximately 50% for index size and 80% for indexing time.
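The three-module structure described above can be pictured with a small toy sketch. This is not the authors' implementation: the function names (`create_blocks`, `create_index`, `indexed_query`, `full_scan`), the fixed block size, and the in-memory block-level index are all assumptions chosen only to illustrate why consulting an index lets a query skip blocks that a full scan would have to read.

```python
from collections import defaultdict


def create_blocks(records, block_size):
    """Hypothetical block creation: split records into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def create_index(blocks, field):
    """Hypothetical index creation: map each value of `field` to the ids of blocks holding it."""
    index = defaultdict(set)
    for block_id, block in enumerate(blocks):
        for record in block:
            index[record[field]].add(block_id)
    return index


def full_scan(blocks, field, value):
    """Baseline: read every record in every block."""
    return [r for block in blocks for r in block if r[field] == value]


def indexed_query(blocks, index, field, value):
    """Hypothetical query execution: read only the blocks the index points to."""
    hits = []
    for block_id in sorted(index.get(value, ())):
        hits.extend(r for r in blocks[block_id] if r[field] == value)
    return hits


# Made-up sample data, for illustration only.
cities = ["Oslo", "Lima", "Oslo", "Cairo", "Lima", "Oslo"]
records = [{"id": i, "city": city} for i, city in enumerate(cities)]
blocks = create_blocks(records, block_size=2)
index = create_index(blocks, "city")
```

With this sample data, a query for `"Cairo"` touches one block out of three, while the full scan reads all of them; both return the same records. A real system would store blocks and indexes on a distributed file system rather than in Python lists.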

Author information

Correspondence to Aisha Siddiqa.

About this article

Cite this article

Siddiqa, A., Karim, A. & Chang, V. SmallClient for big data: an indexing framework towards fast data retrieval. Cluster Comput 20, 1193–1208 (2017). https://doi.org/10.1007/s10586-016-0712-4

  • DOI: https://doi.org/10.1007/s10586-016-0712-4
