
Atrak: a MapReduce-based data warehouse for big data

The Journal of Supercomputing

Abstract

As warehouse data volumes expand, single-node solutions can no longer analyze the immense volume of data, so shared-nothing architectures such as MapReduce become necessary. However, segmenting data across MapReduce nodes creates node connectivity problems, network congestion, poor use of node memory, and inefficient use of processing power. In addition, dimensions and measures cannot be changed without rewriting previously stored data, and large dimensions are difficult to manage. In this paper, a method called Atrak is proposed that uses a unified data format to make mapper nodes independent and thereby address the data-management problems mentioned above. The proposed method can be applied to star-schema data warehouse models with distributive measures. Atrak increases query execution speed through node independence and proper use of MapReduce. The proposed method was compared with established systems such as Hive, Spark-SQL, HadoopDB and Flink, and simulation results confirm its improved query execution speed. Data unification in MapReduce can also be applied to other fields, such as data mining and graph processing.
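
To make the idea concrete, the sketch below is a minimal Python illustration inferred from the abstract, not the paper's implementation: the record fields (region, year, sales) and function names are hypothetical. It shows how denormalizing dimension attributes into each fact row (a "unified" record) lets a mapper group and aggregate a distributive measure such as SUM locally, so no cross-node join is needed and partial results combine freely in the reduce step.

  # Minimal sketch (not the authors' implementation): each fact row is denormalized
  # into a single unified record that already carries its dimension attributes,
  # so a mapper can emit (group-key, measure) pairs without joining across nodes.
  from collections import defaultdict

  # Hypothetical unified records: dimension attributes plus a distributive measure.
  unified_records = [
      {"region": "EMEA", "year": 2016, "sales": 120.0},
      {"region": "EMEA", "year": 2016, "sales": 80.0},
      {"region": "APAC", "year": 2016, "sales": 200.0},
  ]

  def map_phase(records, group_by, measure):
      # Each mapper works only on its local unified records; no inter-node lookup.
      for rec in records:
          key = tuple(rec[attr] for attr in group_by)
          yield key, rec[measure]

  def reduce_phase(pairs):
      # SUM is distributive, so partial sums from independent mappers combine freely.
      totals = defaultdict(float)
      for key, value in pairs:
          totals[key] += value
      return dict(totals)

  if __name__ == "__main__":
      pairs = map_phase(unified_records, group_by=("region", "year"), measure="sales")
      print(reduce_phase(pairs))  # {('EMEA', 2016): 200.0, ('APAC', 2016): 200.0}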


Notes

  1. Mohammad Hossein Barkhordari Query Language.

References

  1. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  2. Krishnan K (2013) Data warehousing in the age of big data. Newnes, p 23

  3. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam, p 51

  4. Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in MapReduce. VLDB J 23(3):355–380

  5. Eltabakh MY et al (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. Proc VLDB Endow 4(9):575–585

  6. Lin Y et al (2011) Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM

  7. Chen S (2010) Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc VLDB Endow 3(1–2):1459–1468

  8. He Y et al (2011) RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering (ICDE). IEEE

  9. Floratou A et al (2011) Column-oriented storage techniques for MapReduce. Proc VLDB Endow 4(7):419–429

  10. Nykiel T et al (2010) MRShare: sharing across multiple queries in MapReduce. Proc VLDB Endow 3(1–2):494–505

  11. Elghandour I, Aboulnaga A (2012) ReStore: reusing results of MapReduce jobs. Proc VLDB Endow 5(6):586–597

  12. Olston C et al (2008) Pig latin: a not so foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM

  13. Dittrich J et al (2010) Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proc VLDB Endow 3(1–2):515–529

  14. Dittrich J et al (2012) Only aggressive elephants are fast elephants. Proc VLDB Endow 5(11):1591–1602

  15. Abouzeid A et al (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc VLDB Endow 2(1):922–933

  16. Vernica R et al (2012) Adaptive MapReduce using situation aware mappers. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM

  17. Kaldewey T, Shekita EJ, Tata S (2012) Clydesdale: structured data processing on MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM

  18. Thusoo A et al (2009) Hive: a warehousing solution over a MapReduce framework. Proc VLDB Endow 2(2):1626–1629

  19. Engle C et al (2012) Shark: fast data analysis using coarse-grained distributed memory. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, pp 689–692

  20. Armbrust M et al (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM

  21. Zaharia M et al (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol 10

  22. Carbone P et al (2015) Apache Flink: stream and batch processing in a single engine. Bull IEEE Comput Soc Tech Comm Data Eng 38(4):28–38

  23. http://redis.io/

  24. http://www.postgresql.org

  25. http://www.tpc.org/tpcds/

  26. http://www.ubuntu.com/download/server

  27. http://hadoop.apache.org/

  28. https://hive.apache.org/downloads.html

  29. http://spark.apache.org/

  30. https://sourceforge.net/projects/hadoopdb/

  31. http://flink.apache.org/

  32. Barkhordari M, Niamanesh M (2017) Aras: a method with uniform distributed dataset to solve data warehouse problems for big data. Int J Distrib Syst Technol (IJDST) 8(2):47–60

  33. Barkhordari M, Niamanesh M (2017) ScaDiGraph: a MapReduce-based method for solving graph problems. J Inf Sci Eng 33(1)

  34. Barkhordari M, Niamanesh M (2014) ScadiBino: an effective MapReduce-based association rule mining method. In: Proceedings of the 16th International Conference on Electronic Commerce. ACM

  35. Barkhordari M, Niamanesh M (2015) ScaDiPaSi: an effective scalable and distributable MapReduce-based method to find patient similarity on huge healthcare networks. Big Data Res 2(1):19–27

Author information

Correspondence to Mohammadhossein Barkhordari.

About this article

Cite this article

Barkhordari, M., Niamanesh, M. Atrak: a MapReduce-based data warehouse for big data. J Supercomput 73, 4596–4610 (2017). https://doi.org/10.1007/s11227-017-2037-3
