
Parallel computation of probabilistic skyline queries using MapReduce

The Journal of Supercomputing

Abstract

In recent years, numerous applications have continuously generated large amounts of uncertain data. Advanced analysis queries such as skyline operators are essential for extracting interesting objects from vast uncertain datasets. Recently, the MapReduce system has been widely used for big data analysis. Because the probabilistic skyline query is not decomposable, however, implementing it directly in the MapReduce framework is not straightforward. This paper proposes an effective parallel method called parallel computation of probabilistic skyline query (PCPS) that can compute the probabilistic skyline set in a single MapReduce pass. The proposed method takes the critical sections into account and detects data with a high probability of existence through a proposed smart sampling algorithm. PCPS also implements a new approach to the fair allocation of input data. The experimental results indicate that the proposed approach not only reduces the processing time of probabilistic skyline queries but also achieves fair precision across varying dimensionality.
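To make the notion of a probabilistic skyline concrete, the following is a minimal sequential sketch, not the PCPS method itself. It assumes a common tuple-level uncertainty model that the abstract does not spell out: each point has an independent existence probability, smaller coordinate values are preferred, and a point belongs to the result when its skyline probability reaches a user-chosen threshold. The names `UncertainPoint`, `skyline_probability`, and `probabilistic_skyline` are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import List, NamedTuple, Sequence, Tuple


class UncertainPoint(NamedTuple):
    coords: Tuple[float, ...]  # assumption: smaller is better in every dimension
    prob: float                # assumption: independent existence probability


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(p: UncertainPoint, data: Sequence[UncertainPoint]) -> float:
    """Probability that p exists and that no point dominating p exists."""
    return p.prob * prod(
        1.0 - q.prob for q in data if dominates(q.coords, p.coords)
    )


def probabilistic_skyline(
    data: Sequence[UncertainPoint], threshold: float
) -> List[UncertainPoint]:
    """Points whose skyline probability is at least the threshold."""
    return [p for p in data if skyline_probability(p, data) >= threshold]


a = UncertainPoint((1.0, 4.0), 0.5)
b = UncertainPoint((2.0, 2.0), 0.8)
c = UncertainPoint((3.0, 3.0), 0.9)
data = [a, b, c]
result = probabilistic_skyline(data, threshold=0.3)
```

In this toy dataset only `b` dominates `c`, so the skyline probability of `c` is about 0.9 × (1 − 0.8) = 0.18 and `c` falls below the 0.3 threshold. If `c` were evaluated alone in its own partition, its local skyline probability would be 0.9. This illustrates why the query is not decomposable: a partition cannot decide membership without information about dominating points held elsewhere.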



Author information

Corresponding author

Correspondence to Elaheh Gavagsaz.


About this article


Cite this article

Gavagsaz, E. Parallel computation of probabilistic skyline queries using MapReduce. J Supercomput 77, 418–444 (2021). https://doi.org/10.1007/s11227-020-03279-x


  • DOI: https://doi.org/10.1007/s11227-020-03279-x
