Abstract
In recent years, numerous applications have been continuously generating large amounts of uncertain data. Advanced analysis queries such as the skyline operator are essential for extracting interesting objects from vast uncertain datasets. Recently, the MapReduce system has been widely used for big data analysis. However, because the probabilistic skyline query is not decomposable, implementing it in the MapReduce framework is not straightforward. This paper proposes an effective parallel method called parallel computation of probabilistic skyline query (PCPS) that computes the probabilistic skyline set in a single MapReduce pass. The proposed method takes into account the critical sections and detects data with a high probability of existence through a proposed smart sampling algorithm. PCPS also implements a new approach to the fair allocation of input data. The experimental results indicate that the proposed approach not only reduces the processing time of probabilistic skyline queries but also achieves fair precision across varying degrees of dimensionality.
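To make the notion of a probabilistic skyline concrete, the following is a minimal single-machine sketch, not the PCPS method itself. It assumes a tuple-level (existential) uncertainty model in which each point exists independently with a given probability, smaller attribute values are preferred, and a point's skyline probability is its own existence probability multiplied by the probability that none of its dominators exists. The names `dominates`, `skyline_probabilities`, `probabilistic_skyline`, and the `threshold` parameter are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumed representation: (attribute values, existence probability).
Point = Tuple[Sequence[float], float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probabilities(points: List[Point]) -> List[float]:
    """Compute each point's skyline probability under independent existence."""
    result = []
    for i, (attrs, prob) in enumerate(points):
        p = prob
        for j, (other_attrs, other_prob) in enumerate(points):
            # Every dominator must be absent for this point to be in the skyline.
            if i != j and dominates(other_attrs, attrs):
                p *= 1.0 - other_prob
        result.append(p)
    return result


def probabilistic_skyline(points: List[Point], threshold: float) -> List[Point]:
    """Keep the points whose skyline probability reaches the threshold."""
    probs = skyline_probabilities(points)
    return [pt for pt, p in zip(points, probs) if p >= threshold]


if __name__ == "__main__":
    data = [((1.0, 4.0), 0.8), ((2.0, 5.0), 0.9), ((3.0, 1.0), 0.5)]
    print(skyline_probabilities(data))
    print(probabilistic_skyline(data, threshold=0.4))
```

Under this model, a point's skyline probability depends on every dominator in the whole dataset, including dominators that would land in other partitions. This is one way to see why independently computed local results cannot simply be merged, and why the query is described as not decomposable.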
Cite this article
Gavagsaz, E. Parallel computation of probabilistic skyline queries using MapReduce. J Supercomput 77, 418–444 (2021). https://doi.org/10.1007/s11227-020-03279-x
DOI: https://doi.org/10.1007/s11227-020-03279-x