Abstract
This work aims to reduce queries on big data to computations on small data, and hence make querying big data possible under bounded resources. A query Q is boundedly evaluable if, when it is posed on any big dataset \({\cal D}\), there exists a fraction \({{\cal D}_Q}\) of \({\cal D}\) such that \(Q({\cal D}) = Q({{\cal D}_Q})\) and the cost of identifying \({{\cal D}_Q}\) is independent of the size of \({\cal D}\). It has been shown that with an auxiliary structure known as an access schema, many queries in relational algebra (RA) are boundedly evaluable under the set semantics of RA. This paper extends the theory of bounded evaluation to RAaggr, i.e., RA extended with aggregation, under the bag semantics. (1) We extend access schema to bag access schema, to help us identify \({{\cal D}_Q}\) for RAaggr queries Q. (2) While it is undecidable to determine whether an RAaggr query is boundedly evaluable under a bag access schema, we identify special cases that are decidable and practical. (3) In addition, we develop an effective syntax for bounded RAaggr queries, i.e., a core subclass of boundedly evaluable RAaggr queries that does not sacrifice their expressive power. (4) Based on the effective syntax, we provide efficient algorithms to check the bounded evaluability of RAaggr queries and to generate query plans for bounded RAaggr queries. (5) As a proof of concept, we extend PostgreSQL to support bounded evaluation. We experimentally verify that the extended system improves performance by orders of magnitude.
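The idea of answering an aggregate query from a small fraction \({{\cal D}_Q}\) can be illustrated with a toy sketch. It is not the paper's algorithm or its bag access schema. It assumes a simplified access constraint of the form "each customer value has at most `bound` matching tuples, and an index retrieves them". The class `BoundedIndex`, the `orders` relation and its attributes are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class BoundedIndex:
    """Hypothetical index enforcing a toy access constraint:
    every value of `key` matches at most `bound` tuples."""

    def __init__(self, rows, key, bound):
        self.key = key
        self.bound = bound
        self.buckets = defaultdict(list)
        # Built once, offline; duplicates are kept (bag semantics).
        for row in rows:
            self.buckets[row[key]].append(row)
        for value, bucket in self.buckets.items():
            if len(bucket) > bound:
                raise ValueError(
                    f"{key}={value!r} has {len(bucket)} tuples, bound is {bound}"
                )

    def fetch(self, value):
        """Return at most `bound` tuples; the cost does not grow with |D|."""
        return list(self.buckets.get(value, ()))


# A toy dataset D: a few orders of interest plus many unrelated ones.
orders = [
    {"customer": "c1", "amount": 5},
    {"customer": "c1", "amount": 5},  # duplicate tuple, counted twice in a bag
    {"customer": "c1", "amount": 7},
    {"customer": "c2", "amount": 3},
]
orders += [{"customer": f"x{i}", "amount": 1} for i in range(10_000)]

index = BoundedIndex(orders, key="customer", bound=3)

# Q: total amount spent by customer c1.
d_q = index.fetch("c1")                        # the fraction D_Q
bounded_answer = sum(r["amount"] for r in d_q)  # Q(D_Q)

# Reference answer from a full scan of D, for comparison only.
full_answer = sum(r["amount"] for r in orders if r["customer"] == "c1")
```

In this sketch the query touches only the three tuples in `d_q`, however many unrelated orders the dataset holds, and the duplicate tuple contributes twice to the sum, as bag semantics requires.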
Acknowledgements
The authors are supported in part by Royal Society Wolfson Research Merit Award WRM/R1/180014, ERC 652976, EPSRC EP/M025268/1, Shenzhen Institute of Computing Sciences, and Beijing Advanced Innovation Center for Big Data and Brain Computing.
Author information
Additional information
Recommended by Editor-in-Chief Guo-Ping Liu
Yang Cao received the B.Sc. degree from Beihang University, China, and the Ph.D. degree from the University of Edinburgh, UK. He is a faculty member in the School of Informatics, University of Edinburgh, UK. He is a recipient of the SIGMOD Research Highlight Award 2018, the SIGMOD Best Paper Award 2017, and the Microsoft Research Asia Fellowship. His research has been invited for publication in TODS special issues on “Best of SIGMOD 2017” and “Best of PODS 2016”, and in the Computer Journal special issue on “Best of BICOD 2015”.
His research interests include query processing, graph data management and distributed databases.
Wen-Fei Fan received the B.Sc. and M.Sc. degrees from Peking University, China, and the Ph.D. degree from the University of Pennsylvania, USA. He is the Chair of Web Data Management at the University of Edinburgh, UK, the Chief Scientist of Shenzhen Institute of Computing Sciences, and the Chief Scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, China. He is a Fellow of the Royal Society (FRS), a Fellow of the Royal Society of Edinburgh (FRSE), a Member of the Academy of Europe (MAE), an ACM Fellow (FACM), and a Foreign Member of the Chinese Academy of Sciences. He is a recipient of the Royal Society Wolfson Research Merit Award in 2018, the ERC Advanced Fellowship in 2015, the Roger Needham Award, UK in 2008, the Yangtze River Scholar, China in 2007, the Outstanding Overseas Young Scholar Award, China in 2003, the Career Award, USA in 2001, and several Test-of-Time and Best Paper Awards (Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, Best Paper Awards for SIGMOD 2017, VLDB 2010, ICDE 2007 and Computer Networks 2002).
His research interests include database theory and systems, in particular big data, data quality, data sharing, distributed query processing, query languages, recommender systems and social media marketing.
Teng-Fei Yuan received the B.Eng. degree from Shandong University, China. He is a Ph.D. degree candidate in LFCS, School of Informatics, University of Edinburgh, UK.
His research interest is the development of BEAS, a system for bounded evaluation of SQL queries.
Rights and permissions
Open access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cao, Y., Fan, WF. & Yuan, TF. Bounded Evaluation: Querying Big Data with Bounded Resources. Int. J. Autom. Comput. 17, 502–526 (2020). https://doi.org/10.1007/s11633-020-1236-1
DOI: https://doi.org/10.1007/s11633-020-1236-1