Abstract
Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper, we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We then show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. Finally, we present a full-fledged open-source implementation of these tools for distributed statistical synopses and report on a comprehensive experimental evaluation of our contributions in terms of efficiency, accuracy, and scalability.
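To make the histogram types named above concrete, the following is a minimal, centralized sketch of the difference between an Equi-Width and an Equi-Depth histogram. It is not the decentralized construction contributed by the paper: the function names `equi_width` and `equi_depth`, the in-memory list input, and the boundary convention are all assumptions made for illustration only.

```python
from typing import List


def equi_width(values: List[float], buckets: int) -> List[int]:
    """Count values per bucket when [min, max] is split into equal-width intervals."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are equal (hypothetical convention).
    width = (hi - lo) / buckets or 1.0
    counts = [0] * buckets
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts


def equi_depth(values: List[float], buckets: int) -> List[float]:
    """Return upper bucket boundaries so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th boundary is the value at the i-th quantile position of the sorted data.
    return [ordered[max(0, (i * n) // buckets - 1)] for i in range(1, buckets + 1)]


if __name__ == "__main__":
    data = [1, 2, 2, 3, 3, 3, 4, 10]
    print(equi_width(data, 3))  # bucket counts over equal-width ranges
    print(equi_depth(data, 3))  # upper boundaries of equally populated buckets
```

On skewed data such as the sample list, the Equi-Width histogram concentrates most values in one bucket, whereas the Equi-Depth boundaries adapt to where the values actually lie. That is the usual reason to prefer Equi-Depth histograms for selectivity estimation.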
Rights and permissions
Open Access: This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Ntarmos, N., Triantafillou, P. & Weikum, G. Statistical structures for Internet-scale data management. The VLDB Journal 18, 1279–1312 (2009). https://doi.org/10.1007/s00778-009-0140-7
DOI: https://doi.org/10.1007/s00778-009-0140-7