Abstract
On-Line Analytical Processing (OLAP) aims to extract useful information quickly from the large volumes of data held in a data warehouse. Pre-aggregation is a useful strategy for reducing query response time. However, it is usually impossible to pre-aggregate along every combination of the dimensions: the multidimensional nature of the data leads to a combinatorial explosion in the number and potential storage size of the aggregates, so we must pre-aggregate selectively. The cost/benefit analysis behind that selection involves estimating the storage requirements of the candidate aggregates. We present an original algorithm, based on the Pareto distribution model, for estimating the number of rows in an aggregate. We test the Pareto Model Algorithm empirically against four published algorithms and conclude that it is consistently the best of these for estimating view size.
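To make the combinatorial explosion concrete, the following minimal Python sketch counts the group-by combinations for a small schema and computes two simple size figures for each view. It is an illustration only and is not the Pareto Model Algorithm, which the abstract does not specify. The dimension names, cardinalities, and fact-table row count are hypothetical, and the uniform estimate is the classical "rows thrown uniformly at random into cells" formula, used here only as a simple baseline.

```python
from itertools import combinations


def count_group_bys(num_dimensions: int) -> int:
    """Number of distinct group-by combinations (views) over the dimensions."""
    return 2 ** num_dimensions


def max_view_rows(cardinalities, fact_rows: int) -> int:
    """Upper bound on view rows: min(product of cardinalities, fact rows)."""
    cells = 1
    for c in cardinalities:
        cells *= c
    return min(cells, fact_rows)


def uniform_estimate(cells: int, fact_rows: int) -> float:
    """Expected non-empty cells if fact_rows rows land uniformly in cells.

    Assumes a uniform distribution, which skewed data violates;
    this is a baseline, not the paper's Pareto-based estimator.
    """
    return cells * (1.0 - (1.0 - 1.0 / cells) ** fact_rows)


if __name__ == "__main__":
    # Hypothetical schema: dimension name -> number of distinct values.
    dims = {"product": 1000, "store": 50, "day": 365}
    fact_rows = 1_000_000  # hypothetical fact-table size

    print("views:", count_group_bys(len(dims)))
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            cards = [dims[name] for name in combo]
            cells = 1
            for c in cards:
                cells *= c
            bound = max_view_rows(cards, fact_rows)
            est = uniform_estimate(cells, fact_rows)
            print(combo, "bound:", bound, "uniform estimate:", round(est))
```

The number of views doubles with each added dimension, and the per-view bound grows with the product of the cardinalities until it is capped by the fact-table size. An estimator is needed for the views between those extremes, where the actual row count depends on how the data is distributed.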
Cite this article
Nadeau, T.P., Teorey, T.J. A Pareto Model for OLAP View Size Estimation. Information Systems Frontiers 5, 137–147 (2003). https://doi.org/10.1023/A:1022693305401