Random sampling from databases: a survey

Olken, Frank; Rotem, Doron

doi:10.1007/BF00140664

Published: March 1995

Statistics and Computing, volume 5, pages 25–42 (1995)

Frank Olken¹ & Doron Rotem¹,²

Abstract

This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B⁺ trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results); here we discuss both clustered sampling and sequential sampling approaches. Finally, decision-theoretic approaches to sampling for query optimization are reviewed.
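
As an illustration of one of the basic techniques named in the abstract, the following is a minimal sketch of reservoir sampling in its simplest textbook form (often called Algorithm R). It is not taken from the survey itself; the function name `reservoir_sample` and its parameters are assumptions made for this example. It draws a uniform sample of `k` records in one pass over a stream whose length is not known in advance.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn uniformly without replacement from stream.

    Hypothetical helper for illustration only (simple one-pass reservoir scheme).
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1):
            # pick a slot uniformly from 0..i and keep the item only
            # if that slot falls inside the reservoir.
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 "record ids" from a scan of 1,000 records.
    print(reservoir_sample(range(1000), 5, random.Random(42)))
```

The appeal of this scheme in a database setting is that it needs only a single sequential scan and `k` slots of memory, at the cost of reading every record.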


Author information

Authors and Affiliations

Information and Computing Sciences Division, Lawrence Berkeley Laboratory, Berkeley, CA 94720, USA
Frank Olken & Doron Rotem
Management Information Systems Department, School of Business, San Jose State University, San Jose, CA, USA
Doron Rotem

About this article

Cite this article

Olken, F., Rotem, D. Random sampling from databases: a survey. Stat Comput 5, 25–42 (1995). https://doi.org/10.1007/BF00140664

Issue Date: March 1995
DOI: https://doi.org/10.1007/BF00140664
