
Random sampling from databases: a survey

Statistics and Computing

Abstract

This paper reviews recent literature on techniques for obtaining random samples from databases. We begin by discussing why one would want to include sampling facilities in database management systems (DBMSs). We then review the basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for the estimation of aggregates (e.g. the size of query results), discussing both clustered sampling and sequential sampling approaches. Finally, decision-theoretic approaches to sampling for query optimization are reviewed.
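As an illustration of the two basic techniques named in the abstract, the following is a minimal Python sketch and is not taken from the paper. The function names `reservoir_sample` and `accept_reject_sample`, the representation of a file as a list of pages, and the `capacity` parameter are all assumptions made for this example. The first function follows the textbook reservoir scheme for a stream of unknown length. The second draws one record uniformly from pages of unequal occupancy by picking a page and a slot at random and rejecting empty slots.

```python
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Uniform sample of k items (without replacement) from a stream
    whose length is not known in advance."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def accept_reject_sample(pages: Sequence[Sequence[T]], capacity: int,
                         rng: random.Random) -> T:
    """Draw one record uniformly from pages with differing occupancy.

    `capacity` is an assumed upper bound on records per page. Each trial
    picks a page and a slot uniformly; an empty slot is rejected, so every
    record is accepted with the same probability 1 / (len(pages) * capacity).
    """
    if not any(len(p) > 0 for p in pages):
        raise ValueError("no records to sample")
    if any(len(p) > capacity for p in pages):
        raise ValueError("capacity must bound every page's occupancy")
    while True:
        page = pages[rng.randrange(len(pages))]
        slot = rng.randrange(capacity)
        if slot < len(page):
            return page[slot]


if __name__ == "__main__":
    rng = random.Random(0)
    print(reservoir_sample(range(1000), 5, rng))
    print(accept_reject_sample([[1], [2, 3, 4], []], capacity=3, rng=rng))
```

In this sketch the expected number of trials per accepted record is the capacity divided by the average page occupancy, which is why rejection becomes costly when pages are sparsely filled.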




About this article

Cite this article

Olken, F., Rotem, D. Random sampling from databases: a survey. Stat Comput 5, 25–42 (1995). https://doi.org/10.1007/BF00140664


  • DOI: https://doi.org/10.1007/BF00140664
