Random Databases with Approximate Record Matching

Methodology and Computing in Applied Probability

Abstract

In many database applications in telecommunications, environmental and health sciences, bioinformatics, physics, and econometrics, real-world data are uncertain and subject to errors. These data are processed, transmitted, and stored in large databases. We consider stochastic modelling for databases with uncertain data and for some basic database operations (for example, join and selection) with exact and approximate matching. Approximate join is used for merging or data deduplication in large databases. We study the distribution and mean of the join sizes for random databases. A random database is treated as a table of independent random records with a common distribution (or as a set of such random tables). These results can be used for integration of information from different databases, multiple join optimization, and various probabilistic algorithms for structured random data.
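As a rough illustration of what the "mean of the join sizes" means for a random database, the sketch below counts matching record pairs between two random tables and compares a simulated average with a closed-form mean. It is not the authors' model: the single integer attribute, the uniform distribution on {0, ..., d-1}, the absolute-difference tolerance `eps`, and all function names are assumptions made for this example. The closed form relies only on linearity of expectation: with n and m independent records, the mean join size is n * m * P(|X - Y| <= eps).

```python
import random


def join_size(table_a, table_b, eps=0):
    """Count pairs (a, b) with |a - b| <= eps; eps=0 is the exact equi-join."""
    return sum(1 for a in table_a for b in table_b if abs(a - b) <= eps)


def expected_join_size(n, m, d, eps=0):
    """Mean join size for tables of n and m i.i.d. uniform records on {0..d-1}.

    Assumption for this sketch: one integer attribute per record.
    """
    # P(|X - Y| <= eps) for independent uniform X, Y on {0, ..., d-1}:
    # the difference X - Y equals k with probability (d - |k|) / d**2.
    match_prob = sum(d - abs(k) for k in range(-eps, eps + 1) if abs(k) < d) / d**2
    return n * m * match_prob


def simulate_mean(n, m, d, eps=0, trials=2000, seed=0):
    """Monte Carlo estimate of the mean join size under the same assumptions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(d) for _ in range(n)]
        b = [rng.randrange(d) for _ in range(m)]
        total += join_size(a, b, eps)
    return total / trials


if __name__ == "__main__":
    for eps in (0, 1):
        print(eps, expected_join_size(10, 10, 5, eps), simulate_mean(10, 10, 5, eps))
```

Raising `eps` from 0 to 1 turns the exact join into an approximate one and increases the matching probability, so the mean join size grows accordingly. Other distance functions or record distributions would change `match_prob` but not the n * m scaling.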

Author information

Correspondence to Oleg Seleznjev.

About this article

Cite this article

Seleznjev, O., Thalheim, B. Random Databases with Approximate Record Matching. Methodol Comput Appl Probab 12, 63–89 (2010). https://doi.org/10.1007/s11009-008-9092-4

  • DOI: https://doi.org/10.1007/s11009-008-9092-4
