Random Databases with Approximate Record Matching

Seleznjev, Oleg; Thalheim, Bernhard

doi:10.1007/s11009-008-9092-4

Published: 31 July 2008

Volume 12, pages 63–89, (2010)

Methodology and Computing in Applied Probability

Oleg Seleznjev¹,² & Bernhard Thalheim³

Abstract

In many database applications in telecommunication, environmental and health sciences, bioinformatics, physics, and econometrics, real-world data are uncertain and subject to errors. These data are processed, transmitted, and stored in large databases. We consider stochastic modelling for databases with uncertain data and for some basic database operations (for example, join and selection) with exact and approximate matching. Approximate join is used for merging or data deduplication in large databases. The distribution and mean of the join sizes are studied for random databases. A random database is treated as a table with independent random records with a common distribution (or as a set of random tables). These results can be used for integration of information from different databases, multiple join optimization, and various probabilistic algorithms for structured random data.
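
To make the notion of an approximate join size concrete, the following is a minimal sketch and not the authors' model. It assumes single-attribute records drawn independently from Uniform(0, 1), and it treats two records as matching when their absolute difference is at most eps; the distribution, the distance, and all function names are illustrative assumptions. Under these assumptions the mean join size of tables with n and m records follows from linearity of expectation as n * m * P(|X - Y| <= eps), where P(|X - Y| <= eps) = 2*eps - eps^2 for 0 <= eps <= 1.

```python
import random


def approx_join_size(table_a, table_b, eps):
    """Count record pairs (a, b) with |a - b| <= eps (nested-loop approximate join)."""
    return sum(1 for a in table_a for b in table_b if abs(a - b) <= eps)


def expected_join_size(n, m, eps):
    """Mean join size for independent Uniform(0, 1) records (assumed model).

    Uses P(|X - Y| <= eps) = 2*eps - eps**2, valid for 0 <= eps <= 1.
    """
    match_probability = 2 * eps - eps ** 2
    return n * m * match_probability


def simulate_mean_join_size(n, m, eps, trials, seed=0):
    """Average the approximate join size over independently generated random tables."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = 0
    for _ in range(trials):
        table_a = [rng.random() for _ in range(n)]
        table_b = [rng.random() for _ in range(m)]
        total += approx_join_size(table_a, table_b, eps)
    return total / trials


if __name__ == "__main__":
    n, m, eps = 50, 40, 0.05
    print("theoretical mean:", expected_join_size(n, m, eps))
    print("simulated mean:  ", simulate_mean_join_size(n, m, eps, trials=200))
```

Setting eps = 0 corresponds to exact matching, which for continuous records almost surely yields an empty join; this is one reason approximate matching is the natural operation for uncertain numerical data.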

Author information

Authors and Affiliations

Department of Mathematics and Mathematical Statistics, Umeå University, 901 87, Umeå, Sweden
Oleg Seleznjev
Faculty of Mathematics and Mechanics, Moscow State University, 119 992, Moscow, Russia
Oleg Seleznjev
Institute of Computer Science and Applied Mathematics, Christian-Albrechts University, 24118, Kiel, Germany
Bernhard Thalheim

Corresponding author

Correspondence to Oleg Seleznjev.

About this article

Cite this article

Seleznjev, O., Thalheim, B. Random Databases with Approximate Record Matching. Methodol Comput Appl Probab 12, 63–89 (2010). https://doi.org/10.1007/s11009-008-9092-4

Received: 22 May 2006
Revised: 20 May 2008
Accepted: 25 June 2008
Published: 31 July 2008
Issue Date: March 2010
DOI: https://doi.org/10.1007/s11009-008-9092-4
