Abstract
In many database applications in telecommunications, environmental and health sciences, bioinformatics, physics, and econometrics, real-world data are uncertain and subject to errors. These data are processed, transmitted, and stored in large databases. We consider stochastic modelling for databases with uncertain data and for some basic database operations (for example, join and selection) with exact and approximate matching. Approximate join is used for merging records or deduplicating data in large databases. The distribution and mean of the join sizes are studied for random databases. A random database is treated as a table of independent random records with a common distribution, or as a set of such random tables. These results can be used for integrating information from different databases, for multiple join optimization, and for various probabilistic algorithms on structured random data.
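As an illustration of the mean join size for random tables, the following minimal sketch may help. It assumes two single-column tables whose records are independent draws from discrete distributions `p` and `q`, and it assumes absolute difference as the matching distance with tolerance `eps`; neither choice is taken from the article. Under these assumptions, linearity of expectation gives a mean join size of n · m · P(|X − Y| ≤ eps) for tables of n and m records. All function and variable names are hypothetical.

```python
import random
from itertools import product


def expected_join_size(n, m, p, q, eps=0):
    """Mean join size n * m * P(|X - Y| <= eps) for independent X ~ p, Y ~ q.

    p and q map numeric record values to probabilities (an assumed model).
    eps = 0 corresponds to exact matching.
    """
    match_prob = sum(
        px * qy
        for (x, px), (y, qy) in product(p.items(), q.items())
        if abs(x - y) <= eps
    )
    return n * m * match_prob


def simulated_join_size(n, m, p, q, eps=0, rng=None):
    """Draw two random tables and count the matching record pairs."""
    rng = rng or random.Random()
    left = rng.choices(list(p), weights=list(p.values()), k=n)
    right = rng.choices(list(q), weights=list(q.values()), k=m)
    return sum(1 for x in left for y in right if abs(x - y) <= eps)


# Illustrative distributions (made up for this sketch).
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}

exact_mean = expected_join_size(20, 30, p, q)          # exact matching
approx_mean = expected_join_size(20, 30, p, q, eps=1)  # approximate matching

# Monte Carlo check of the exact-matching mean with a fixed seed.
rng = random.Random(0)
trials = 200
empirical_mean = sum(
    simulated_join_size(20, 30, p, q, rng=rng) for _ in range(trials)
) / trials
```

Comparing `empirical_mean` with `exact_mean` gives a quick sanity check of the formula; widening `eps` can only increase the match probability, so the approximate join is never smaller on average than the exact one.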
Cite this article
Seleznjev, O., Thalheim, B. Random Databases with Approximate Record Matching. Methodol Comput Appl Probab 12, 63–89 (2010). https://doi.org/10.1007/s11009-008-9092-4
DOI: https://doi.org/10.1007/s11009-008-9092-4