Data Quality Control Based on Metric Data Models
We consider statistical edits defined on a metric data space spanned by the non-key attributes (variables) of a given database. Integrity constraints on this data space are derived from definitions, behavioral equations or a balance equation system; a set of business or economic indicators is a typical example. The variables are linked by the four basic arithmetic operations only. Assuming a multivariate Gaussian distribution and an errors-in-variables model, the unknown (latent) variables can be estimated by a generalized least-squares (GLS) procedure. This approach has two drawbacks: the equations form a non-linear system because variables are multiplied and divided, and independence between all variables is generally assumed because real applications rarely provide information about their dependence. As no finite-parameter family of densities is closed under all four arithmetic operations, we use MCMC simulation techniques, cf. Smith and Gelfand (1992) and Chib (2004), to derive the “exact” distributions in the non-normal case and under cross-correlation. The research can be viewed as an extension of Köppen and Lenz (2005) in that it studies the robustness of the GLS approach with respect to non-normality and correlation.
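To make the GLS step concrete, the following is a minimal sketch for the simplest case only: a single linear balance equation with independent Gaussian measurement errors. The function name `gls_reconcile`, the three indicators and all numbers are hypothetical and serve only as illustration; the non-linear constraints (multiplication, division) and the MCMC simulation discussed above are not covered by this sketch.

```python
def gls_reconcile(observed, variances, coeffs):
    """Adjust noisy observations so that sum(coeffs[i] * x[i]) == 0 holds.

    Minimises sum((x_i - observed_i)**2 / variances_i) subject to one
    linear constraint. Errors are assumed independent, i.e. the error
    covariance matrix is diagonal (an assumption of this sketch).
    """
    # Violation of the balance equation by the raw observations.
    residual = sum(a * z for a, z in zip(coeffs, observed))
    # Variance of that violation under the independence assumption.
    scale = sum(a * a * v for a, v in zip(coeffs, variances))
    # Closed-form GLS correction: less reliable values are adjusted more.
    return [z - v * a * residual / scale
            for z, v, a in zip(observed, variances, coeffs)]


# Hypothetical indicators: revenue, cost, profit,
# linked by profit = revenue - cost, i.e. revenue - cost - profit = 0.
observed = [100.0, 70.0, 36.0]
variances = [4.0, 1.0, 1.0]
coeffs = [1.0, -1.0, -1.0]

estimates = gls_reconcile(observed, variances, coeffs)
print(estimates)  # [104.0, 69.0, 35.0]
```

The observed values violate the balance equation by 6 units. The estimate spreads this discrepancy in proportion to the error variances, so revenue, the least precisely measured variable here, absorbs most of the correction, and the adjusted values satisfy the constraint exactly.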
- John Aitchison. The Statistical Analysis of Compositional Data. Kluwer, 1986.
- Siddhartha Chib. Handbook of Computational Statistics - Concepts and Methods, chapter Markov Chain Monte Carlo Technology, pages 71–102. Springer, 2004.
- I. P. Fellegi and D. Holt. A Systematic Approach to Automatic Edit and Imputation. JASA, 71, 17-35, 1976.
- Veit Köppen and Hans-J. Lenz. Simulation of non-linear stochastic equation systems. In A.N. Pepelyshev, S.M. Ermakov, V.B. Melas, eds., Proceedings of the Fifth Workshop on Simulation, pages 373–378, St. Petersburg, Russia, July 2005. NII Chemistry Saint Petersburg University Publishers.
- Hans-J. Lenz and Roland M. Müller. On the solution of fuzzy equation systems. In G. Della Riccia, H-J. Lenz, and R. Kruse, eds., Computational Intelligence in Data Mining, CISM Courses and Lectures. Springer, New York, 2000.
- Hans-J. Lenz and Egmar Rödel. Statistical quality control of data. In Peter Gritzmann, Rainer Hettich, Reiner Horst, and Ekkehard Sachs, eds., 16th Symposium on Operations Research, pages 341–346. Physica Verlag, Heidelberg, 1991.
- Gunar E. Liepins and V.R.R. Uppuluri. Data Quality Control Theory and Pragmatics. Marcel Dekker, 1991.
- Beat Schmid. Bilanzmodelle. Simulationsverfahren zur Verarbeitung unscharfer Teilinformationen. ORL-Bericht No. 40, ORL Institut, ETH Zürich, 1979.
- G. Barrie Wetherill and Marion E. Gerson. Computer Aids to Data Quality Control. The Statistician, 36, 598-592, 1987.