
Data Quality Control Based on Metric Data Models

Chapter

Summary

We consider statistical edits defined on a metric data space spanned by the non-key attributes (variables) of a given database. Integrity constraints on this data space are derived from definitions, behavioral equations, or a balance equation system; a set of business or economic indicators is a typical example. The variables are linked only by the four basic arithmetic operations. Assuming a multivariate Gaussian distribution and an errors-in-variables model, the unknown (latent) variables can be estimated by a generalized least-squares (GLS) procedure. This approach has two drawbacks: the equations form a non-linear system because variables are multiplied and divided, and independence between all variables is generally assumed because real applications rarely provide information about their dependence. As no finite-parameter family of densities is closed under all four arithmetic operations, we use MCMC simulation techniques, cf. Smith and Gelfand (1992) and Chib (2004), to derive the “exact” distributions in the non-normal case and under cross-correlation. The research can be viewed as an extension of Köppen and Lenz (2005) in that it studies the robustness of the GLS approach with respect to non-normality and correlation.
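The two estimation ideas named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the example equations (revenue - cost - profit = 0 and revenue = price * quantity), and all numbers are assumptions made for illustration. The first function applies the closed-form GLS adjustment for a single linear constraint with independent Gaussian errors. The second uses sampling-importance-resampling, in the spirit of the sampling-resampling perspective of Smith and Gelfand (1992), for a multiplicative constraint, where no closed form is available.

```python
import math
import random


def gls_adjust(obs, sigma, a):
    """GLS adjustment under one linear constraint sum(a[i] * x[i]) == 0.

    Assumes independent Gaussian measurement errors with standard
    deviations `sigma` (an assumption of this sketch, not of the chapter).
    """
    residual = sum(ai * xi for ai, xi in zip(a, obs))
    denom = sum(ai * ai * si * si for ai, si in zip(a, sigma))
    # Each observation is shifted in proportion to its own error variance.
    return [xi - si * si * ai * residual / denom
            for xi, si, ai in zip(obs, sigma, a)]


def sir_product(obs, sigma, n_draws=20000, n_keep=2000, seed=0):
    """Sampling-importance-resampling for revenue = price * quantity.

    `obs` and `sigma` are (price, quantity, revenue) observations and
    their error standard deviations. Price and quantity are drawn from
    their Gaussian observation densities, weighted by the likelihood of
    the observed revenue, and resampled in proportion to the weights.
    """
    rng = random.Random(seed)
    p_obs, q_obs, r_obs = obs
    sp, sq, sr = sigma
    draws = [(rng.gauss(p_obs, sp), rng.gauss(q_obs, sq))
             for _ in range(n_draws)]
    weights = [math.exp(-0.5 * ((r_obs - p * q) / sr) ** 2)
               for p, q in draws]
    kept = rng.choices(draws, weights=weights, k=n_keep)
    revenue = [p * q for p, q in kept]
    return kept, revenue


if __name__ == "__main__":
    # Hypothetical indicators: revenue, cost, profit (not from the chapter).
    adjusted = gls_adjust([100.0, 60.0, 45.0], [2.0, 2.0, 1.0],
                          [1.0, -1.0, -1.0])
    print("GLS-adjusted values:", adjusted)

    _, revenue = sir_product((10.0, 5.0, 52.0), (0.5, 0.2, 1.0))
    print("Posterior mean revenue:", sum(revenue) / len(revenue))
```

In the linear case the adjusted values satisfy the balance equation exactly, and the least precise observation absorbs most of the correction. The resampling variant makes no Gaussian assumption about the product, which is why simulation is attractive once variables are multiplied or divided. If the observed revenue is very far from every simulated product, all weights can underflow to zero and the resampling step fails, so a practical version would work with log-weights.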

Keywords

Multivariate Gaussian Distribution; Data Quality Control; Statistical Quality Control; Validation Rule; Right Hand Side Variable
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. John Aitchison. The Statistical Analysis of Compositional Data. Kluwer, 1986.
  2. Adelchi Azzalini and Antonella Capitanio. Statistical Applications of the Multivariate Skew Normal Distribution. Journal of the Royal Statistical Society, Series B, 61:579–602, 1999.
  3. Adelchi Azzalini and Alessandra Dalla Valle. The Multivariate Skew-Normal Distribution. Biometrika, 83:715–726, 1996.
  4. Carlo Batini and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
  5. Siddhartha Chib. Markov Chain Monte Carlo Technology. In Handbook of Computational Statistics: Concepts and Methods, pages 71–102. Springer, 2004.
  6. I. P. Fellegi and D. Holt. A Systematic Approach to Automatic Edit and Imputation. JASA, 71:17–35, 1976.
  7. W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  8. Veit Köppen and Hans-J. Lenz. Simulation of non-linear stochastic equation systems. In A.N. Pepelyshev, S.M. Ermakov, and V.B. Melas, eds., Proceedings of the Fifth Workshop on Simulation, pages 373–378, St. Petersburg, Russia, July 2005. NII Chemistry Saint Petersburg University Publishers.
  9. Hans-J. Lenz and Roland M. Müller. On the solution of fuzzy equation systems. In G. Della Riccia, H.-J. Lenz, and R. Kruse, eds., Computational Intelligence in Data Mining, CISM Courses and Lectures. Springer, New York, 2000.
  10. Hans-J. Lenz and Egmar Rödel. Statistical quality control of data. In Peter Gritzmann, Rainer Hettich, Reiner Horst, and Ekkehard Sachs, eds., 16th Symposium on Operations Research, pages 341–346. Physica Verlag, Heidelberg, 1991.
  11. Gunar E. Liepins and V.R.R. Uppuluri. Data Quality Control: Theory and Pragmatics. Marcel Dekker, 1991.
  12. Beat Schmid. Bilanzmodelle. Simulationsverfahren zur Verarbeitung unscharfer Teilinformationen. ORL-Bericht No. 40, ORL Institut, ETH Zürich, 1979.
  13. Adrian F. M. Smith and Alan E. Gelfand. Bayesian statistics without tears: A sampling-resampling perspective. The American Statistician, 46(2):84–88, May 1992.
  14. G. Barrie Wetherill and Marion E. Gerson. Computer Aids to Data Quality Control. The Statistician, 36:589–598, 1987.

Copyright information

© Physica-Verlag Heidelberg 2010

Authors and Affiliations

  1. Institute of Production, Information Systems and Operations Research, Freie Universität Berlin, Berlin, Germany
  2. Institute of Statistics and Econometrics, Freie Universität Berlin, Berlin, Germany
