Abstract
The editing-and-imputation problem consists of locating the errors in a record that does not satisfy a set of consistency rules (edits). Once the potentially erroneous fields have been localized, new values must be imputed to them. The output dataset should consist of valid records and preserve, as closely as possible, the statistical properties of the input dataset. Statistical agencies usually do most of this work manually, which consumes a great deal of human resources. This paper presents a mathematical programming model that solves the problem to optimality on surveys with categorical values and particular edits. We also describe a heuristic approach to deal with more complex surveys. The heuristic procedure combines the widely accepted hot-deck donor scheme with multivariate regression analysis. It has been implemented with a graphical user interface running on standard personal computers and has been tested on real-world surveys. This paper demonstrates the satisfactory performance of our automatic procedure.
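To make the error localization step concrete, the following is a minimal sketch on a toy categorical survey. It is not the mathematical programming model of the paper: it simply enumerates field subsets of increasing size until it finds one whose values can be changed to produce a valid record, in the spirit of changing as few fields as possible. The field names, domains, and edits (`DOMAINS`, `EDITS`) and the helper names (`is_valid`, `localize_errors`) are hypothetical and chosen only for illustration. The imputed values are the first feasible ones found, not donor-based values.

```python
from itertools import combinations, product

# Hypothetical toy survey: each field and its categorical domain (assumed).
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employment": ["none", "employed"],
}

# Hypothetical edits: each predicate returns True when the record is INCONSISTENT.
EDITS = [
    lambda r: r["age"] == "child" and r["marital"] == "married",
    lambda r: r["age"] == "child" and r["employment"] == "employed",
]


def is_valid(record):
    """A record is valid when it violates none of the edits."""
    return not any(edit(record) for edit in EDITS)


def localize_errors(record):
    """Brute-force search for a smallest set of fields to change.

    Returns (fields_to_change, repaired_record), or None if no assignment
    of the domains satisfies every edit.
    """
    fields = list(DOMAINS)
    # Try subsets of size 0, 1, 2, ... so the first hit changes the fewest fields.
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            # Try every combination of values for the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if is_valid(candidate):
                    return set(subset), candidate
    return None


record = {"age": "child", "marital": "married", "employment": "employed"}
fields, repaired = localize_errors(record)
print(fields, repaired)
```

In this toy record both edits are violated, and changing the single field `age` repairs both, whereas fixing `marital` and `employment` would require two changes. The enumeration is exponential in the number of fields, which is why exact optimization models and heuristics are of interest for realistic surveys.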
Cite this article
Delgado-Quintero, S., Salazar-González, JJ. A new approach for data editing and imputation. Math Meth Oper Res 68, 407–428 (2008). https://doi.org/10.1007/s00186-008-0237-6