A new approach for data editing and imputation

  • Original Article
Mathematical Methods of Operations Research

Abstract

The editing-and-imputation problem consists of finding the errors in a record that does not satisfy a set of consistency rules (edits). Once potential errors have been localized, new values must be imputed to the associated fields. The output dataset should consist of valid records and preserve statistical properties similar to those of the input dataset. Statistical agencies usually do most of this work manually, which consumes a great deal of human resources. This paper presents a mathematical programming model to solve the problem optimally on surveys with categorical values and particular edits. We also describe a heuristic approach for more complex surveys. The heuristic procedure combines the widely accepted hot-deck donor scheme with multivariate regression analysis. It has been implemented with a graphical user interface running on standard personal computers and has been tested on real-world surveys. This paper demonstrates the satisfactory performance of our automatic procedure.
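
To make the two stages concrete, the following minimal sketch may help. It is not the mathematical programming model or the heuristic of the paper: it localizes errors by brute-force enumeration of the smallest set of fields to change (the usual minimum-change principle) and then imputes from the closest valid donor record, in the spirit of a hot-deck scheme. The field names, domains, edits and function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations, product

# Hypothetical categorical fields and their domains (illustrative only).
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Each edit lists a forbidden combination: a record fails the edit when
# every field named in the edit takes one of the listed values.
EDITS = [
    {"age": {"child"}, "marital": {"married"}},
    {"age": {"child"}, "employed": {"yes"}},
]


def is_valid(record):
    """A record is valid when it fails none of the edits."""
    return not any(
        all(record[field] in values for field, values in edit.items())
        for edit in EDITS
    )


def localize_errors(record):
    """Return a smallest set of fields that can be changed to reach a valid record.

    Brute force: try subsets of increasing size, so the first hit is minimal.
    """
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if is_valid(candidate):
                    return set(subset)
    return None  # no valid record exists for these edits


def impute_from_donor(record, suspect, donors):
    """Copy the suspect fields from the closest donor that yields a valid record.

    Closeness is the number of differing values on the trusted fields.
    """
    trusted = [f for f in DOMAINS if f not in suspect]

    def distance(donor):
        return sum(donor[f] != record[f] for f in trusted)

    for donor in sorted(donors, key=distance):
        candidate = {**record, **{f: donor[f] for f in suspect}}
        if is_valid(candidate):
            return candidate
    return None  # no donor produces a valid record


record = {"age": "child", "marital": "married", "employed": "yes"}
donors = [
    {"age": "child", "marital": "single", "employed": "no"},
    {"age": "adult", "marital": "married", "employed": "yes"},
]
suspect = localize_errors(record)
repaired = impute_from_donor(record, suspect, donors)
```

In this toy record both edits fail, yet changing the single field `age` is enough to satisfy them, so only that field is imputed from the donor that agrees with the record on the remaining fields. The enumeration grows exponentially with the number of fields, which is one reason exact optimization models and heuristics are of interest for realistic surveys.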

Author information

Correspondence to Sergio Delgado-Quintero.

About this article

Cite this article

Delgado-Quintero, S., Salazar-González, JJ. A new approach for data editing and imputation. Math Meth Oper Res 68, 407–428 (2008). https://doi.org/10.1007/s00186-008-0237-6

  • DOI: https://doi.org/10.1007/s00186-008-0237-6