A new approach for data editing and imputation

  • Sergio Delgado-Quintero
  • Juan-José Salazar-González
Original Article

Abstract

The editing-and-imputation problem concerns finding the errors in a record that does not satisfy a set of consistency rules. Once the potential errors have been localized, new values must be imputed to the associated fields. The output dataset should consist of valid records and preserve statistical properties similar to those of the input dataset. Most of this work is usually done manually by statistical agencies, consuming a great deal of human resources. This paper presents a mathematical programming model to solve the problem optimally on surveys with categorical values and particular edits. We also describe a heuristic approach to deal with more complex surveys. The heuristic procedure combines the widely accepted hot-deck donor scheme with multivariate regression analysis. It has been implemented with a graphical user interface running on standard personal computers and has been tested on real-world surveys. This paper demonstrates the satisfactory performance of our automatic procedure.
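
To make the two stages named in the abstract concrete, the following is a minimal sketch, not the authors' model or heuristic. It assumes a hypothetical three-field categorical survey (the field names, domains and edits are invented for illustration), localizes errors by exhaustive search for a smallest set of fields to change, and then imputes those fields from the most similar donor record in a simple hot-deck fashion. The exhaustive search stands in for the mathematical programming model and is only practical for very small records.

```python
from itertools import combinations, product

# Hypothetical toy survey: field names, domains and edits are illustrative only.
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employment": ["none", "employed"],
}

# Each edit is a predicate that returns True when a record is INCONSISTENT.
EDITS = [
    lambda r: r["age"] == "child" and r["marital"] == "married",
    lambda r: r["age"] == "child" and r["employment"] == "employed",
]


def is_valid(record):
    """A record is valid when it violates no edit."""
    return not any(edit(record) for edit in EDITS)


def localize_errors(record):
    """Return a smallest set of fields whose values can be changed to make
    the record valid (exhaustive search, smallest subsets first)."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if is_valid(candidate):
                    return set(subset)
    return None  # no valid record exists for these edits


def hot_deck_impute(record, suspect_fields, donors):
    """Copy the suspect fields from the donor that agrees most with the
    record on the remaining fields; accept only a valid result."""
    trusted = [f for f in DOMAINS if f not in suspect_fields]
    ranked = sorted(
        donors, key=lambda d: -sum(d[f] == record[f] for f in trusted)
    )
    for donor in ranked:
        candidate = {**record, **{f: donor[f] for f in suspect_fields}}
        if is_valid(candidate):
            return candidate
    return None  # no donor yields a valid record


record = {"age": "child", "marital": "married", "employment": "employed"}
donors = [
    {"age": "adult", "marital": "married", "employment": "none"},
    {"age": "adult", "marital": "married", "employment": "employed"},
    {"age": "child", "marital": "single", "employment": "none"},
]
suspects = localize_errors(record)
repaired = hot_deck_impute(record, suspects, donors)
```

In this toy case the record violates both edits, changing the single field `age` is enough to satisfy them, and the second donor is chosen because it matches the record on both remaining fields.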

Keywords

Editing · Imputation · Error localization problem · Mathematical programming · Heuristics

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Sergio Delgado-Quintero (1)
  • Juan-José Salazar-González (1)
  1. DEIOC, Universidad de La Laguna, Tenerife, Spain
