Mining incomplete survey data through classification

  • Regular Paper
Knowledge and Information Systems

Abstract

Data mining with incomplete survey data is an immature subject area. When a database with incomplete data is mined, the patterns of missing data and the potential implications of those missing data constitute valuable knowledge. This paper presents the conceptual foundations of mining incomplete data through classification, in the context of a specific decision-making problem. The proposed technique assumes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision-making problem, rather than to estimate individual missing values. In this technique, a set of complete data is used to acquire a near-optimal classifier, which provides the prediction reference information for analyzing the incomplete data. The data-missing behavior concealed in the missing data is then revealed. Using a real-world survey data set, the paper demonstrates the usefulness of this technique.

Abbreviations

BPNN: Layered back-propagation neural networks
CCR: Correct classification rate
HMDA: Home Mortgage Disclosure Act
LDA: Linear discriminant analysis
MI: Multiple imputation
MSA: Metropolitan Statistical Area

C: Classifier
RC: A set of reference information
RM: A set of classification results for SM_k
S: A set of survey data with incomplete data
SC: A data set with complete data
SC_Test: Sub-set of SC for test of the classifier
SC_Train: Sub-set of SC for training of the classifier
SM: A set of survey data with missing values
SM_k: Data sets with artificial imputation values for the missing values
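
The workflow outlined in the abstract can be illustrated with a minimal Python sketch that follows the notation listed above. It is an illustration under stated assumptions, not the authors' implementation: the toy records, the nearest-centroid classifier (a simple stand-in for the BPNN and LDA classifiers named in the abbreviations), the candidate imputation values 1.0 and 3.0, the helper names, and the use of the CCR on SC_Test as the reference information RC are all hypothetical choices made for this example.

```python
from statistics import mean

# Toy survey S (hypothetical data): each record is (features, class label).
# None marks a missing value.
S = [
    ([1.0, 1.0], 0), ([1.2, 0.8], 0), ([0.9, 1.1], 0), ([1.1, 0.9], 0),
    ([3.0, 3.0], 1), ([3.2, 2.8], 1), ([2.9, 3.1], 1), ([3.1, 2.9], 1),
    ([2.0, None], 0), ([None, 2.1], 0), ([1.9, None], 1),
]

# Partition S into SC (complete records) and SM (records with missing values).
SC = [r for r in S if None not in r[0]]
SM = [r for r in S if None in r[0]]

# Split SC into training and test subsets (alternating records, for simplicity).
SC_Train, SC_Test = SC[::2], SC[1::2]


def train(records):
    """Stand-in classifier C: one centroid (mean vector) per class."""
    by_class = {}
    for x, y in records:
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_class.items()}


def classify(C, x):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    return min(C, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, C[y])))


def ccr(C, records):
    """Correct classification rate of C on labelled records."""
    return sum(classify(C, x) == y for x, y in records) / len(records)


C = train(SC_Train)

# Assumption: the CCR on SC_Test serves as the reference information RC.
RC = {"ccr": ccr(C, SC_Test)}

# Build one data set SM_k per artificial imputation value k, classify it with C,
# and keep the predicted classes in RM.
RM = {}
for k in (1.0, 3.0):  # hypothetical "low" and "high" imputation values
    SM_k = [([k if v is None else v for v in x], y) for x, y in SM]
    RM[k] = [classify(C, x) for x, _ in SM_k]

# Share of SM records whose known class is reproduced under each imputation value.
agreement = {
    k: sum(p == y for p, (_, y) in zip(preds, SM)) / len(SM)
    for k, preds in RM.items()
}

print("RC:", RC)
print("RM:", RM)
print("agreement:", agreement)
```

In this toy data the low imputation value reproduces the known classes more often than the high one, which is the kind of contrast against RC that could hint at what the missing values represent. How the paper itself compares RM with RC is not specified in this excerpt.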

Author information

Corresponding author

Correspondence to Shouhong Wang.

About this article

Cite this article

Wang, H., Wang, S. Mining incomplete survey data through classification. Knowl Inf Syst 24, 221–233 (2010). https://doi.org/10.1007/s10115-009-0245-8

  • DOI: https://doi.org/10.1007/s10115-009-0245-8
