Knowledge and Information Systems, Volume 24, Issue 2, pp 221–233

Mining incomplete survey data through classification

  • Hai Wang
  • Shouhong Wang
Regular Paper


Data mining with incomplete survey data is an immature subject area. When a database with incomplete data is mined, the patterns of missing data and the potential implications of those missing data constitute valuable knowledge. This paper presents the conceptual foundations of data mining with incomplete data through classification that is relevant to a specific decision-making problem. The proposed technique supposes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision-making problem, rather than to estimate individual missing values. In this technique, a set of complete data is used to acquire a near-optimal classifier, which provides the prediction reference information for analyzing the incomplete data. The data-missing behavior concealed in the missing data is then revealed. The paper demonstrates the usefulness of the technique on a real-world survey data set.


Keywords: Data mining, Knowledge discovery, Incomplete survey data, Classification



Layered back-propagation neural networks


Correct classification rate


Home Mortgage Disclosure Act


Linear discriminant analysis


Multiple imputation


Metropolitan Statistical Area

List of symbols




A set of reference information


A set of classification results for SM k


A set of survey data with incomplete data


A data set with complete data


Sub-set of SC for test of the classifier


Sub-set of SC for training of the classifier


A set of survey data with missing values


Data sets with artificial imputation values for the missing values
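
The workflow outlined in the abstract and the symbol list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid classifier is a simple stand-in for the linear discriminant analysis or back-propagation neural network named in the abbreviations, the single-value fill is a simplified form of artificial imputation, and the toy records, class labels, and function names are all hypothetical. The sketch trains a classifier on complete data, checks its correct classification rate on held-out complete data, and then classifies the incomplete records under several artificial fill values to see how the predicted class shares shift.

```python
from collections import Counter


def train_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(centroids, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[y]))
    return min(centroids, key=dist)


def correct_classification_rate(centroids, rows, labels):
    """Share of held-out complete records that are classified correctly."""
    hits = sum(classify(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)


def impute(row, fill_value):
    """Replace every missing entry (None) with one artificial value."""
    return [fill_value if v is None else v for v in row]


def missing_data_profile(centroids, sm, fill_values):
    """Classify each imputed copy of the incomplete set and return,
    per fill value, the share of records assigned to each class."""
    profile = {}
    for k in fill_values:
        counts = Counter(classify(centroids, impute(row, k)) for row in sm)
        profile[k] = {y: counts[y] / len(sm) for y in centroids}
    return profile


# Hypothetical complete data, split into training and test subsets.
sc_train = [[8, 9], [9, 8], [7, 8], [2, 1], [1, 2], [3, 2]]
y_train = ["approve"] * 3 + ["deny"] * 3
sc_test, y_test = [[8, 8], [2, 2]], ["approve", "deny"]

# Hypothetical incomplete records: the second attribute is missing.
sm = [[8, None], [2, None], [3, None]]

centroids = train_centroids(sc_train, y_train)
ccr = correct_classification_rate(centroids, sc_test, y_test)
profile = missing_data_profile(centroids, sm, fill_values=[0, 5, 10])

print("correct classification rate:", ccr)
for k, shares in profile.items():
    print("fill value", k, "->", shares)
```

In this toy setting the predicted class shares of the incomplete records depend strongly on the artificial fill value, which is the kind of pattern an analyst would then compare against the complete-data reference before drawing conclusions about the missing data.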





Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  1. Sobey School of Business, Saint Mary’s University, Halifax, Canada
  2. Charlton College of Business, University of Massachusetts Dartmouth, Dartmouth, USA
