Capture Missing Values Based on Crowdsourcing

  • Chen Ye
  • Hongzhi Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8491)


Due to the unreliable environment of the mobile cloud, attribute values or entire tuples may be missing or lost. These missing values should be captured to make data mining and analysis more accurate. Besides ignoring missing values or setting them to defaults, many imputation methods have been proposed, but each has its limitations. This paper proposes a human-machine hybrid workflow that fills missing values with crowdsourcing. First, we propose a missing value selection algorithm that identifies the missing values suitable for filling by crowdsourcing. Then we propose three missing value filling methods, chosen according to attribute type, to select answers from the crowd. Experimental results show that our algorithms can improve data quality significantly at low cost.
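The abstract does not spell out how answers are selected for each attribute type, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes three hypothetical attribute types ("categorical", "numeric", "text") and swaps in simple, well-known aggregation rules: majority vote for categorical values, the median for numeric values, and majority vote after case and whitespace normalization for free text. The function name and type labels are illustrative.

```python
from collections import Counter
from statistics import median


def aggregate_answers(answers, attr_type):
    """Pick one value from crowd answers for a single missing cell.

    attr_type is a hypothetical label: "categorical", "numeric", or "text".
    The rules below are illustrative assumptions, not the paper's algorithms.
    """
    if not answers:
        return None  # no worker answered; leave the cell missing
    if attr_type == "numeric":
        # The median is robust to a few outlying worker answers.
        return median(float(a) for a in answers)
    if attr_type == "text":
        # Normalize case and whitespace so near-identical strings vote together.
        answers = [" ".join(str(a).lower().split()) for a in answers]
    # Majority vote; ties resolve to the answer encountered first.
    return Counter(answers).most_common(1)[0][0]
```

For example, under these assumptions `aggregate_answers(["10", "12", "100"], "numeric")` yields `12.0`, so a single wildly wrong worker does not shift the filled value.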


Keywords: data cleaning, missing values, crowdsourcing




Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Chen Ye (1)
  • Hongzhi Wang (1)
  1. Harbin Institute of Technology, Harbin, China
