Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network

  • Chen Ye
  • Hongzhi Wang
  • Jianzhong Li
  • Hong Gao
  • Siyao Cheng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9642)


With the development of the Internet, data continue to grow in size while declining in quality. Many kinds of data problems arise during data collection, among which incompleteness is one of the most serious. Existing methods for missing-value imputation rely mostly on statistics and machine learning. These methods are limited in efficiency and accuracy because of high-dimensional computation and the low quality of the initial data. In this paper, we propose a new method that combines a Bayesian network with crowdsourcing to impute missing values. We use the Bayesian network to infer missing values, which improves efficiency, and we use crowdsourcing to obtain the additional information needed to improve accuracy. Experiments on real datasets show that our method performs better than other imputation methods.
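The division of labour described in the abstract can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not specify. It assumes a fixed network structure in which the attribute to impute has known parent attributes, it estimates the conditional probability table by simple counting, and it falls back to a crowd query whenever the most probable value is not confident enough. The names `learn_cpt`, `impute`, `ask_crowd`, and `threshold`, as well as the toy data, are hypothetical.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, target, parents):
    """Estimate P(target | parents) by counting rows where all of them are observed."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is None or any(row[p] is None for p in parents):
            continue
        key = tuple(row[p] for p in parents)
        counts[key][row[target]] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }


def impute(rows, target, parents, ask_crowd, threshold=0.8):
    """Fill missing target values from the CPT; ask the crowd when unsure.

    ask_crowd is a hypothetical callback standing in for a real
    crowdsourcing task; threshold is an assumed confidence cut-off.
    """
    cpt = learn_cpt(rows, target, parents)
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[target] is None:
            key = tuple(row[p] for p in parents)
            dist = cpt.get(key, {})
            best, prob = max(dist.items(), key=lambda kv: kv[1], default=(None, 0.0))
            row[target] = best if prob >= threshold else ask_crowd(row, target)
        filled.append(row)
    return filled


# Toy data (hypothetical): "country" depends on "city".
rows = [
    {"city": "Harbin", "country": "China"},
    {"city": "Harbin", "country": "China"},
    {"city": "Paris", "country": "France"},
    {"city": "Paris", "country": "USA"},
    {"city": "Harbin", "country": None},
    {"city": "Paris", "country": None},
]

asked = []  # records which rows were sent to the (simulated) crowd


def ask_crowd(row, attribute):
    asked.append(row["city"])
    return "France"  # simulated worker answer


filled = impute(rows, "country", ["city"], ask_crowd)
```

In this toy run the Harbin row is filled by inference alone, because every observed Harbin row agrees, while the Paris row is ambiguous in the data and is therefore routed to the simulated crowd. Raising or lowering `threshold` trades crowd cost against the risk of a wrong automatic imputation.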


Missing values, Bayesian network, Crowdsourcing



This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015BAH10F01.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Chen Ye (1)
  • Hongzhi Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)
  • Siyao Cheng (1)
  1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
