Cost Reduction for Web-Based Data Imputation

  • Zhixu Li
  • Shuo Shang
  • Qing Xie
  • Xiangliang Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8422)


Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent filed value, existing method issues quite a few available imputation queries for each absent value, and then vote on deciding the most probable right answer. As a result, we have to issue a large number of imputation queries for filling all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that probably will fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy.


Web-based Data Imputation Imputation Query Cost Reduction 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)Google Scholar
  2. 2.
    Barnard, J., Rubin, D.: Small-sample degrees of freedom with multiple imputation. Biometrika 86(4), 948–955 (1999)CrossRefzbMATHMathSciNetGoogle Scholar
  3. 3.
    Batista, G., Monard, M.: An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence 17(5-6), 519–533 (2003)CrossRefGoogle Scholar
  4. 4.
    Chaudhuri, S.: What next?: A half-dozen data management research goals for big data and the cloud. In: PODS, pp. 1–4 (2012)Google Scholar
  5. 5.
    Cohen, W., Sarawagi, S.: Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods. In: ACM SIGKDD, pp. 89–98 (2004)Google Scholar
  6. 6.
    Cormode, G., Golab, L., Flip, K., McGregor, A., Srivastava, D., Zhang, X.: Estimating the confidence of conditional functional dependencies. In: SIGMOD, pp. 469–482 (2009)Google Scholar
  7. 7.
    Elmeleegy, H., Madhavan, J., Halevy, A.: Harvesting relational tables from lists on the web. PVLDB 2(1), 1078–1089 (2009)Google Scholar
  8. 8.
    Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: ACL, pp. 363–370 (2005)Google Scholar
  9. 9.
    Gomadam, K., Yeh, P.Z., Verma, K.: Data enrichment using data sources on the web. In: 2012 AAAI Spring Symposium Series (2012)Google Scholar
  10. 10.
    Grzymala-Busse, J., Grzymala-Busse, W., Goodwin, L.: Coping with missing attribute values based on closest fit in preterm birth data: A rough set approach. Computational Intelligence 17(3), 425–434 (2001)CrossRefGoogle Scholar
  11. 11.
    Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)CrossRefGoogle Scholar
  12. 12.
    Gummadi, R., Khulbe, A., Kalavagattu, A., Salvi, S., Kambhampati, S.: Smartint: Using mined attribute dependencies to integrate fragmented web databases. Journal of Intelligent Information Systems, 1–25 (2012)Google Scholar
  13. 13.
    Gupta, R., Sarawagi, S.: Answering table augmentation queries from unstructured lists on the web. PVLDB 2(1), 289–300 (2009)Google Scholar
  14. 14.
    Koutrika, G.: Entity Reconstruction: Putting the pieces of the puzzle back together. HP Labs, Palo Alto, USA (2012)Google Scholar
  15. 15.
    Li, Z., Sharaf, M.A., Sitbon, L., Sadiq, S., Indulska, M., Zhou, X.: Webput: Efficient web-based data imputation. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE 2012. LNCS, vol. 7651, pp. 243–256. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  16. 16.
    Li, Z., Sharaf, M.A., Sitbon, L., Sadiq, S., Indulska, M., Zhou, X.: A web-based approach to data imputation. In: WWW (2013)Google Scholar
  17. 17.
    Lin, W., Yangarber, R., Grishman, R.: Bootstrapped learning of semantic classes from positive and negative examples. In: ICML Workshop on The Continuum from Labeled to Unlabeled Data, vol. 1, p. 21 (2003)Google Scholar
  18. 18.
    Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2(4), 285–318 (1988)Google Scholar
  19. 19.
    Loshin, D.: The Data Quality Business Case: Projecting Return on Investment. Informatica (2006)Google Scholar
  20. 20.
    Pantel, P., Crestan, E., Borkovsky, A., Popescu, A., Vyas, V.: Web-scale distributional similarity and entity set expansion. In: EMNLP, pp. 938–947 (2009)Google Scholar
  21. 21.
    Quinlan, J.: C4. 5: Programs for machine learning. Morgan kaufmann (1993)Google Scholar
  22. 22.
    Wu, C., Wun, C., Chou, H.: Using association rules for completing missing data. In: HIS, pp. 236–241 (2004)Google Scholar
  23. 23.
    Yakout, M., Ganjam, K., Chakrabarti, K., Chaudhuri, S.: Infogather: Entity augmentation and attribute discovery by holistic matching with web tables. In: SIGMOD, pp. 97–108. ACM (2012)Google Scholar
  24. 24.
    Zhang, S.: Shell-neighbor method and its application in missing data imputation. Applied Intelligence 35(1), 123–133 (2011)CrossRefzbMATHGoogle Scholar
  25. 25.
    Zhang, S.: Nearest neighbor selection for iteratively knn imputation. Journal of Systems and Software 85(11), 2541–2552 (2012)CrossRefGoogle Scholar
  26. 26.
    Zhu, X., Zhang, S., Jin, Z., Zhang, Z., Xu, Z.: Missing value estimation for mixed-attribute data sets. IEEE TKDE 23(1), 110–121 (2011)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Zhixu Li
    • 1
  • Shuo Shang
    • 2
  • Qing Xie
    • 1
  • Xiangliang Zhang
    • 1
  1. 1.King Abdullah University of Science and TechnologySaudi Arabia
  2. 2.Department of Software EngineeringChina University of Petroleum-BeijingP.R. China

Personalised recommendations