
World Wide Web, Volume 17, Issue 5, pp 873–897

A web-based approach to data imputation

  • Zhixu Li
  • Mohamed A. Sharaf
  • Laurianne Sitbon
  • Shazia Sadiq
  • Marta Indulska
  • Xiaofang Zhou
Article

Abstract

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods to formulate web search queries that can retrieve missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, in order to improve the accuracy and efficiency of WebPut. In addition, several optimization techniques are proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
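
The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of the general idea of greedy, confidence-ordered imputation, not WebPut's actual method. Everything named here is a hypothetical stand-in: `estimate` is an assumed confidence estimator (here, the fraction of known attributes in the tuple), `retrieve` replaces a real web search with a toy lookup table, and `threshold` is an arbitrary cut-off.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Cell = Tuple[int, str]  # (row index, attribute name)


def greedy_impute(
    rows: List[Row],
    estimate: Callable[[List[Row], Cell], float],
    retrieve: Callable[[List[Row], Cell], Optional[str]],
    threshold: float = 0.3,
) -> List[Cell]:
    """Fill missing cells in place, most confident first.

    Returns the cells in the order they were filled.
    """
    pending = {(i, a) for i, r in enumerate(rows) for a, v in r.items() if v is None}
    order: List[Cell] = []
    while pending:
        # Re-estimate every round: earlier fills can raise later confidences.
        # Iterating over sorted(pending) makes tie-breaking deterministic.
        best = max(sorted(pending), key=lambda c: estimate(rows, c))
        if estimate(rows, best) < threshold:
            break  # nothing left that is worth querying
        pending.discard(best)
        value = retrieve(rows, best)
        if value is not None:
            rows[best[0]][best[1]] = value
            order.append(best)
    return order


# Hypothetical estimator: more known attributes in a tuple -> higher confidence.
def known_fraction(rows: List[Row], cell: Cell) -> float:
    row = rows[cell[0]]
    return sum(v is not None for v in row.values()) / len(row)


# Toy stand-in for issuing an imputation query to a web search engine.
TOY_WEB = {
    ("UQ", "region"): "Queensland",
    ("UQ", "country"): "Australia",
    ("KAUST", "country"): "Saudi Arabia",
}


def toy_retrieve(rows: List[Row], cell: Cell) -> Optional[str]:
    return TOY_WEB.get((rows[cell[0]]["university"], cell[1]))


rows: List[Row] = [
    {"university": "UQ", "region": None, "country": None},
    {"university": "KAUST", "region": "Thuwal", "country": None},
]
order = greedy_impute(rows, known_fraction, toy_retrieve)
```

In this toy run the second tuple is imputed first because it has more known attributes, and filling one value in the first tuple raises the estimated confidence for its remaining missing value. A real system would replace both callbacks with its own query formulation and confidence estimation.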

Keywords

  • Data imputation
  • WebPut
  • Incomplete data

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Zhixu Li (1, 4)
  • Mohamed A. Sharaf (1)
  • Laurianne Sitbon (1, 3)
  • Shazia Sadiq (1)
  • Marta Indulska (1)
  • Xiaofang Zhou (1, 2)

  1. The University of Queensland, Queensland, Australia
  2. The School of Computer Science and Technology, Soochow University, Jiangsu, China
  3. Queensland University of Technology, Queensland, Australia
  4. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
