WebPut: Efficient Web-Based Data Imputation

  • Conference paper
Web Information Systems Engineering - WISE 2012 (WISE 2012)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7651))

Abstract

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut combines the information available in an incomplete database with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods to formulate web search queries that retrieve missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently draws on this suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the order in which the missing values in a database are imputed, and in turn the issuing of their corresponding imputation queries, to improve the accuracy and efficiency of WebPut. Experiments on several real-world data collections demonstrate that WebPut outperforms existing approaches.
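The abstract gives no algorithmic detail, so the following is only a minimal sketch of what confidence-based query selection combined with greedy scheduling could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the names `best_candidate` and `greedy_impute`, the `min_conf` threshold, and the two stub queries, which return hard-coded answers in place of real web searches.

```python
from typing import Callable, Optional

Row = dict[str, Optional[str]]
# Hypothetical imputation query: given a row and a target attribute, return
# (candidate value, confidence in [0, 1]), or None if it cannot answer.
Query = Callable[[Row, str], Optional[tuple[str, float]]]


def best_candidate(row: Row, attr: str, queries: list[Query]):
    """Run every query and keep the answer with the highest confidence."""
    answers = [q(row, attr) for q in queries]
    answers = [a for a in answers if a is not None]
    return max(answers, key=lambda a: a[1], default=None)


def greedy_impute(rows: list[Row], queries: list[Query], min_conf: float = 0.5):
    """Repeatedly fill the missing cell whose best answer is most confident."""
    while True:
        best = None  # (confidence, row index, attribute, value)
        for i, row in enumerate(rows):
            for attr, value in row.items():
                if value is not None:
                    continue
                cand = best_candidate(row, attr, queries)
                if cand and cand[1] >= min_conf and (best is None or cand[1] > best[0]):
                    best = (cand[1], i, attr, cand[0])
        if best is None:  # nothing left that can be filled confidently
            return rows
        _, i, attr, value = best
        rows[i][attr] = value  # a newly filled value may enable other queries


# Stub queries standing in for web searches (assumed data, for illustration).
def university_to_city(row: Row, attr: str):
    if attr == "city" and row.get("university") == "University of Queensland":
        return ("Brisbane", 0.8)
    return None


def city_to_country(row: Row, attr: str):
    if attr == "country" and row.get("city") == "Brisbane":
        return ("Australia", 0.9)
    return None


table = [{"university": "University of Queensland", "city": None, "country": None}]
greedy_impute(table, [university_to_city, city_to_country])
```

In this toy table the country cannot be imputed until the city is known, which is the kind of dependency an imputation schedule has to respect. The sketch re-runs every query in each round for simplicity; a system that issues real web queries would presumably need to avoid that cost.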

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, Z., Sharaf, M.A., Sitbon, L., Sadiq, S., Indulska, M., Zhou, X. (2012). WebPut: Efficient Web-Based Data Imputation. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_18

  • DOI: https://doi.org/10.1007/978-3-642-35063-4_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35062-7

  • Online ISBN: 978-3-642-35063-4

  • eBook Packages: Computer Science, Computer Science (R0)
