Web Data Reconciliation: Models and Experiences

  • Chapter
Search Computing

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7538)

Abstract

An increasing number of web sites offer structured information about recognizable concepts that are relevant to many application domains, such as finance, sports, and commercial products. However, web data is inherently imprecise and uncertain, and different web sources can provide conflicting values. Characterizing the uncertainty of web data is therefore an important problem, and several models have recently been proposed in the literature. This chapter illustrates state-of-the-art Bayesian models for evaluating the quality of data extracted from the Web and reports the results of applying them extensively to real-life web data. Experimental results show that for some applications even simple approaches can provide effective results, while more sophisticated solutions are needed to obtain a more precise characterization of the uncertainty.
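
To make the contrast between simple and more sophisticated approaches concrete, the following is a minimal sketch and not the chapter's actual models. It compares plain majority voting with a naive Bayesian posterior under stated assumptions: sources are independent, each source has a known accuracy, the prior over values is uniform, and a wrong source picks uniformly among n_false incorrect values. The function names, the source names, the accuracy figures, and the n_false parameter are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(claims):
    """Return the value claimed by the largest number of sources.

    claims maps a source name to the value that source reports.
    """
    return Counter(claims.values()).most_common(1)[0][0]


def bayesian_posterior(claims, accuracy, n_false=10):
    """Posterior probability of each claimed value being the true one.

    Assumptions (illustrative only): independent sources, uniform prior,
    and errors spread uniformly over n_false wrong values. The posterior
    is normalized over the claimed values only, which ignores the small
    mass of values that no source reports.
    """
    scores = {}
    for candidate in set(claims.values()):
        likelihood = 1.0
        for source, claimed in claims.items():
            a = accuracy[source]
            # A source is right with probability a; otherwise it picks
            # one specific wrong value with probability (1 - a) / n_false.
            likelihood *= a if claimed == candidate else (1.0 - a) / n_false
        scores[candidate] = likelihood
    total = sum(scores.values())
    return {value: score / total for value, score in scores.items()}


# Hypothetical example: one accurate source disagrees with two weak ones.
claims = {"site_a": "10.5", "site_b": "10.7", "site_c": "10.7"}
accuracy = {"site_a": 0.95, "site_b": 0.4, "site_c": 0.4}

voted = majority_vote(claims)
posterior = bayesian_posterior(claims, accuracy)
best = max(posterior, key=posterior.get)
print("majority vote:", voted)
print("most probable value:", best)
```

In this made-up scenario the vote follows the two weak sources, while the accuracy-aware posterior favors the single reliable one. When source accuracies are similar, the two approaches tend to agree, which is consistent with the observation that simple approaches can be effective for some applications.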




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P. (2012). Web Data Reconciliation: Models and Experiences. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 7538. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34213-4_1

  • DOI: https://doi.org/10.1007/978-3-642-34213-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34212-7

  • Online ISBN: 978-3-642-34213-4

  • eBook Packages: Computer Science, Computer Science (R0)
