Web Data Reconciliation: Models and Experiences

  • Lorenzo Blanco
  • Valter Crescenzi
  • Paolo Merialdo
  • Paolo Papotti
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7538)


An increasing number of web sites offer structured information about recognizable concepts that are relevant to many application domains, such as finance, sports, and commercial products. However, web data is inherently imprecise and uncertain, and different web sources can provide conflicting values. Characterizing the uncertainty of web data is an important issue, and several models have recently been proposed in the literature. This chapter illustrates state-of-the-art Bayesian models for evaluating the quality of data extracted from the Web and reports the results of an extensive application of the models to real-life web data. Experimental results show that for some applications even simple approaches can provide effective results, while more sophisticated solutions are needed to obtain a more precise characterization of the uncertainty.


Video Game; Probability Distribution Function; Soccer Player; Probability Concentration; Accuracy Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Lorenzo Blanco (1)
  • Valter Crescenzi (1)
  • Paolo Merialdo (1)
  • Paolo Papotti (2)
  1. Università degli Studi Roma Tre, Italy
  2. Qatar Computing Research Institute, Qatar
